Building Web Crawlers In Python

Web crawlers, also known as spiders or bots, are automated programs that systematically browse the World Wide Web to collect information. They play a crucial role in various applications, from search engine indexing to data mining and web archiving. In this article, we’ll explore what web crawlers are, how they work, and how to build a basic crawler in Python. We’ll also discuss the importance of respecting robots.txt files when crawling websites.

What is a Web Crawler?

A web crawler is a software program that automatically navigates through web pages, following links to discover new content. The primary functions of a web crawler include:

  1. Discovering web pages
  2. Downloading page content
  3. Extracting links and other relevant information
  4. Indexing the collected data for later use

Web crawlers are essential components of search engines like Google, Bing, and DuckDuckGo. They help these services maintain up-to-date indexes of the web, enabling users to find relevant information quickly.

How Web Crawlers Work

The basic process of web crawling involves the following steps:

  1. Start with a list of seed URLs
  2. Fetch the web page content for each URL
  3. Parse the HTML to extract links and other relevant data
  4. Add new, unvisited links to the crawl queue
  5. Repeat steps 2-4 until the crawl is complete or a stopping condition is met
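
In code, this loop boils down to a queue of URLs to visit plus a set of URLs already seen. Here is a minimal sketch of the idea, using requests and BeautifulSoup and capped at a handful of pages just to keep the demo finite; the crawler class later in the article fleshes it out properly:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

to_visit = ['https://example.com']        # step 1: start with a list of seed URLs
visited = set()

while to_visit and len(visited) < 10:     # step 5: repeat until done (capped for the demo)
    url = to_visit.pop(0)
    if url in visited:
        continue
    visited.add(url)
    html = requests.get(url, timeout=10).text      # step 2: fetch the page content
    soup = BeautifulSoup(html, 'html.parser')      # step 3: parse the HTML
    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])             # resolve relative links
        if link not in visited:                    # step 4: queue new, unvisited links
            to_visit.append(link)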

Web crawlers must handle various challenges, including:

  • Respecting website policies (robots.txt)
  • Managing crawl depth and breadth
  • Handling different content types and structures
  • Avoiding duplicate content and infinite loops (see the URL normalization sketch after this list)
  • Implementing polite crawling practices to avoid overloading servers
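
For example, the same page is often reachable through several slightly different URLs (uppercase hostnames, trailing slashes, #fragments), so many crawlers normalize each URL before checking it against the visited set. Here is one possible normalize_url helper, a small sketch using only the standard library (it is not part of the crawlers shown later in this article):

from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Collapse trivially different forms of a URL so duplicates are detected."""
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()                # hostnames are case-insensitive
    path = parts.path.rstrip('/') or '/'         # treat /about and /about/ as the same page
    # Drop the #fragment (it never changes the document); keep the query string.
    return urlunparse((scheme, netloc, path, parts.params, parts.query, ''))

print(normalize_url('https://Example.com/about/#team'))  # https://example.com/about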

Building a Basic Web Crawler in Python

Let’s create a simple web crawler using Python. We’ll use the requests library for making HTTP requests and BeautifulSoup for parsing HTML.

First, install the required libraries:

pip install requests beautifulsoup4

Now, let’s write our basic crawler:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

class WebCrawler:
    def __init__(self, seed_url, max_depth=3):
        self.seed_url = seed_url
        self.max_depth = max_depth
        self.visited_urls = set()
        self.to_visit = [(seed_url, 0)]

    def crawl(self):
        while self.to_visit:
            url, depth = self.to_visit.pop(0)
            if depth > self.max_depth or url in self.visited_urls:
                continue

            try:
                response = requests.get(url, timeout=10)  # a timeout keeps the crawler from hanging on slow servers
                self.visited_urls.add(url)
                print(f"Crawling: {url}")

                soup = BeautifulSoup(response.text, 'html.parser')
                self.process_page(url, soup)

                if depth < self.max_depth:
                    self.add_links_to_visit(url, soup, depth)

                time.sleep(1)  # Be polite, wait between requests
            except Exception as e:
                print(f"Error crawling {url}: {e}")

    def process_page(self, url, soup):
        # Extract and process data from the page
        title = soup.title.string if soup.title else "No title"
        print(f"Title: {title}")

    def add_links_to_visit(self, base_url, soup, current_depth):
        for link in soup.find_all('a', href=True):
            full_url = urljoin(base_url, link['href'])
            if full_url not in self.visited_urls:
                self.to_visit.append((full_url, current_depth + 1))

if __name__ == "__main__":
    crawler = WebCrawler("https://example.com")
    crawler.crawl()

This basic crawler starts from a seed URL, visits pages up to a specified depth, and prints the title of each page it crawls. It also implements a simple politeness policy by waiting one second between requests.
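
One limitation worth noting: because add_links_to_visit queues every link it finds, the crawl quickly wanders off to other websites. If you only want to crawl the seed site, one option is to subclass the crawler and compare domains with urlparse. The SameDomainCrawler below is just a sketch of that idea, not part of the original class:

from urllib.parse import urljoin, urlparse

class SameDomainCrawler(WebCrawler):
    """Hypothetical variant of the crawler above that never leaves the seed domain."""

    def add_links_to_visit(self, base_url, soup, current_depth):
        seed_domain = urlparse(self.seed_url).netloc
        for link in soup.find_all('a', href=True):
            full_url = urljoin(base_url, link['href'])
            # Only queue links that point back to the seed domain.
            if urlparse(full_url).netloc == seed_domain and full_url not in self.visited_urls:
                self.to_visit.append((full_url, current_depth + 1))

# Usage: SameDomainCrawler('https://example.com', max_depth=2).crawl()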

The above is a basic web crawler, but instead of only printing results we also want to save them somewhere. Let's add that feature, along with a few other improvements:

from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
import csv
import asyncio
import aiohttp

class WebCrawler:
    def __init__(self, start_url, delay=1, user_agent='CustomCrawler/1.0'):
        self.start_url = start_url
        self.delay = delay
        self.user_agent = user_agent
        self.visited_urls = set()
        self.urls_to_visit = [start_url]
        self.rp = RobotFileParser()
        self.rp.set_url(urljoin(start_url, '/robots.txt'))
        self.rp.read()

    async def download_url(self, url):
        headers = {'User-Agent': self.user_agent}
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers=headers) as response:
                return await response.text()

    def get_linked_urls(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            # Resolve relative links against the current page and keep only HTTP(S) URLs.
            full_url = urljoin(url, link['href'])
            if full_url.startswith(('http://', 'https://')):
                yield full_url

    def add_url_to_visit(self, url):
        if url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def can_fetch(self, url):
        return self.rp.can_fetch(self.user_agent, url)

    async def crawl(self, url):
        if not self.can_fetch(url):
            print(f'Robots.txt disallows crawling: {url}')
            return

        try:
            html = await self.download_url(url)
            self.extract_data(url, html)
            for linked_url in self.get_linked_urls(url, html):
                self.add_url_to_visit(linked_url)
        except Exception as e:
            print(f"Error crawling {url}: {e}")
        finally:
            self.visited_urls.add(url)

    def extract_data(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.title.string if soup.title else 'No title'
        paragraphs = [p.text for p in soup.find_all('p')]
        self.save_data([url, title, '\n'.join(paragraphs)])

    def save_data(self, data, filename='output.csv'):
        with open(filename, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(data)

    async def run(self):
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            print(f'Crawling: {url}')
            await self.crawl(url)
            await asyncio.sleep(self.delay)

if __name__ == '__main__':
    async def main():
        crawler = WebCrawler(start_url='https://example.com')
        await crawler.run()

    asyncio.run(main())

This new web crawler adds the following features:

  • Asynchronous crawling using asyncio and aiohttp for improved performance (a fully concurrent variant is sketched after this list).
  • Respects robots.txt rules.
  • Uses a custom user agent.
  • Implements rate limiting with a configurable delay between requests.
  • Extracts basic data (title and paragraphs) from each page.
  • Saves extracted data to a CSV file.
  • Handles errors and exceptions gracefully.
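
One design note: although the crawler above uses asyncio and aiohttp, its run() loop still awaits each page one at a time, so it is not yet truly concurrent. If you want to download several pages at once, one approach is to gather a batch of tasks and cap them with a semaphore. The standalone sketch below (the fetch and fetch_all helpers are just illustrative names, not part of the class above) shows the idea:

import asyncio
import aiohttp

CONCURRENCY = 5  # maximum number of simultaneous requests

async def fetch(session, semaphore, url):
    # The semaphore caps how many downloads run at once, keeping the crawler polite.
    async with semaphore:
        async with session.get(url) as response:
            return url, await response.text()

async def fetch_all(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == '__main__':
    results = asyncio.run(fetch_all(['https://example.com', 'https://www.python.org']))
    for result in results:
        if not isinstance(result, Exception):
            url, html = result
            print(url, len(html))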

Respecting robots.txt

The robots.txt file is a standard used by websites to communicate with web crawlers about which parts of the site should not be accessed or indexed. It’s crucial for ethical web crawling to respect these guidelines.
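
As an illustration, a hypothetical robots.txt might look like the following: it keeps all crawlers out of /admin/ and /private/, asks for a two-second pause between requests (via the non-standard but commonly honored Crawl-delay directive), and bans one bot entirely:

User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

User-agent: BadBot
Disallow: /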

To implement robots.txt parsing in our crawler, we can use the robotparser module from Python’s standard library:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
import time
import csv

class RespectfulWebCrawler:
    def __init__(self, seed_url, max_depth=3, delay=1, user_agent='RespectfulCrawler/1.0'):
        self.seed_url = seed_url
        self.max_depth = max_depth
        self.delay = delay
        self.user_agent = user_agent
        self.visited_urls = set()
        self.to_visit = [(seed_url, 0)]
        self.rp = RobotFileParser()
        self.rp.set_url(urljoin(seed_url, "/robots.txt"))
        self.rp.read()

    def can_fetch(self, url):
        return self.rp.can_fetch(self.user_agent, url)

    def download_url(self, url):
        headers = {'User-Agent': self.user_agent}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.text

    def get_links(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            full_url = urljoin(url, link['href'])
            yield full_url

    def add_url_to_visit(self, url, depth):
        # Stay on the seed domain: we only loaded the seed site's robots.txt,
        # so its rules cannot be applied meaningfully to other hosts.
        if urlparse(url).netloc != urlparse(self.seed_url).netloc:
            return
        if url not in self.visited_urls and (url, depth) not in self.to_visit:
            self.to_visit.append((url, depth))

    def extract_data(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.title.string if soup.title else 'No title'
        paragraphs = [p.text for p in soup.find_all('p')]
        return [url, title, '\n'.join(paragraphs)]

    def save_data(self, data, filename='output.csv'):
        with open(filename, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(data)

    def crawl(self):
        while self.to_visit:
            url, depth = self.to_visit.pop(0)
            if depth > self.max_depth or url in self.visited_urls:
                continue

            if not self.can_fetch(url):
                print(f"Robots.txt disallows crawling: {url}")
                continue

            try:
                print(f"Crawling: {url}")
                html = self.download_url(url)
                self.visited_urls.add(url)

                data = self.extract_data(url, html)
                self.save_data(data)

                if depth < self.max_depth:
                    for link in self.get_links(url, html):
                        self.add_url_to_visit(link, depth + 1)

                time.sleep(self.delay)  # Be polite, wait between requests
            except requests.exceptions.RequestException as e:
                print(f"Error crawling {url}: {e}")
            except Exception as e:
                print(f"Unexpected error crawling {url}: {e}")

if __name__ == "__main__":
    crawler = RespectfulWebCrawler("https://example.com", max_depth=3, delay=2)
    crawler.crawl()

This enhanced version of our crawler checks the robots.txt file before crawling each URL and stays on the seed site's domain, ensuring that we respect the website's crawling policies.
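
If a site publishes a Crawl-delay directive, RobotFileParser (Python 3.6+) exposes it through crawl_delay(), so you could let the site's own preference override the crawler's default delay. Here is one way to do that, as a small usage sketch building on the class above:

crawler = RespectfulWebCrawler('https://example.com', max_depth=3, delay=2)

# Honor the site's Crawl-delay if it specifies one (crawl_delay() returns None otherwise).
site_delay = crawler.rp.crawl_delay(crawler.user_agent)
if site_delay is not None:
    crawler.delay = max(crawler.delay, site_delay)

crawler.crawl()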

Conclusion

Web crawlers are powerful tools for collecting and indexing web data. When building a crawler, it’s important to consider ethical practices, such as respecting robots.txt files and implementing polite crawling techniques. The Python example provided here serves as a starting point for creating more sophisticated crawlers tailored to specific needs.

Remember that while web crawling can be a valuable technique for data collection and analysis, it’s crucial to use these tools responsibly and in compliance with website policies and legal regulations.

  • Happy Coding
  • Jesus Saves @JCharistech
  • By JCharisAI
