Building a Web Crawler or Web Bot using Rust

Web crawlers, also known as spiders or bots, are automated programs that systematically browse the World Wide Web to collect information. In this article, we’ll explore how to implement a basic web crawler in Rust, leveraging the language’s performance and safety features.

Why Use Rust for Web Crawling?

Rust is an excellent choice for building web crawlers due to its:

  1. Performance: Rust’s zero-cost abstractions and efficient memory management make it ideal for handling large-scale crawling tasks.
  2. Concurrency: Rust’s built-in support for safe concurrency allows for efficient parallel crawling.
  3. Safety: Rust’s strict type system and ownership model help prevent common programming errors.
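
To give a flavor of point 2, here is a minimal standalone sketch (separate from the crawler we build below) that fetches a few pages on their own threads using the same blocking reqwest client we rely on later; the URLs are just placeholders:

use reqwest::blocking::Client;
use std::thread;

fn main() {
    let urls = vec![
        "https://example.com".to_string(),
        "https://www.rust-lang.org".to_string(),
    ];

    // Spawn one thread per URL; each thread owns its URL string.
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| {
            thread::spawn(move || {
                let client = Client::new();
                match client.get(url.as_str()).send() {
                    Ok(response) => println!("{} -> {}", url, response.status()),
                    Err(e) => println!("{} -> error: {}", url, e),
                }
            })
        })
        .collect();

    // Wait for every fetch to finish.
    for handle in handles {
        let _ = handle.join();
    }
}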

Prerequisites

Before we begin, make sure you have Rust installed on your system. You’ll also need to add the following dependencies to your Cargo.toml file:

[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.12"
url = "2.2"

Implementing the Web Crawler

Let’s break down the implementation of our web crawler into several key components:

1. Setting Up the Project

First, create a new Rust project:

cargo new rust_web_crawler
cd rust_web_crawler

2. The Main Crawler Structure

We’ll start by defining our Crawler struct:

use std::collections::HashSet;
use url::Url;

struct Crawler {
    base_url: Url,
    visited_urls: HashSet<String>,
    to_visit: Vec<String>,
}

impl Crawler {
    fn new(start_url: &str) -> Result<Self, url::ParseError> {
        let base_url = Url::parse(start_url)?;
        Ok(Crawler {
            base_url,
            visited_urls: HashSet::new(),
            to_visit: vec![start_url.to_string()],
        })
    }
}
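
Since Crawler::new only parses the start URL, a malformed URL is reported as a url::ParseError instead of crashing the program. A small illustrative check:

fn main() {
    // A well-formed URL gives us a ready-to-use crawler...
    assert!(Crawler::new("https://example.com").is_ok());
    // ...while a malformed one surfaces as a url::ParseError.
    assert!(Crawler::new("not a url").is_err());
}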

3. Fetching Web Pages

Next, we’ll implement a method to fetch web pages:

use reqwest::blocking::Client;

impl Crawler {
    fn fetch_url(&self, url: &str) -> Result<String, reqwest::Error> {
        let client = Client::new();
        let body = client.get(url).send()?.text()?;
        Ok(body)
    }
}
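
Creating a new Client for every request works, but in practice you would normally build one configured client and reuse it. Here is a minimal sketch; the user-agent string and the ten-second timeout are illustrative choices, not requirements of reqwest:

use reqwest::blocking::Client;
use std::time::Duration;

// Build a single client with a descriptive user agent and a request timeout,
// then reuse it for every fetch instead of constructing one per request.
fn build_client() -> Result<Client, reqwest::Error> {
    Client::builder()
        .user_agent("rust_web_crawler/0.1")
        .timeout(Duration::from_secs(10))
        .build()
}

The enhanced version later in this article stores a Client in the Crawler struct for exactly this reason.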

4. Parsing HTML and Extracting Links

We’ll use the scraper crate to parse HTML and extract links:

use scraper::{Html, Selector};

impl Crawler {
    fn extract_links(&self, body: &str) -> Vec<String> {
        let document = Html::parse_document(body);
        let selector = Selector::parse("a").unwrap();
        document
            .select(&selector)
            .filter_map(|element| element.value().attr("href"))
            .filter_map(|href| self.base_url.join(href).ok())
            .map(|url| url.to_string())
            .collect()
    }
}
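
The call to self.base_url.join(href) is what turns relative hrefs into absolute URLs. A small standalone example of how the url crate resolves links against a base:

use url::Url;

fn main() -> Result<(), url::ParseError> {
    let base = Url::parse("https://example.com/blog/")?;
    // Relative links are resolved against the base URL...
    assert_eq!(base.join("post-1")?.as_str(), "https://example.com/blog/post-1");
    assert_eq!(base.join("/about")?.as_str(), "https://example.com/about");
    // ...while absolute links are kept as they are.
    assert_eq!(base.join("https://other.org/page")?.as_str(), "https://other.org/page");
    Ok(())
}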

5. The Crawling Loop

Now, let’s implement the main crawling logic:

impl Crawler {
    fn crawl(&mut self) {
        while let Some(url) = self.to_visit.pop() {
            if self.visited_urls.contains(&url) {
                continue;
            }

            println!("Crawling: {}", url);

            match self.fetch_url(&url) {
                Ok(body) => {
                    self.visited_urls.insert(url.clone());
                    let new_links = self.extract_links(&body);
                    for link in new_links {
                        if !self.visited_urls.contains(&link) {
                            self.to_visit.push(link);
                        }
                    }
                }
                Err(e) => println!("Error fetching {}: {}", url, e),
            }
        }
    }
}
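
Note that this loop runs until to_visit is empty, and because extract_links does not filter by host, the crawler can wander far beyond the starting site. A hypothetical variant that stops after a fixed number of pages could look like the following; crawl_limited and max_pages are illustrative names, not part of the code above:

impl Crawler {
    // Stop once we have visited `max_pages` pages.
    fn crawl_limited(&mut self, max_pages: usize) {
        while let Some(url) = self.to_visit.pop() {
            if self.visited_urls.len() >= max_pages {
                break;
            }
            if self.visited_urls.contains(&url) {
                continue;
            }

            println!("Crawling: {}", url);

            match self.fetch_url(&url) {
                Ok(body) => {
                    self.visited_urls.insert(url.clone());
                    for link in self.extract_links(&body) {
                        if !self.visited_urls.contains(&link) {
                            self.to_visit.push(link);
                        }
                    }
                }
                Err(e) => println!("Error fetching {}: {}", url, e),
            }
        }
    }
}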

6. Putting It All Together

Finally, let’s use our crawler in the main function:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut crawler = Crawler::new("https://example.com")?;
    crawler.crawl();
    Ok(())
}
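
You can now run the crawler from the project directory:

cargo run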

To build a more robust crawler, we also need to respect each site’s robots.txt rules and save the crawled data to disk. Let’s add both.
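
The robots.txt check below relies on the robotstxt crate, so add it to Cargo.toml next to the earlier dependencies (the version shown is indicative; use whatever is current):

robotstxt = "0.3"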

use reqwest::blocking::Client;
use scraper::{Html, Selector};
use std::collections::HashSet;
use std::fs::File;
use std::io::Write;
use url::Url;
use robotstxt::DefaultMatcher;

struct Crawler {
    base_url: Url,
    visited_urls: HashSet<String>,
    to_visit: Vec<String>,
    client: Client,
    robots_content: String,
}

impl Crawler {
    fn new(start_url: &str) -> Result<Self, Box<dyn std::error::Error>> {
        let base_url = Url::parse(start_url)?;
        let client = Client::new();

        // Fetch the site's robots.txt once up front so we can consult its rules later
        let robots_url = base_url.join("/robots.txt")?;
        let robots_content = client.get(robots_url).send()?.text()?;

        Ok(Crawler {
            base_url,
            visited_urls: HashSet::new(),
            to_visit: vec![start_url.to_string()],
            client,
            robots_content,
        })
    }

    fn fetch_url(&self, url: &str) -> Result<String, reqwest::Error> {
        self.client.get(url).send()?.text()
    }

    fn extract_links(&self, body: &str) -> Vec<String> {
        let document = Html::parse_document(body);
        let selector = Selector::parse("a").unwrap();
        document
            .select(&selector)
            .filter_map(|element| element.value().attr("href"))
            .filter_map(|href| self.base_url.join(href).ok())
            .map(|url| url.to_string())
            .collect()
    }

    fn can_fetch(&self, url: &str) -> bool {
        // Check the stored robots.txt rules for the wildcard ("*") user agent
        let mut matcher = DefaultMatcher::default();
        matcher.one_agent_allowed_by_robots(&self.robots_content, "*", url)
    }

    fn save_data(&self, url: &str, title: &str, content: &str) -> std::io::Result<()> {
        // Replace characters that are not valid in file names on every platform
        let filename = format!("crawled_data/{}.txt", url.replace('/', "_").replace(':', "_"));
        let mut file = File::create(filename)?;
        writeln!(file, "URL: {}", url)?;
        writeln!(file, "Title: {}", title)?;
        writeln!(file, "Content: {}", content)?;
        Ok(())
    }

    fn crawl(&mut self) {
        while let Some(url) = self.to_visit.pop() {
            if self.visited_urls.contains(&url) || !self.can_fetch(&url) {
                continue;
            }

            println!("Crawling: {}", url);

            match self.fetch_url(&url) {
                Ok(body) => {
                    self.visited_urls.insert(url.clone());
                    let document = Html::parse_document(&body);
                    
                    // Extract title
                    let title = document
                        .select(&Selector::parse("title").unwrap())
                        .next()
                        .map(|t| t.text().collect::<String>())
                        .unwrap_or_else(|| "No title".to_string());

                    // Extract content (simplified: just getting all paragraph text)
                    let content = document
                        .select(&Selector::parse("p").unwrap())
                        .map(|p| p.text().collect::<String>())
                        .collect::<Vec<String>>()
                        .join("\n");

                    // Save the data
                    if let Err(e) = self.save_data(&url, &title, &content) {
                        println!("Error saving data for {}: {}", url, e);
                    }

                    let new_links = self.extract_links(&body);
                    for link in new_links {
                        if !self.visited_urls.contains(&link) {
                            self.to_visit.push(link);
                        }
                    }
                }
                Err(e) => println!("Error fetching {}: {}", url, e),
            }
        }
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    std::fs::create_dir_all("crawled_data")?;
    let mut crawler = Crawler::new("https://example.com")?;
    crawler.crawl();
    Ok(())
}

Our enhanced version includes the following additions:

  1. Respecting robots.txt:
    • We’ve added the robotstxt crate to evaluate robots.txt rules.
    • The Crawler::new method now fetches the site’s robots.txt file once up front and stores its contents.
    • The can_fetch method uses the crate’s DefaultMatcher to check whether a URL is allowed to be crawled.
  2. Saving crawled data:
    • The save_data method writes the crawled data to files.
    • We extract the title and a simplified version of the content (all paragraph text).
    • Each crawled page is saved as a separate file in a crawled_data directory.
  3. Error handling:
    • Errors while fetching or saving a page are reported, and the crawl continues with the next URL.

Conclusion

This web crawler demonstrates the fundamental concepts of web crawling in Rust: it fetches web pages, extracts links, respects robots.txt, and saves what it finds. However, for a production-ready crawler, you’d want to consider additional features such as:

  1. Implementing rate limiting to avoid overloading servers (a minimal sketch follows this list)
  2. Handling different content types, not just HTML
  3. Storing crawled data in a database instead of individual files
  4. Implementing more sophisticated concurrency patterns for parallel crawling
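
As an example of point 1, the simplest form of rate limiting is to pause between requests. A minimal sketch; the one-second delay is an arbitrary, polite default that you would apply at the end of each crawl-loop iteration:

use std::thread;
use std::time::Duration;

// Pause between requests so the crawler does not overload the server.
fn polite_delay() {
    thread::sleep(Duration::from_secs(1));
}

fn main() {
    for url in ["https://example.com/a", "https://example.com/b"] {
        println!("Would fetch: {}", url);
        polite_delay();
    }
}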

Rust’s performance and safety features make it an excellent choice for building robust and efficient web crawlers. As you expand on this basic implementation, you’ll find that Rust provides the tools and ecosystem support to create powerful web crawling solutions.

Happy Coding

Jesus Saves

By JCharisAI
