Web crawlers, also known as spiders or bots, are automated programs that systematically browse the World Wide Web to collect information. In this article, we’ll explore how to implement a basic web crawler in Rust, leveraging the language’s performance and safety features.
Why Use Rust for Web Crawling?
Rust is an excellent choice for building web crawlers due to its:
- Performance: Rust’s zero-cost abstractions and efficient memory management make it ideal for handling large-scale crawling tasks.
- Concurrency: Rust’s built-in support for safe concurrency allows for efficient parallel crawling.
- Safety: Rust’s strict type system and ownership model help prevent common programming errors.

Prerequisites
Before we begin, make sure you have Rust installed on your system. You’ll also need to add the following dependencies to your Cargo.toml file:
[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.12"
url = "2.2"
Implementing the Web Crawler
Let’s break down the implementation of our web crawler into several key components:
1. Setting Up the Project
First, create a new Rust project:
cargo new rust_web_crawler
cd rust_web_crawler
2. The Main Crawler Structure
We’ll start by defining our Crawler struct:
use std::collections::HashSet;
use url::Url;

struct Crawler {
    base_url: Url,                 // starting point, used to resolve relative links
    visited_urls: HashSet<String>, // pages we have already crawled
    to_visit: Vec<String>,         // frontier of pages still to crawl
}

impl Crawler {
    fn new(start_url: &str) -> Result<Self, url::ParseError> {
        let base_url = Url::parse(start_url)?;
        Ok(Crawler {
            base_url,
            visited_urls: HashSet::new(),
            to_visit: vec![start_url.to_string()],
        })
    }
}
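As a quick check on error handling, here’s a small usage sketch (not from the article) showing that an invalid start URL surfaces as a url::ParseError rather than a panic; the bad input string is purely illustrative.

fn main() {
    // Hypothetical usage of Crawler::new with a deliberately bad URL.
    match Crawler::new("not a valid url") {
        Ok(_) => println!("crawler ready"),
        Err(e) => println!("invalid start URL: {}", e),
    }
}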
3. Fetching Web Pages
Next, we’ll implement a method to fetch web pages:
use reqwest::blocking::Client;

impl Crawler {
    fn fetch_url(&self, url: &str) -> Result<String, reqwest::Error> {
        let client = Client::new();
        let body = client.get(url).send()?.text()?;
        Ok(body)
    }
}
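If you want slightly more defensive fetching, here’s a minimal sketch of an alternative (not part of the article’s crawler) that sets a User-Agent, a request timeout, and treats non-2xx responses as errors; the agent string and the 10-second timeout are arbitrary illustrative values.

use reqwest::blocking::Client;
use std::time::Duration;

// Hypothetical variant of fetch_url with a user agent, a timeout,
// and an explicit error on non-success status codes.
fn fetch_url_checked(url: &str) -> Result<String, reqwest::Error> {
    let client = Client::builder()
        .user_agent("rust_web_crawler/0.1") // illustrative agent string
        .timeout(Duration::from_secs(10))   // illustrative timeout
        .build()?;
    let body = client.get(url).send()?.error_for_status()?.text()?;
    Ok(body)
}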
4. Parsing HTML and Extracting Links
We’ll use the scraper crate to parse HTML and extract links:
use scraper::{Html, Selector};

impl Crawler {
    fn extract_links(&self, body: &str) -> Vec<String> {
        let document = Html::parse_document(body);
        let selector = Selector::parse("a").unwrap();
        document
            .select(&selector)
            // Keep only anchors that actually carry an href attribute.
            .filter_map(|element| element.value().attr("href"))
            // Resolve relative links against the base URL, dropping any that fail to parse.
            .filter_map(|href| self.base_url.join(href).ok())
            .map(|url| url.to_string())
            .collect()
    }
}
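Note that extract_links keeps every link it finds, including ones pointing at other domains. If you want the crawl to stay on the starting site, a minimal same-host filter could look like the sketch below (not part of the article’s code), assuming you compare against the crawler’s base_url.

use url::Url;

// Hypothetical helper: keep only links whose host matches the base URL's host.
fn is_same_host(base_url: &Url, candidate: &str) -> bool {
    Url::parse(candidate)
        .map(|url| url.host_str() == base_url.host_str())
        .unwrap_or(false)
}

You could apply it with an extra .filter(|link| is_same_host(&self.base_url, link)) step before collecting the links.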
5. The Crawling Loop
Now, let’s implement the main crawling logic:
impl Crawler {
    fn crawl(&mut self) {
        while let Some(url) = self.to_visit.pop() {
            if self.visited_urls.contains(&url) {
                continue;
            }
            println!("Crawling: {}", url);
            match self.fetch_url(&url) {
                Ok(body) => {
                    self.visited_urls.insert(url.clone());
                    let new_links = self.extract_links(&body);
                    for link in new_links {
                        if !self.visited_urls.contains(&link) {
                            self.to_visit.push(link);
                        }
                    }
                }
                Err(e) => println!("Error fetching {}: {}", url, e),
            }
        }
    }
}
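One detail worth noting: because to_visit is a Vec and we pop from the back, the crawl explores links depth-first. If you’d rather visit pages in the order they are discovered (breadth-first), a small sketch of the idea, assuming the frontier is switched to a VecDeque, is shown below.

use std::collections::VecDeque;

fn main() {
    // Illustrative frontier only; in the crawler you would store the VecDeque
    // in the Crawler struct and push_back newly discovered links.
    let mut frontier: VecDeque<String> = VecDeque::new();
    frontier.push_back("https://example.com/a".to_string());
    frontier.push_back("https://example.com/b".to_string());
    // pop_front yields /a before /b, i.e. pages are visited in discovery order.
    while let Some(url) = frontier.pop_front() {
        println!("would crawl: {}", url);
    }
}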
6. Putting It All Together
Finally, let’s use our crawler in the main function:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut crawler = Crawler::new("https://example.com")?;
    crawler.crawl();
    Ok(())
}
To build a more robust crawler, we’ll add two features: respecting robots.txt and saving the crawled data to disk. For the robots.txt parsing you’ll also need to add the robotstxt crate to your Cargo.toml. Let’s add that.
use reqwest::blocking::Client;
use scraper::{Html, Selector};
use std::collections::HashSet;
use std::fs::File;
use std::io::Write;
use url::Url;
use robotstxt::RobotFileParser;

struct Crawler {
    base_url: Url,
    visited_urls: HashSet<String>,
    to_visit: Vec<String>,
    client: Client,
    robots_parser: RobotFileParser,
}

impl Crawler {
    fn new(start_url: &str) -> Result<Self, Box<dyn std::error::Error>> {
        let base_url = Url::parse(start_url)?;
        let client = Client::new();
        // Set up robots.txt parser (clone the URL so we can still reference it below)
        let robots_url = base_url.join("/robots.txt")?;
        let robots_content = client.get(robots_url.clone()).send()?.text()?;
        let mut robots_parser = RobotFileParser::new(robots_url.as_str());
        robots_parser.parse(&robots_content);
        Ok(Crawler {
            base_url,
            visited_urls: HashSet::new(),
            to_visit: vec![start_url.to_string()],
            client,
            robots_parser,
        })
    }

    fn fetch_url(&self, url: &str) -> Result<String, reqwest::Error> {
        self.client.get(url).send()?.text()
    }

    fn extract_links(&self, body: &str) -> Vec<String> {
        let document = Html::parse_document(body);
        let selector = Selector::parse("a").unwrap();
        document
            .select(&selector)
            .filter_map(|element| element.value().attr("href"))
            .filter_map(|href| self.base_url.join(href).ok())
            .map(|url| url.to_string())
            .collect()
    }

    fn can_fetch(&self, url: &str) -> bool {
        // Check the parsed robots.txt rules for the wildcard user agent.
        self.robots_parser.can_fetch("*", url)
    }

    fn save_data(&self, url: &str, title: &str, content: &str) -> std::io::Result<()> {
        let filename = format!("crawled_data/{}.txt", url.replace("/", "_"));
        let mut file = File::create(filename)?;
        writeln!(file, "URL: {}", url)?;
        writeln!(file, "Title: {}", title)?;
        writeln!(file, "Content: {}", content)?;
        Ok(())
    }

    fn crawl(&mut self) {
        while let Some(url) = self.to_visit.pop() {
            // Skip pages we've already seen or that robots.txt disallows.
            if self.visited_urls.contains(&url) || !self.can_fetch(&url) {
                continue;
            }
            println!("Crawling: {}", url);
            match self.fetch_url(&url) {
                Ok(body) => {
                    self.visited_urls.insert(url.clone());
                    let document = Html::parse_document(&body);
                    // Extract title
                    let title = document
                        .select(&Selector::parse("title").unwrap())
                        .next()
                        .map(|t| t.text().collect::<String>())
                        .unwrap_or_else(|| "No title".to_string());
                    // Extract content (simplified: just getting all paragraph text)
                    let content = document
                        .select(&Selector::parse("p").unwrap())
                        .map(|p| p.text().collect::<String>())
                        .collect::<Vec<String>>()
                        .join("\n");
                    // Save the data
                    if let Err(e) = self.save_data(&url, &title, &content) {
                        println!("Error saving data for {}: {}", url, e);
                    }
                    let new_links = self.extract_links(&body);
                    for link in new_links {
                        if !self.visited_urls.contains(&link) {
                            self.to_visit.push(link);
                        }
                    }
                }
                Err(e) => println!("Error fetching {}: {}", url, e),
            }
        }
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Make sure the output directory exists before crawling.
    std::fs::create_dir_all("crawled_data")?;
    let mut crawler = Crawler::new("https://example.com")?;
    crawler.crawl();
    Ok(())
}
Our enhanced version includes the following additions:
- Respecting robots.txt:
  - We’ve added the robotstxt crate to parse robots.txt files.
  - The Crawler::new method now fetches and parses the robots.txt file.
  - A can_fetch method checks if a URL is allowed to be crawled.
- Saving crawled data:
  - A save_data method writes the crawled data to files (see the filename note below).
  - We extract the title and a simplified version of the content (all paragraph text).
  - Each crawled page is saved as a separate file in a crawled_data directory.
- Error handling:
  - We’ve improved error handling throughout the code.
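One caveat on the save_data filenames: replacing only "/" still leaves characters such as ":" and "?" in the name, which some filesystems reject. Below is a minimal sanitizer sketch; the function name and the set of characters kept are illustrative choices, not part of the article’s code.

// Hypothetical helper: keep ASCII alphanumerics plus '.', '-' and '_',
// and replace everything else with '_', so any URL maps to a portable filename.
fn sanitize_filename(url: &str) -> String {
    url.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '.' || c == '-' || c == '_' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

You could then build the path with format!("crawled_data/{}.txt", sanitize_filename(url)).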
Conclusion
This web crawler demonstrates the fundamental concepts of web crawling in Rust. It fetches web pages, extracts links, and manages the crawling process. However, for a production-ready crawler, you’d want to consider additional features such as:
- Respecting robots.txt files (which the enhanced version above already does)
- Implementing rate limiting to avoid overloading servers (see the sketch below)
- Handling different content types
- Storing crawled data in a database
- Implementing more sophisticated concurrency patterns
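As an example of the rate-limiting point, here is a minimal sketch: pause before each request so a host is never hit more than roughly once per second. The one-second delay and the helper’s name are illustrative; a production crawler would typically keep a per-host schedule and honor any Crawl-delay directive.

use std::thread;
use std::time::Duration;

// Hypothetical polite fetch: sleep before each request to throttle the crawl.
fn fetch_politely(client: &reqwest::blocking::Client, url: &str) -> Result<String, reqwest::Error> {
    thread::sleep(Duration::from_secs(1)); // illustrative fixed delay
    client.get(url).send()?.text()
}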
Rust’s performance and safety features make it an excellent choice for building robust and efficient web crawlers. As you expand on this basic implementation, you’ll find that Rust provides the tools and ecosystem support to create powerful web crawling solutions.
Happy Coding
Jesus Saves
By JCharisAI