I want to create a personal web crawler. As I download each page, I collect a new list of URLs. To increase the likelihood that I am downloading the most important pages first, I want to keep the list of URLs to visit sorted by length (shorter URLs are more likely to be closer to the front page). That presents a problem though, because I'm also adding URLs.
A naïve list (Vec<T>
in Rust), won't be very efficient. After every batch of inserts, I would need to re-sort the list. Wouldn't it be good if there were a data structure that could maintain the sort order itself? Yes! It's called a binary heap.
Rust's standard library provides a binary heap implementation, as std::collections::BinaryHeap
. In my case, I want to sort a list of url::Url
. By default, they are sorted in alphabetic order.
use url::Url;
use std::collections::BinaryHeap;
fn main() {
let raw_urls = vec![
"https://www.youtube.com/watch?v=ywskA8CoulM",
"https://tim.mcnamara.nz/",
"https://github.com/rust-lang/rust/issues?labels=E-easy&state=open",
];
let mut urls = BinaryHeap::new();
for url in raw_urls {
urls.push(Url::parse(url).unwrap());
}
while let Some(url) = urls.pop() {
println!("{:?}", url.as_str());
}
}
This prints our URLs, but in alphabetic order:
"https://www.youtube.com/watch?v=ywskA8CoulM"
"https://tim.mcnamara.nz/"
"https://github.com/rust-lang/rust/issues?labels=E-easy&state=open"
Unfortunately, it's a little bit complicated to ask BinaryHeap
to use a custom sort order. To change the default, I create a new type that wraps one as ShortestFirst(Url)
. Then, I implement Ord
and PartialOrd
traits.
use url::Url;
use std::collections::BinaryHeap;
#[derive(PartialEq, Eq, , Debug)]
struct ShortestFirst(Url);
impl PartialOrd for ShortestFirst {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl Ord for ShortestFirst {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
let left = (other.0).as_str().len();
let right = (self.0).as_str().len();
left.cmp(&right)
}
}
fn main() {
let raw_urls = vec![
"https://www.youtube.com/watch?v=ywskA8CoulM",
"https://tim.mcnamara.nz/",
"https://github.com/rust-lang/rust/issues?labels=E-easy&state=open",
];
let mut urls = BinaryHeap::new();
for url in raw_urls {
urls.push(ShortestFirst(Url::parse(url).unwrap()));
}
while let Some(ShortestFirst(url)) = urls.pop() {
println!("{:?}", url.as_str());
}
}
This prints the URLs, in the order we wanted:
"https://tim.mcnamara.nz/"
"https://www.youtube.com/watch?v=ywskA8CoulM"
"https://github.com/rust-lang/rust/issues?labels=E-easy&state=open"
Top comments (0)