I implemented Silvermirror, a website mirroring tool. It’s got some interesting aspects that make it better than wget in some cases. Let’s dive in!
Silvermirror is a small project created because I find myself mirroring websites rather often. Sometimes that goes well without any special intervention, and sometimes the results leave a lot to be desired.
By default, wget maps URLs to filenames according to a mostly reasonable set of rules. In some cases, though, it produces incorrect results (if the server treats foo and foo/index.html as different files, for instance). The URLs it produces aren't consistent with the original URLs, which can be a problem. You need a separate command line option to map URLs in the output, and --mirror doesn't include it.

And that's just the output on disk. While it runs, wget --mirror prints spammy output instead of useful progress information, and it consumes gigabytes of memory for nontrivial websites.
I implemented Silvermirror in D. It has two components: the crawler and the server.
The crawler is straightforward. It leverages Adam D Ruppe’s arsd-dom to parse HTML. It maintains a queue of URLs to process and a map of URL to metadata. The metadata includes the filename on disk, the source URL, and the content type.
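As a sketch, the crawler's core state might look something like the following. All names here are illustrative, not Silvermirror's actual identifiers:

```d
// Hypothetical sketch of the crawler's state; the real field and type
// names in Silvermirror may differ.
struct Metadata {
    string filename;    // where the downloaded body lives on disk
    string sourceUrl;   // the URL it was fetched from
    string contentType; // e.g. "text/html; charset=utf-8"
}

struct Crawler {
    string[] queue;            // URLs still waiting to be fetched
    Metadata[string] visited;  // URL -> metadata for everything fetched

    // Add a URL to the queue unless we've already fetched it.
    void discover(string url) {
        if (url !in visited)
            queue ~= url;
    }

    // Record a freshly fetched page's metadata.
    void finish(string url, Metadata meta) {
        visited[url] = meta;
    }
}
```

The associative array doubles as the "have we seen this?" check, so discovering the same link twice doesn't refetch it.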
The queue is a simple list of URLs. We store it on disk, so if we abort and restart mirroring, we can pick up where we left off. When we enqueue a new URL, we append it to the on-disk queue, and every so often we rewrite the entire queue so a restart doesn't have to reparse URLs we've already handled.
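A minimal sketch of that persistence scheme (the file name and helper names are mine, not Silvermirror's):

```d
import std.array : join;
import std.file : append, exists, readText, write;
import std.string : splitLines;

enum queueFile = "queue.txt"; // hypothetical on-disk queue location

// Enqueue: remember the URL in memory and append it to the on-disk log,
// so an aborted run can resume where it stopped.
void enqueue(ref string[] queue, string url) {
    queue ~= url;
    append(queueFile, url ~ "\n");
}

// Compact: every so often, rewrite the file with only the URLs that are
// still pending, so a restart doesn't reparse already-processed entries.
void compact(string[] pending) {
    write(queueFile, pending.join("\n") ~ (pending.length ? "\n" : ""));
}

// Resume: reload the pending queue after a restart.
string[] load() {
    return exists(queueFile) ? readText(queueFile).splitLines() : null;
}
```

Appends are cheap and crash-safe enough for this purpose; the periodic compaction just keeps the file from growing without bound.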
Serving the data is straightforward, too. We read our map file from disk, figure out which page the person was talking about, and serve it with the relevant headers.
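The lookup at the heart of that can be sketched like this. This is a hypothetical helper, not the actual server code; the real thing sits behind an HTTP handler and reads the map file from disk first:

```d
import std.typecons : Nullable, nullable;

struct Metadata {
    string filename;    // file to serve from disk
    string sourceUrl;   // original URL
    string contentType; // value for the Content-Type header
}

// Given the URL the client asked for, find the stored file and the
// Content-Type header to serve it with. Illustrative sketch only.
Nullable!Metadata lookup(Metadata[string] map, string requestUrl) {
    if (auto entry = requestUrl in map)
        return nullable(*entry);
    return Nullable!Metadata.init;
}
```

Because the crawler recorded the content type at fetch time, the server can replay it verbatim instead of guessing from file extensions.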
Add templates onto that (and download limits so you can test your templates), and you’re in pretty good shape for grabbing only the data you need.
Future directions include:
- More flexible excludes
- Adding a search index
- More flexible template specification
But for now, I have a tool that does a lot of what I want.
Check it out! Silvermirror on GitHub