# Mirror, mirror on the wall — who’s the scrappiest scraper of all?

I implemented Silvermirror, a website mirroring tool. It’s got some interesting aspects that make it better than wget in some cases. Let’s dive in!

Silvermirror is a small project created because I find myself mirroring websites rather often. Sometimes it goes well without any special intervention:

Broken image links, but that’s fine.

And sometimes it leaves a lot to be desired:

The page layout is terrible and there are some encoding errors causing garbage characters to be printed.

By default, wget maps URLs to local files according to a mostly reasonable set of rules. In some cases, though, that mapping produces incorrect results (if foo/ and foo/index.html are different files, for instance). The URLs it produces aren’t consistent with the original URLs, which can be a problem. And you need to add a special command line option to map URLs at all; it’s not included in --mirror.

And that’s just the result. While it runs, wget --mirror prints spammy logs instead of useful progress information, and it consumes gigabytes of memory for nontrivial websites.

### Implementing Silvermirror

I implemented Silvermirror in D. It has two components: the crawler and the server.

The crawler is straightforward. It leverages Adam D Ruppe’s arsd-dom to parse HTML. It maintains a queue of URLs to process and a map of URL to metadata. The metadata includes the filename on disk, the source URL, and the content type.
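To make the queue-plus-metadata structure concrete, here is a minimal sketch in Python (not Silvermirror’s actual D code, and using the standard library’s `html.parser` rather than arsd-dom). The names `PageMeta`, `LinkExtractor`, `crawl`, and the caller-supplied `fetch` function are all hypothetical, and writing page bodies to disk is left out.

```python
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


@dataclass
class PageMeta:
    filename: str      # where the body would be stored on disk
    source_url: str    # the URL it was fetched from
    content_type: str  # as reported by the server


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, limit=100):
    """Crawl one host breadth-first and return a URL -> PageMeta map.

    `fetch(url)` is supplied by the caller (an assumption of this sketch)
    and returns a (content_type, body_text) pair.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        if url in pages:
            continue  # already processed
        content_type, body = fetch(url)
        # Filenames come from a counter, so foo/ and foo/index.html never collide.
        pages[url] = PageMeta(f"{len(pages):06d}.dat", url, content_type)
        if not content_type.startswith("text/html"):
            continue  # only HTML is parsed for further links
        extractor = LinkExtractor()
        extractor.feed(body)
        for link in extractor.links:
            absolute, _fragment = urldefrag(urljoin(url, link))
            if urlparse(absolute).netloc == host and absolute not in pages:
                queue.append(absolute)
    return pages
```

Keeping the filename in the metadata, rather than deriving it from the URL path, is one way to sidestep the mapping ambiguities described earlier. Whether Silvermirror names files this way is not stated here.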

The queue is a simple list of URLs. We store it on disk, so if we abort and restart mirroring, we can pick up where we left off. When we enqueue a new URL, we append it to the on-disk queue, and every so often, we rewrite the entire queue so that a restart doesn’t have to reparse URLs that were already processed.
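The append-then-occasionally-rewrite pattern can be sketched as follows, again in Python rather than D. The class name `DiskQueue`, the one-URL-per-line file format, and the `compact_every` threshold are assumptions made for illustration.

```python
import os


class DiskQueue:
    """URL queue backed by a text file with one URL per line."""

    def __init__(self, path, compact_every=1000):
        self.path = path
        self.compact_every = compact_every
        self.pending = []
        self.head = 0  # index of the next URL to hand out
        if os.path.exists(path):
            # Resume: reload whatever was queued before the restart.
            with open(path, encoding="utf-8") as f:
                self.pending = [line.strip() for line in f if line.strip()]

    def push(self, url):
        """Enqueue a URL and append it to the on-disk file."""
        self.pending.append(url)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(url + "\n")

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        if self.head >= len(self.pending):
            return None
        url = self.pending[self.head]
        self.head += 1
        if self.head >= self.compact_every:
            self.compact()
        return url

    def compact(self):
        """Rewrite the file so it holds only URLs not yet handed out."""
        self.pending = self.pending[self.head:]
        self.head = 0
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.writelines(u + "\n" for u in self.pending)
        os.replace(tmp, self.path)  # swap in the compacted file
```

In this sketch, appends are cheap and the full rewrite happens only once per `compact_every` pops. Between rewrites, a restart reloads some URLs that were already handled, so the crawler would need to skip those using its metadata map.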

Serving the data is straightforward, too. We read our map file from disk, figure out which page the request refers to, and serve it with the relevant headers.
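The lookup step might look roughly like this in Python. The JSON layout of the map file, the function names `load_map` and `respond`, and the `site_root` and `data_dir` parameters are all assumptions of this sketch, not Silvermirror’s actual format or API.

```python
import json
import os
from urllib.parse import urljoin


def load_map(path):
    """Load the URL -> metadata map (JSON layout assumed for this sketch)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def respond(pages, site_root, data_dir, request_path):
    """Return (status, headers, body) for a request path such as '/foo/'."""
    # Rebuild the original URL so the lookup key matches what the crawler stored.
    original_url = urljoin(site_root, request_path)
    meta = pages.get(original_url)
    if meta is None:
        return 404, {"Content-Type": "text/plain"}, b"not mirrored\n"
    with open(os.path.join(data_dir, meta["filename"]), "rb") as f:
        body = f.read()
    headers = {
        "Content-Type": meta["content_type"],
        "Content-Length": str(len(body)),
    }
    return 200, headers, body
```

Because the lookup key is the original URL, foo/ and foo/index.html can map to different files and the mirrored site keeps the same URL structure as the source.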

Add templates onto that (and download limits so you can test your templates), and you’re in pretty good shape for grabbing only the data you need.

Future directions include:

• More flexible excludes