Subtex: doing a BBCode with LaTeX

I created Subtex, a LaTeX-inspired markup language for ebooks. Here, I’m going to talk about what caused me to create this language, what it does, and what’s interesting about it.

The status quo

I write fiction. (I’m bad at writing and bad at finishing, so you’ll probably never see my work.)

With LaTeX, I can use macros to express intent. I can also build in some safety against certain errors. For instance, I use the csquotes package extensively. With a bit of futzing, I can write:

As Suetonius wrote: \e{Whether he was divinely inspired or
showed peculiar foresight is an arguable point, but these
were his words:

\e{Very well then, you win! Take him! But never forget that
the man whom you want me to spare will one day prove the
ruin of the party which you and I have so long defended.
There are many Marius's in this fellow Caesar.}}

And that does proper quotes, multiline and nested and everything. And I can add other macros as pleases me. If I only use a handful of macros that I write, that enforces consistency. Then I can convert it to a beautiful PDF when I’m ready.

The problem

LaTeX does PDF natively. (Well, DVI, but also PDF.) But with PDF, I can’t resize fonts. These days, when a lot of people read using ereaders, tablets, and phones, that’s cutting into my potential readership.

If I want to produce actual ebooks, I need epub. From epub, I can convert to anything. How do I do this with LaTeX?

There are two paths. The first is to use latexml to produce HTML, and then convert HTML to an ebook using Calibre. The second is to use htlatex instead of latexml, then inline CSS before passing it off to Calibre. (Calibre will un-inline inline CSS, and it will happily ignore un-inlined CSS.) Latexml is slow — twenty seconds on my reference document of 75,000 words, 420 kilobytes — plus it doesn’t handle a lot of the stuff I use. Htlatex is a clever hack, and it’s a lot faster, but it’s still a little slow, and it’s convoluted.

The third path: Subtex

How much markup do you really need for a novel?

Most of the time, you can make do with just chapter headers, italics, and scene breaks. Throw in quote marks while we’re at it since we were talking about the csquotes package. Add something to set book metadata — author, title, that kind of thing. And let’s use some semantic aliases for things, so I can distinguish a character thinking from emphasis in narration.

From that, we get a lean, minimal markup language, much closer to BBCode in complexity than to LaTeX. I call it Subtex.

I shopped around for a parser generator and found Pegged. It was about an hour’s work to get it up and running and another hour to get HTML output; then I added ePub and plain text output modes. And that worked well. However, it wasn’t any faster than the htlatex path. Don’t get me wrong, it was nice to have a single binary rather than a makefile, and not to have a bajillion temporary files everywhere, but on the whole, it wasn’t as much of an improvement as I was hoping for.

So I tried to optimize the output, worked on that a bit. No dice. D’s GC is pretty good at appending to arrays; switching from concatenation to Appender, reserving as much memory as I thought I’d need, and reducing allocations in general did almost nothing. Then I looked at Pegged.
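As a rough illustration of that change, here is a minimal sketch in D of building output through an Appender with space reserved up front. It is not Subtex's actual output code: `renderParagraphs` and the `<p>` wrapping are hypothetical names chosen for the example.

```d
import std.array : appender;
import std.stdio : write;

// Hypothetical helper (not from Subtex): wraps each paragraph in <p> tags.
string renderParagraphs(string[] paragraphs)
{
    auto buf = appender!string();

    // Reserve roughly what we expect to need so the buffer rarely regrows.
    size_t total = 0;
    foreach (p; paragraphs)
        total += p.length + "<p></p>\n".length;
    buf.reserve(total);

    foreach (p; paragraphs)
    {
        buf.put("<p>");
        buf.put(p);
        buf.put("</p>\n");
    }
    return buf.data;
}

void main()
{
    write(renderParagraphs(["One.", "Two."]));
}
```

The naive alternative is `result ~= "<p>" ~ p ~ "</p>\n"` in a loop, which allocates a temporary string per paragraph.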

Pegged was taking so much time that parsing a document took nearly as long as parsing it and producing output. Keeping the GC from collecting while the Pegged parser was running cut the time in half. But that still left 0.8 seconds.
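For readers who haven't done this in D, suspending collections around an allocation-heavy step looks roughly like the sketch below. `allocateLots` is a hypothetical stand-in for the Pegged parse, not the real call. Note that `GC.disable` only suspends automatic collections; the runtime may still collect if it runs out of memory.

```d
import core.memory : GC;
import std.stdio : writeln;

// Hypothetical stand-in for an allocation-heavy parse step.
int[][] allocateLots(size_t n)
{
    int[][] result;
    foreach (i; 0 .. n)
        result ~= new int[](16);
    return result;
}

int[][] parseWithoutCollections(size_t n)
{
    GC.disable();              // suspend automatic collections
    scope (exit) GC.enable();  // re-enable on every exit path, including exceptions
    return allocateLots(n);
}

void main()
{
    writeln(parseWithoutCollections(1000).length);
}
```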

Finally I broke down and wrote my own parser. It took an hour. There were no bugs. It reported several errors that the LaTeX parser hadn’t caught. And it was easy.

D’s slicing is what made it so simple. My parser maintained a slice of the input: the portion that hadn’t yet been parsed. In other languages, I’d maintain an index into the input string and write some convenient helper functions around it. In D, it’s trivial to make slices, so just keeping a slice of the input for what I haven’t parsed is the easiest way.
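Here is a minimal sketch of the "slice of what's left" technique. It is an illustration, not the real Subtex parser: the `Node` and `Parser` types are hypothetical, and the grammar is assumed to be only plain text plus `\name{...}` commands with alphabetic names.

```d
import std.ascii : isAlpha;
import std.stdio : writeln;

struct Node
{
    string command;   // empty for plain text
    string text;      // plain text content (a slice of the input, not a copy)
    Node[] children;  // contents of \command{...}
}

struct Parser
{
    string rest;  // the portion of the input that hasn't been parsed yet

    Node[] parseNodes()
    {
        Node[] nodes;
        while (rest.length > 0 && rest[0] != '}')
            nodes ~= (rest[0] == '\\') ? parseCommand() : parseText();
        return nodes;
    }

    Node parseText()
    {
        size_t end = 0;
        while (end < rest.length && rest[end] != '\\' && rest[end] != '}')
            end++;
        auto node = Node("", rest[0 .. end]);
        rest = rest[end .. $];  // advance by re-slicing; no index to track
        return node;
    }

    Node parseCommand()
    {
        rest = rest[1 .. $];  // skip the backslash
        size_t end = 0;
        while (end < rest.length && isAlpha(rest[end]))
            end++;
        if (end == rest.length || rest[end] != '{')
            throw new Exception("expected '{' after command name");
        auto name = rest[0 .. end];
        rest = rest[end + 1 .. $];  // skip the name and the '{'
        auto children = parseNodes();
        if (rest.length == 0)
            throw new Exception("unclosed command: " ~ name);
        rest = rest[1 .. $];  // skip the '}'
        return Node(name, "", children);
    }
}

Node[] parse(string input)
{
    auto parser = Parser(input);
    auto nodes = parser.parseNodes();
    if (parser.rest.length > 0)  // parseNodes stopped at a stray '}'
        throw new Exception("unexpected '}'");
    return nodes;
}

void main()
{
    auto nodes = parse(`As he wrote: \e{outer \e{inner}} end`);
    writeln(nodes.length);
}
```

Every step is "look at the front of `rest`, then replace `rest` with a shorter slice of itself". Slices in D are a pointer and a length, so none of this copies the input.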

Also, I would normally maintain line and column numbers for where in the input I am. But with the slice approach, I could just keep the full input string around, then, if I encountered an error, take the original input up to my slice and count the number of newlines in it (and characters since the last newline).
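A sketch of that calculation, assuming the unparsed slice is always a suffix of the full input (so the consumed part is everything before it). `Position` and `positionOf` are hypothetical names, and the column here counts UTF-8 code units rather than visible characters.

```d
import std.algorithm.searching : count;
import std.stdio : writeln;
import std.string : lastIndexOf;

struct Position
{
    size_t line;    // 1-based
    size_t column;  // 1-based, in UTF-8 code units
}

// Assumes `rest` is the unparsed tail of `full`.
Position positionOf(string full, string rest)
{
    assert(rest.length <= full.length);
    auto consumed = full[0 .. full.length - rest.length];

    // Line number: newlines seen so far, plus one.
    auto line = consumed.count('\n') + 1;

    // Column: distance from the last newline (lastIndexOf gives -1 if none).
    auto lastNewline = consumed.lastIndexOf('\n');
    auto column = consumed.length - (lastNewline + 1) + 1;

    return Position(line, column);
}

void main()
{
    string full = "ab\ncd\nef";
    writeln(positionOf(full, full[4 .. $]));
}
```

The cost is a scan of the consumed input, but it only happens when there is an error to report, so the normal parsing path carries no bookkeeping at all.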

The result

What I have now is a language with about a dozen core functions — scene breaks, timeskips, thinking and emphasis, including images and CSS. It includes unrecognized commands as CSS classes, so it is trivially extensible. I also, obviously, have a compiler that can produce books in plain text, Markdown, HTML, and ePub. (There are still some compatibility issues with older ereaders that I’m looking into.)
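The "unrecognized commands become CSS classes" idea can be sketched like this for HTML output. The command name `emph` and the exact tags are assumptions for the example, not Subtex's real command set, and the sketch assumes command names are plain alphabetic identifiers that need no escaping.

```d
import std.stdio : writeln;

// Hypothetical renderer: known commands map to specific tags,
// anything else becomes a <span> carrying the command name as its class.
string renderCommand(string name, string innerHtml)
{
    switch (name)
    {
        case "emph":
            return "<em>" ~ innerHtml ~ "</em>";
        default:
            return `<span class="` ~ name ~ `">` ~ innerHtml ~ "</span>";
    }
}

void main()
{
    writeln(renderCommand("emph", "really"));
    writeln(renderCommand("letter", "Dear reader"));
}
```

With this shape, adding a new kind of markup to a book means writing `\letter{...}` in the source and a `.letter` rule in the stylesheet, with no change to the compiler.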

One of the original complaints was time. Two-ish seconds to convert to epub isn’t huge, but it’s significant. The 0.8 seconds with Pegged was still annoying.

Subtex now compiles the 420-kilobyte reference document in 0.04 seconds.

I think I can declare victory.
