Even more efficient UTF8 processing in C#

Last time, we took a text processing program and slashed its string allocations by a factor of one thousand, with execution time going from 1.02s to 0.07s. That’s not good enough. Let’s make it even faster.

For those of you who are just joining us, the project we’re working with is SubTex, which converts from a pared-down LaTeX-like language into HTML and epub. We have a version in native code using the D programming language, and now we’re rewriting it in C#. Also, we’re running this on Mono rather than dotnet-core because that’s what my IDE supports.

The D version produced an epub of our reference document (which is 413kb) in 0.02s, so our target is a 70% reduction in execution time. And last time, the bulk of our work was in producing a StringSlice struct that lets us take a substring without allocating new memory for it.

The first obvious inefficiency: we’re transcoding a UTF8 text file into UTF16, performing a bunch of operations on it, and then converting it back into UTF8 (since UTF8 is the standard encoding and we don’t trust a random ereader to handle UTF16 documents). Let’s fix that.

struct utf8

Unlike D, C# doesn’t have built-in support for UTF8. We’ve got to create that ourselves. And remembering the gains we got from StringSlice, let’s apply that lesson here.

Our basic structure wraps an ArraySegment:

public struct utf8
{
  ArraySegment<byte> _bytes;

  public utf8(ArraySegment<byte> bytes) { _bytes = bytes; }
  public utf8(byte[] bytes) : this(new ArraySegment<byte>(bytes)) {}
  public utf8(string str) : this(Encoding.UTF8.GetBytes(str)) {}
}

It’s unfortunate that we have to transcode when we’re given a UTF16 string, but that’s unavoidable.

We need to iterate through the UTF8 bytes rather often. The encoding is straightforward:

  • A byte whose high bit is not set is a single-byte sequence. Its meaning is identical to ASCII.
  • A byte beginning with the bits 10 is a continuation byte in a multibyte sequence. Its first two bits are not part of the encoded value.
  • A byte beginning with n 1 bits (followed by a 0 bit) is the first byte of a sequence comprising n bytes.
    Its first n+1 bits are not part of the encoded value.

This makes it straightforward to iterate forward and backward, and to find the nearest character to a byte offset.
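As a minimal sketch of that iteration logic (the names here are illustrative, not SubTex’s actual API): the lead byte tells us the sequence length, and stepping backward just means skipping continuation bytes until we hit a lead byte.

```csharp
using System;

static class Utf8Walk
{
    // Number of bytes in the sequence that starts with this lead byte.
    public static int SequenceLength(byte lead)
    {
        if ((lead & 0b1000_0000) == 0) return 1;            // 0xxxxxxx: ASCII
        if ((lead & 0b1110_0000) == 0b1100_0000) return 2;  // 110xxxxx
        if ((lead & 0b1111_0000) == 0b1110_0000) return 3;  // 1110xxxx
        if ((lead & 0b1111_1000) == 0b1111_0000) return 4;  // 11110xxx
        return 1; // continuation or invalid byte; treat as one byte
    }

    // Nearest character boundary at or before the previous byte:
    // skip continuation bytes (10xxxxxx) until we reach a lead byte.
    public static int PreviousCharStart(byte[] bytes, int offset)
    {
        int i = offset - 1;
        while (i > 0 && (bytes[i] & 0b1100_0000) == 0b1000_0000) i--;
        return i;
    }
}
```

Finding the nearest character to an arbitrary byte offset is the same trick: scan backward past continuation bytes, then SequenceLength tells you where the character ends.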

We also need to implement our own version of StringBuilder, since the built-in one only works with UTF16 chars.

This is a decent chunk of work. Mostly straightforward; the most complex parts are handling UTF8 encoding and implementing utf8.Format.
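The builder side can be sketched as a growable byte buffer; this is an assumed shape, not SubTex’s actual implementation, and it omits the Format machinery mentioned above.

```csharp
using System;
using System.Text;

// A minimal StringBuilder analogue over raw UTF8 bytes: append into a
// growable buffer, copy out once at the end. (Sketch only.)
class Utf8Builder
{
    byte[] _buf = new byte[256];
    int _len;

    public void Append(byte[] bytes)
    {
        if (_len + bytes.Length > _buf.Length)
            Array.Resize(ref _buf, Math.Max(_buf.Length * 2, _len + bytes.Length));
        Buffer.BlockCopy(bytes, 0, _buf, _len, bytes.Length);
        _len += bytes.Length;
    }

    public byte[] ToArray()
    {
        var result = new byte[_len];
        Buffer.BlockCopy(_buf, 0, result, 0, _len);
        return result;
    }
}
```

Doubling the buffer on overflow keeps appends amortized O(1), the same strategy StringBuilder uses internally.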

Rollout: HTML generation

I changed the HTML generation to use utf8, reran, and saw:

0.03user 0.00system 0:00.04elapsed 95%CPU (0avgtext+0avgdata 42980maxresident)k
0inputs+168outputs (0major+3806minor)pagefaults 0swaps

That’s not quite the 70% reduction we were hoping for, but it’s close. (Ten milliseconds off, give or take.)

The most unfortunate aspect of this is that we need to pull all our string literals out into static variables (because the CLR encodes string literals in UTF-16 and we don’t want to pay the transcoding cost more than necessary).
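Concretely, hoisting a literal into a static field means we pay the UTF16-to-UTF8 transcoding cost once at startup instead of on every use. A sketch (the field names are illustrative):

```csharp
using System.Text;

// Each literal is transcoded from UTF16 to UTF8 exactly once, when the
// class is initialized, rather than every time we emit a tag.
static class Tags
{
    public static readonly byte[] OpenParagraph  = Encoding.UTF8.GetBytes("<p>");
    public static readonly byte[] CloseParagraph = Encoding.UTF8.GetBytes("</p>");
}
```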

The moral

What can we take away from this?

Native code is fast, but not outrageously faster than Mono.

.NET’s string implementation is not designed for speed with UTF8. At all.

There are probably some more allocations we can seize back, but at this point, I think we’re satisfied.