Performant text processing in C#

I recently converted Subtex from D to C#. With the same amount of optimization, the C# version was thirty times slower. Let’s look at why.

Subtex is a command line utility that takes text in a simplified LaTeX-style language and produces output in several formats, most commonly HTML and epub (which is HTML).

Parsing and Substrings

A strategy I commonly use when writing a parser is to maintain a substring of the input representing the unparsed portion. This is quite convenient: it means I very rarely have bugs where I re-read something I've already parsed.
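As a minimal sketch of that pattern (not Subtex's actual parser; the handlers and syntax details here are made up), the loop always works on remaining and only ever shrinks it from the front:

using System;

static class ParserSketch
{
  public static void Parse(string input)
  {
    // `remaining` is always the unparsed tail of the input.
    string remaining = input;
    while (remaining.Length > 0)
    {
      if (remaining[0] == '\\')
      {
        // A command such as \chapter{...}: take the name, keep the rest.
        int brace = remaining.IndexOf('{');
        if (brace < 0) brace = remaining.Length;
        Console.WriteLine("command: " + remaining.Substring(1, brace - 1));
        // Each Substring call copies its characters; that cost is what the
        // rest of this post is about.
        remaining = remaining.Substring(brace);
      }
      else
      {
        // Plain text up to the next command.
        int next = remaining.IndexOf('\\');
        if (next < 0) next = remaining.Length;
        Console.WriteLine("text: " + remaining.Substring(0, next));
        remaining = remaining.Substring(next);
      }
    }
  }
}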

Beyond that, I need to keep substrings for a lot of things. This language is for producing ebooks, so I need substrings representing data that will be in the output ebook. I need substrings for commands (though I could potentially avoid that with some work). I need substrings for links to external resources to include. Substrings everywhere.

Substring implementation

In D, it’s free to take a substring of a string. In C#, it’s expensive.

There are two ways you could implement the substring method:

  1. Create a new string pointing to the same data, but with start and length changed
  2. Copy the relevant portion of the character array and make a new string from that

We’ll call the first method slicing and the second copying.

Slicing is cheaper. It reduces the number of allocations you make, and if your string type is a value type, it doesn’t even touch the garbage collector. It’s pretty good for a lot of workflows.

Unfortunately, slicing has one downside: it keeps the entire original string alive for as long as any slice of it is reachable. Take a ten-character slice of a 400-kilobyte document and the whole 400 kilobytes stays on the heap.

Copying is the reverse: taking a substring is expensive. Every substring is a fresh allocation, which brings the garbage collector's data structures into cache, might trigger a collection, and might even involve the OS if the heap needs to grow. However, if you're taking a substring of a large string and then throwing the source away, you're not wasting memory keeping the original alive.

You can implement copying from slicing easily: slice and then duplicate. Unfortunately, you can’t easily implement slicing from copying.
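To make the relationship concrete (this is just an illustration, not Subtex code), here it is over a char[], using ArraySegment&lt;char&gt; as the slice so that slicing itself allocates nothing:

using System;

static class Substrings
{
  // Slicing: just remember (array, offset, count); nothing is copied.
  public static ArraySegment<char> Slice(char[] text, int start, int length)
  {
    return new ArraySegment<char>(text, start, length);
  }

  // Copying built on slicing: take the slice, then duplicate its contents.
  public static string Copy(char[] text, int start, int length)
  {
    var slice = Slice(text, start, length);
    return new string(slice.Array, slice.Offset, slice.Count);
  }
}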

The profile report for the first C# version started:

Allocation summary
     Bytes      Count  Average Type name
3795359440      24278   156329 System.String

Yes, that’s nearly 4GB allocated. The reference document is about 410 kilobytes.

On top of that, it took over a second to run even without the profiler attached. Nowhere near good enough.

The solution

C# doesn’t give us slices natively, so we’ll create our own.

I created a struct called StringView, which is a view into a portion of a string:

public struct StringView
{
  private readonly string _str;
  private readonly int _start, _end;

  // A view over the entire string.
  public StringView(string data)
  {
    _str = data;
    _start = 0;
    _end = data.Length;
  }

  // A view over [start, end) of the string.
  public StringView(string data, int start, int end)
  {
    _str = data;
    _start = start;
    _end = end;
  }

  // Taking a substring just narrows the view: no copying, no allocation.
  public StringView Substring(int start, int length)
  {
    return new StringView(
      _str,
      _start + start,
      _start + start + length);
  }
}
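Using it looks something like this (the file name is hypothetical); constructing a view and slicing it never allocate:

var document = new StringView(System.IO.File.ReadAllText("book.sub")); // one string for the whole input
var chapter = document.Substring(1024, 2048);                          // no copy, no GC work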

Then I added a bunch of methods for common string manipulation operations — StartsWith, IndexOf, Split, and so on.
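These are straightforward to write against the view's fields. Roughly, inside StringView (a sketch, not necessarily the exact code in Subtex):

// True if the view begins with `value`; compares in place, allocates nothing.
public bool StartsWith(string value)
{
  if (value.Length > _end - _start)
    return false;
  for (int i = 0; i < value.Length; i++)
    if (_str[_start + i] != value[i])
      return false;
  return true;
}

// Index of `c` within the view (relative to the view's start), or -1 if absent.
public int IndexOf(char c)
{
  int found = _str.IndexOf(c, _start, _end - _start);
  return found < 0 ? -1 : found - _start;
}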

I swapped out string for StringView in my parser and primary data structures, only converting back to string when building HTML documents.
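That conversion back is the one place a copy still happens. A sketch of the boundary, again inside StringView (the override is my assumption about how it looks):

// The one remaining copy: materialise the viewed characters as a real string.
public override string ToString()
{
  return _str.Substring(_start, _end - _start);
}

Output code can then call ToString once per piece of text it actually emits, instead of paying for a copy at every step of parsing.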

Results!

Let’s check what the profiler says now:

Allocation summary
     Bytes      Count  Average Type name
   3837872      11847      323 System.String

We dropped our allocations by a factor of one thousand — we’ve got about ten bytes of string for each byte of input document across the lifetime of the application. But what did it do to our performance?

Before, it took us 1.02s to compile the reference document. Now, it’s 0.12s. As a bonus, I compiled everything using Mono’s AOT compiler, which shaved off about 50ms:

0.07user 0.01system 0:00.09elapsed 97%CPU (0avgtext+0avgdata 45976maxresident)k
0inputs+592outputs (0major+3913minor)pagefaults 0swaps

How does this stack up against the D version?

0.02user 0.00system 0:00.03elapsed 73%CPU (0avgtext+0avgdata 5616maxresident)k
5856inputs+928outputs (1major+729minor)pagefaults 0swaps

50ms slower than native code. That's not bad at all. We're still using nearly nine times as much memory, though; we'll see what we can do about that.