Configuration as a service

I’m working on a rule engine targeted toward configuration-as-a-service and experiment configuration. Since it’s nontrivial and not much exists in this space, I thought I’d talk about it here for a bit.

Configuration as a service? Huh?

There are a few things this can be used for.

Recall when Google wanted to test out 41 different shades of blue for search result links? They used an experiment system to enroll randomized segments of the userbase into each treatment. That’s one use case we want to support.

Let’s say I’m implementing a phone app and it’s got a new feature that I want to get out as soon as possible. I need to QA it on each device, but I’m pretty sure it’ll just work. So I ship my update, but I keep the feature off by default. Then I add a rule to my configuration service to turn it on for the devices I’ve QA’ed it on. As I finish QA on a given device, I update the rule to turn the feature on for that device.
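Here is a rough sketch of what that device rollout rule could look like, written in the Javascript style used for rules later in this post. The `newFeatureEnabled` flag, the `qaedDevices` list and the `deviceRule` wrapper are names I made up for illustration; only `input.device.name` matches the example rule further down.

```javascript
// Hypothetical list of device codenames that have passed QA so far.
// Finishing QA on another device means adding one entry here.
var qaedDevices = ['bacon', 'bullhead'];

// The rule body, wrapped in a function so it can be tried out standalone.
function deviceRule(input) {
	var output = { newFeatureEnabled: false }; // the feature ships turned off
	if (input.device && qaedDevices.indexOf(input.device.name) !== -1) {
		output.newFeatureEnabled = true;
	}
	return output;
}

console.log(deviceRule({ device: { name: 'bacon' } }));
console.log(deviceRule({ device: { name: 'hammerhead' } }));
```

The app itself never changes: rolling out to one more device is an edit to the list in the rule.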

Or maybe I need to take legal steps in order to provide a feature in a given country. The client sends its location, and I’ve added rules to determine if that location is one where I can legally enable that feature. The configuration might also include, for instance, which of my API endpoints the client should use to store any server-side data, since some jurisdictions, the EU among them, restrict where user data may be stored.
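A location rule along those lines might look something like the following. Everything specific in it is a placeholder: the country codes, the `example.com` hostnames, the `paymentsEnabled` flag and the `input.country` field are assumptions for the sake of the sketch, not legal guidance or a real API.

```javascript
// Hypothetical per-country settings, maintained as legal sign-off arrives.
var regions = {
	DE: { apiEndpoint: 'https://eu.api.example.com', paymentsEnabled: true },
	FR: { apiEndpoint: 'https://eu.api.example.com', paymentsEnabled: false },
	US: { apiEndpoint: 'https://us.api.example.com', paymentsEnabled: true }
};

function locationRule(input) {
	// Unknown or missing country: keep the feature off and send no endpoint.
	var output = { paymentsEnabled: false };
	if (Object.prototype.hasOwnProperty.call(regions, input.country)) {
		var region = regions[input.country];
		output.paymentsEnabled = region.paymentsEnabled;
		output.apiEndpoint = region.apiEndpoint;
	}
	return output;
}

console.log(locationRule({ country: 'DE' }));
console.log(locationRule({ country: 'BR' }));
```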

What are we implementing?

We want to offer a multitenant service so that you can pay us a bit of money and get our glorious configuration service.

You will submit JSON metadata to us and get JSON configuration back. You will enter rules in a UI; we’ll execute those rules against the metadata to get your configuration. The rule UI will let you say: this rule comes into effect on this date and stops on that date; it’s got this priority; let’s test it against this sample input… Not too complex, but some complexity, because real people need it.
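To make the scheduling and priority part concrete, here is a minimal sketch of picking the rules that are in effect at a given moment. The record shape (`id`, `priority`, `startsAt`, `endsAt`) is an assumption, and so is the convention that higher-priority rules run later so they can override earlier output.

```javascript
// Hypothetical rule records as the management UI might store them.
var storedRules = [
	{ id: 'holiday-banner', priority: 10, startsAt: '2016-12-01T00:00:00Z', endsAt: '2017-01-02T00:00:00Z' },
	{ id: 'defaults',       priority: 0,  startsAt: null, endsAt: null },
	{ id: 'old-promo',      priority: 5,  startsAt: '2015-01-01T00:00:00Z', endsAt: '2015-02-01T00:00:00Z' }
];

// Return the rules in effect at `now` (milliseconds since the epoch),
// lowest priority first. A null start or end date means "no limit".
function activeRules(rules, now) {
	return rules
		.filter(function (rule) {
			var started = rule.startsAt === null || Date.parse(rule.startsAt) <= now;
			var notEnded = rule.endsAt === null || now < Date.parse(rule.endsAt);
			return started && notEnded;
		})
		.sort(function (a, b) { return a.priority - b.priority; });
}

var christmas = Date.parse('2016-12-25T12:00:00Z');
console.log(activeRules(storedRules, christmas).map(function (r) { return r.id; }));
```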

There are two basic parts: first, a service to execute rules; then, a website to manage rules. In between we have a rule engine.

Any significant caveats?

We’re running a configuration / experimentation service. We want third parties to use it. That means security.

We need to prevent you from calling System.exit() in the middle of your rules and bringing down our service. All that normal, lovely sandboxing stuff. Timeouts, too.

Also, you’re updating your rules pretty frequently. We need to be able to reload them on the fly.

Rules are code, and code can have bugs. We’ll have to watch for thrown exceptions and report them.
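The control flow for those three caveats (no access to the host, a time limit, and caught exceptions) can be sketched in a few lines. I’m using Node’s built-in `vm` module purely as a stand-in here: it is not a hardened sandbox and it is not the engine this service is built on. The `runRule` helper and its result shape are made up for the example.

```javascript
var vm = require('vm');

// Run one rule's source against an input without letting it throw into the caller.
// NOTE: Node's vm module is a stand-in for illustration, not a real security boundary.
function runRule(ruleSource, input, timeoutMs) {
	var sandbox = { inputString: JSON.stringify(input), resultString: null };
	var wrapped =
		'var input = JSON.parse(inputString);\n' +
		'var output = {};\n' +
		ruleSource + '\n' +
		'resultString = JSON.stringify(output);';
	try {
		// The script only sees the sandbox object plus the standard JS builtins.
		vm.runInNewContext(wrapped, sandbox, { timeout: timeoutMs });
		return { ok: true, output: JSON.parse(sandbox.resultString) };
	} catch (err) {
		// Covers syntax errors, exceptions thrown by the rule, and timeouts.
		return { ok: false, error: String(err && err.message ? err.message : err) };
	}
}

console.log(runRule("output.greeting = 'hi ' + input.name;", { name: 'Ann' }, 50));
console.log(runRule('process.exit(1);', {}, 50)); // `process` does not exist inside the context
console.log(runRule('while (true) {}', {}, 50));  // interrupted by the timeout
```

The failed runs come back as ordinary values, so the service can log them against the rule and carry on with the request.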

What’s already out there?

Drools

The heavy hitter, Drools has been around since the dinosaurs roamed the earth. It’s not easy to work with. It takes way too much code to initialize it, and most of that code is creating sessions and factories and builders and containers that have no discernible purpose. If you try to read the code to figure out what it all means, prepare for disappointment: it’s a snarl of interfaces and fields set via dependency injection and implementations in separate repositories.

Drools rules accept Java objects and produce output by mutating their inputs. That means I need a real Java class for input and another for output. Their rule workbench lets you create your Java classes, but that means you need to publish your project to Maven. And loading multiple versions of a rule is an exercise in pain.

On the plus side, it gives you a rule workbench out of the box, and it has a reasonable security story. However, it doesn’t have any way to limit execution time that I’ve found, meaning you have to run rules in a separate thread and kill them if they take too long. This isn’t nice.

Easy Rules

The new kid on the block, it uses Java as a rule language, which brings us to JAR hell like Drools. Unfortunately, it doesn’t supply a workbench, it doesn’t offer a way to provide inputs and retrieve outputs, and it doesn’t have any sandboxing or time limits. At least the code is relatively straightforward to navigate.

Everyone else

OpenRules is based on Excel. Let’s not go there.

N-Cube uses Groovy as a DSL, which implies compiling to a JAR. It’s also got almost no documentation.

There are several others that haven’t been updated since 2008.

So they all suck?

No. They’re built for people who want to deploy a set of rules for their application within their application. They’re for people who trust the people writing business rules. We are building a service whose sole purpose is to supply a rule engine, where untrusted people are executing code.

When you are building a service specifically for one task, you shouldn’t be surprised when off-the-shelf components don’t cut it.

When you are building a multitenant service, libraries performing similar tasks often fall short of your needs.

What do we do?

The core thing that our service does is run user code. Let’s bring in a scripting engine. And since we’re going to accept JSON and emit JSON, let’s use a language that makes that natural. Let’s use Javascript.

The Rhino scripting engine makes it easy to run code and easy to filter which classes a script is allowed to use. Let’s just use that. Now we accept a rule from a user, wrap it in a light bit of code, and run it:

// we inject inputString as the raw json string
var input = JSON.parse(inputString);
var output = {};
// insert user code here

When we want to run it, we can just write:

Context ctx = Context.enter();
try {
	ctx.setClassShutter(name -> {
		// forbid it from accessing any java objects
		// (as a practical matter, I probably want to allow a JsonObject implementation)
		return false;
	});
	if (rule.compiledScript == null) {
		compile(rule);
	}
	Scriptable scope = ctx.initStandardObjects();
	scope.put("inputString", scope, Context.toObject(inputString, scope));
	rule.compiledScript.exec(ctx, scope);
	response.write(scope.get("output", scope));
} finally {
	// always release the context, even if the rule throws
	Context.exit();
}

That’s not the whole story — we want to limit the amount of time it has to finish executing, set up logging and helper functions, all that jazz. We need to locate the rule somehow. We probably have multiple rules to run, and we have to propagate partial output objects between them (or merge them after). We also have to determine what order they should run in.
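As a sketch of the “propagate partial output objects” part, here is one naive way to chain several rules: each rule gets a copy of the output so far, and its changes are kept only if it finishes without throwing. The rule list, the `(input, output)` function shape and the `runAll` helper are all hypothetical, and plain functions stand in for compiled scripts.

```javascript
// Hypothetical rules, already sorted into the order they should run in.
// Each one mutates `output`, like the wrapped scripts do.
var orderedRules = [
	{ id: 'defaults', run: function (input, output) { output.theme = 'light'; output.pageSize = 20; } },
	{ id: 'broken',   run: function (input, output) { output.theme = 'oops'; throw new Error('bug in rule'); } },
	{ id: 'tablets',  run: function (input, output) { if (input.device.type === 'tablet') { output.pageSize = 50; } } }
];

function runAll(rulesInOrder, input) {
	var output = {};
	var errors = [];
	rulesInOrder.forEach(function (rule) {
		// Work on a copy so a rule that throws halfway leaves no partial writes behind.
		var draft = JSON.parse(JSON.stringify(output));
		try {
			rule.run(input, draft);
			output = draft;
		} catch (err) {
			errors.push({ rule: rule.id, message: String(err && err.message ? err.message : err) });
		}
	});
	return { output: output, errors: errors };
}

console.log(runAll(orderedRules, { device: { type: 'tablet' } }));
```

Merging separate outputs after the fact is the other option mentioned above; chaining is simply the shorter one to write down.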

But, for what this does, it’s maybe half as much code as Drools takes.

What’s so much better about your approach?

The first huge advantage is that I’m using a scripting engine, one that doesn’t shove a bunch of classes into the global classloader. That means I can update everything on the fly. I’d get the same if I made Drools talk JSON, but that’s harder than writing my own engine.

Compared to Drools or Easy Rules, I don’t have to maintain a build server and figure out how to build and package a Java project I generate for each rule. I just shove some text into a database.

Javascript handles JSON objects quite well, which means not having to create a Java class for every input and output. That is the largest part of the savings: Drools would be acceptable if it could talk JSON.

The people writing these rules are likely to be developers, not managers or analysts. They probably know Javascript, or can fake it pretty well.

What’s the catch?

Drools is huge and complex for three reasons.

First, it had significant development going on in an age when huge complex coding was de rigueur in Java.

Second, it had a separation between API and implementation enforced for historical and practical reasons.

And third, it solves complex problems.

You want your rules to just work. Drools has a lot of thought behind it to determine what “just working” should look like and make sure it happens. We haven’t put in that thought. I think the naive approach is pretty close to the intuitive result, but I haven’t verified that.

The rules accept and generate JSON. This means you lose type safety. On the other hand, the API accepts and generates JSON anyway, so this is pushing things a step further. Not great, but not the end of the world.

Javascript is kind of ugly, and we’re promoting its use. It’s going to be a bit crufty and verbose at times. The point of writing business rules in the Drools language or the like is that managers can read them, and we’re kind of missing that.

What do these rules look like?

An example rule:

if (input.device.name == 'bacon') {
	output.message = 'Congrats on your OnePlus One!';
}
if (input.device.name == 'bullhead') {
	output.message = 'Congrats on your Nexus 5X!';
}
if (input.device.uptime > 31 * 24 * 60 * 60) {
	output.sideMessage = "It's been a month. You might want to reboot your phone.";
}
output.homeScreenTreatment = Treatments.choose(
	'homeScreenTreatment',
	input.userId,
	{
		control:  {value: {backgroundColor: 'black'}},
		grayBg:   {percent: 5, value: {backgroundColor: 'gray'}},
		grayBold: {percent: 5, value: {backgroundColor: 'gray', bold: true}}
	}
);
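The `Treatments.choose` helper isn’t defined anywhere in this post, so here is a guess at one way it could work, not the actual implementation. The assumptions: the experiment name and user ID are hashed together so a user always lands in the same bucket, treatments with a `percent` claim that share of 100 buckets in declaration order, and a treatment without a `percent` (like `control`) takes whatever is left. The hash is an FNV-1a-style function over UTF-16 code units.

```javascript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function hashString(str) {
	var hash = 0x811c9dc5;
	for (var i = 0; i < str.length; i++) {
		hash ^= str.charCodeAt(i);
		hash = Math.imul(hash, 0x01000193);
	}
	return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical stand-in for the helper used in the example rule.
var Treatments = {
	choose: function (experimentName, userId, treatments) {
		// Including the experiment name keeps buckets independent across experiments.
		var bucket = hashString(experimentName + ':' + userId) % 100; // 0..99
		var names = Object.keys(treatments);
		var upperBound = 0;
		var fallback = null;
		for (var i = 0; i < names.length; i++) {
			var treatment = treatments[names[i]];
			if (typeof treatment.percent !== 'number') {
				// No percent given: this treatment gets the remainder.
				if (fallback === null) { fallback = treatment; }
				continue;
			}
			upperBound += treatment.percent;
			if (bucket < upperBound) { return treatment.value; }
		}
		return fallback ? fallback.value : undefined;
	}
};

var choice = Treatments.choose('homeScreenTreatment', 'user-1234', {
	control:  {value: {backgroundColor: 'black'}},
	grayBg:   {percent: 5, value: {backgroundColor: 'gray'}},
	grayBold: {percent: 5, value: {backgroundColor: 'gray', bold: true}}
});
console.log(choice);
```

Because the choice depends only on the inputs, the service doesn’t need to store who is enrolled in what; asking again gives the same answer.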

I’ll talk a bit more about the experiment side next time.
