Human experimentation for fun and profit

I want to experiment on my users. How do I do it?

Yesterday I talked about creating a configuration service. We’re going to leverage that. An experiment is just a configuration rule that’s sharded among your userbase.

But is it that simple? Usually not. Let’s dive in!

Choosing a treatment

Iacta alea est

The easiest way to go is to just toss the dice.

You define your treatments and their percentages and roll 1d100. The user gets into whatever treatment corresponds to the value on the die. For instance:

function getTreatment(treatments, control) {
	var value = Math.random() * 100;
	for (var i = 0; i < treatments.length; i++) {
		value -= treatments[i].percent;
		if (value < 0) {
			return treatments[i].value;
		}
	}
	return control;
}

What's this good for? Things where you're okay with changing behavior between requests. Things where your users don't need consistency. Probably where your users won't notice a lot. Like Google's 41 shades of blue.
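To make the inconsistency concrete, here is a small sketch of the same dice roll with the random source passed in. The name `rollTreatment`, the `random` parameter, and the shade names are mine, added only so the behavior can be shown deterministically.

```javascript
// Same dice roll as above, except the random source is a parameter.
// `rollTreatment` and `random` are illustrative names, not part of the API above.
function rollTreatment(treatments, control, random) {
	var value = (random || Math.random)() * 100;
	for (var i = 0; i < treatments.length; i++) {
		value -= treatments[i].percent;
		if (value < 0) {
			return treatments[i].value;
		}
	}
	return control;
}

// Made-up treatments: 10% each, everyone else gets the control.
var shades = [
	{value: 'light blue', percent: 10},
	{value: 'dark blue', percent: 10}
];

// Two requests from the same user, two different rolls, two different answers.
var first = rollTreatment(shades, 'default blue', function () { return 0.05; });
var second = rollTreatment(shades, 'default blue', function () { return 0.95; });
```

Nothing ties a roll to a user, so the only thing you control is the overall percentage.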

Introduce a discriminator

So you determined you want each user to have a consistent experience. Once they enter an experiment, they're in it until the experiment finishes. How do we do that?

The simplest way is to introduce a pivot value, something unique to the user:

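function toHash(str) {
	var hash = 1;
	for (var i = 0; i < str.length; i++) {
		// >>> 0 keeps the hash an unsigned 32-bit integer; without it the
		// value outgrows exact double precision on longer strings.
		hash = (hash * 33 + str.charCodeAt(i)) >>> 0;
	}
	return hash;
}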

function getTreatment(pivot, treatments, control) {
	var value = pivot % 100;
	for (var i = 0; i < treatments.length; i++) {
		value -= treatments[i].percent;
		if (value < 0) {
			return treatments[i].value;
		}
	}
	return control;
}

config.treatment = getTreatment(toHash(user.email), treatments, control);

What's great about this? It's simple. That's pretty much it.

What's terrible about it? The same users get the first treatment in every experiment. If you want to roll out updates to 1% of your users at a time, the same person always gets the least tested, bleeding edge stuff every time. That's not so nice, and it opens you up to luck effects much more.
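Here is a rough sketch of that problem. The `earlyAdopters` helper and the synthetic email addresses are made up for the demonstration, and the hash is wrapped to 32 bits with `>>> 0`, which is my assumption about how you'd want to bound it.

```javascript
// Hash from above, with `>>> 0` (an assumption on my part) so the value
// stays an unsigned 32-bit integer.
function toHash(str) {
	var hash = 1;
	for (var i = 0; i < str.length; i++) {
		hash = (hash * 33 + str.charCodeAt(i)) >>> 0;
	}
	return hash;
}

// Which users fall into the first `percent` percent of buckets?
// The experiment is not an input at all, which is exactly the problem.
function earlyAdopters(emails, percent) {
	return emails.filter(function (email) {
		return toHash(email) % 100 < percent;
	});
}

// Synthetic users, purely for illustration.
var emails = [];
for (var n = 0; n < 1000; n++) {
	emails.push('user' + n + '@example.com');
}

// Two unrelated rollouts pick exactly the same people.
var voiceSearchWave = earlyAdopters(emails, 5);
var nightModeWave = earlyAdopters(emails, 5);
```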

The victorious solution

Quite simple: instead of basing your pivot only on the user, you base it on the user and the experiment. For instance:

var experiment = 'home screen titlebar style - 2016-06-12';
var pivot = toHash(user.email + experiment);
config.treatment = getTreatment(pivot, treatments, control);

This effectively randomizes your position between experiments but keeps it consistent for each experiment. We'll have to adjust the API to make it easier and more obvious how to do the right thing:

function getTreatment(userId, experimentId, treatments, control) { ... }
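One possible body for that signature, as a minimal sketch. It assumes the array-of-treatments shape from earlier (each entry has `value` and `percent`) and a 32-bit variant of `toHash`. The treatment values in the example are invented.

```javascript
// Hash from above, bounded to 32 bits with `>>> 0` (my assumption).
function toHash(str) {
	var hash = 1;
	for (var i = 0; i < str.length; i++) {
		hash = (hash * 33 + str.charCodeAt(i)) >>> 0;
	}
	return hash;
}

// A sketch, not a definitive implementation: the pivot is derived inside the
// function, so callers cannot forget to mix in the experiment.
function getTreatment(userId, experimentId, treatments, control) {
	var value = toHash(userId + experimentId) % 100;
	for (var i = 0; i < treatments.length; i++) {
		value -= treatments[i].percent;
		if (value < 0) {
			return treatments[i].value;
		}
	}
	return control;
}

// Invented treatments for illustration.
var titleTreatments = [
	{value: 'large title', percent: 50},
	{value: 'bold title', percent: 50}
];
var experimentName = 'home screen titlebar style - 2016-06-12';
var firstCall = getTreatment('alice@example.com', experimentName, titleTreatments, 'plain title');
var secondCall = getTreatment('alice@example.com', experimentName, titleTreatments, 'plain title');
// firstCall and secondCall are always equal: same user, same experiment.
```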

Dependencies

You will often have several simultaneous experiments. Sometimes you'll need a person to be enrolled in one specific experimental treatment for another experiment to even make sense. How do we do this?

First off, we'll adjust our treatment API so that, instead of an array of treatments, you send a JS object:

var homeScreenTreatments = {
	control: {value: {bgColor: 'black', fontSize: 10, bold: true}},
	t1: {value: {bgColor: 'black', fontSize: 12, bold: false}},
	t2: {value: {bgColor: 'gray', fontSize: 10, bold: true}}
};

Next, we'll stash our treatment decisions in the framework (in a new cache for each script run). Then we'll let you query that later. For instance:

var homeScreenExp = 'home screen titlebar style';
config.homeScreen = getTreatment(
	user.email, homeScreenExp, homeScreenTreatments);
// 50 lines later...
if (hasTreatment(homeScreenExp, 't2')) {
	config.fullNightModeEnabled = false;
}
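A minimal sketch of how the framework side of that could work. The `decisions` cache, the rule that a treatment without a `percent` gets 0%, and the percentages in the example are all my assumptions. The article only fixes the call shapes of `getTreatment` and `hasTreatment`.

```javascript
// Hash from earlier, bounded to 32 bits with `>>> 0` (my assumption).
function toHash(str) {
	var hash = 1;
	for (var i = 0; i < str.length; i++) {
		hash = (hash * 33 + str.charCodeAt(i)) >>> 0;
	}
	return hash;
}

// Treatment decisions for the current script run, keyed by experiment name.
// Assumed to be recreated for every run.
var decisions = {};

function getTreatment(userId, experimentId, treatments) {
	var value = toHash(userId + experimentId) % 100;
	var chosen = 'control';
	// Object.keys follows insertion order for names like 't1' and 't2'.
	var names = Object.keys(treatments);
	for (var i = 0; i < names.length; i++) {
		if (names[i] !== 'control') {
			// Assumption: a treatment without a percent gets 0%.
			value -= treatments[names[i]].percent || 0;
			if (value < 0) {
				chosen = names[i];
				break;
			}
		}
	}
	decisions[experimentId] = chosen;
	return treatments[chosen].value;
}

function hasTreatment(experimentId, treatmentName) {
	return decisions[experimentId] === treatmentName;
}

// Example with made-up percentages.
var exampleConfig = {};
exampleConfig.homeScreen = getTreatment('alice@example.com', 'home screen titlebar style', {
	control: {value: {bgColor: 'black', fontSize: 10, bold: true}},
	t1: {value: {bgColor: 'black', fontSize: 12, bold: false}, percent: 25},
	t2: {value: {bgColor: 'gray', fontSize: 10, bold: true}, percent: 25}
});
```

Asking about an experiment that hasn't run yet simply returns false, so rule order matters.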

We could alternatively bake experiments into the rule infrastructure, for instance by letting a rule specify the config section it supplies, its treatments, and their percentages. That tends to end up as a complex UI that does 90% of what users need in an inflexible way, which is going to be troublesome.

However, what we want to do is store a collection of experimental treatments on the config object. We'll get into that later, but it looks like:

config.experiments = {
	'home screen titlebar style': 't2',
	'wake up message': 't5'
};

Incremental releases

Another common thing people want to do is roll out new features gradually. Sometimes I want to roll it out to fixed percentages of my users at fixed times. One option is to introduce a "rule series", which is a collection of rules, each with a start and end date. No two rules are allowed to overlap.

So I set up a rule series "roll-out-voice-search" with a simple set of rules:

// in the UI, I set this rule to be effective 2016-06-10 to 2016-06-15
config.voiceSearchEnabled = getTreatment(
	user.email,
	'roll-out-voice-search',
	{
		control: {value: false},
		enabled: {value: true, percent: 1}
	});

And I make a couple more rules, for 10%, 50%, and 100%, effective in adjacent date ranges.
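As a sketch, a rule series could be modeled like this. The in-memory shape, the helper names, the exclusive end dates, and every date range after the first are invented for the example.

```javascript
// Hypothetical shape for a rule series. Each rule is effective from `start`
// (inclusive) to `end` (exclusive). Only the first range comes from the text.
var rollOutVoiceSearch = [
	{start: '2016-06-10', end: '2016-06-15', percent: 1},
	{start: '2016-06-15', end: '2016-06-20', percent: 10},
	{start: '2016-06-20', end: '2016-06-25', percent: 50},
	{start: '2016-06-25', end: '2016-07-01', percent: 100}
];

// ISO dates (YYYY-MM-DD) compare correctly as plain strings.
function activeRule(series, today) {
	for (var i = 0; i < series.length; i++) {
		if (series[i].start <= today && today < series[i].end) {
			return series[i];
		}
	}
	return null; // outside every date range
}

// Enforces the "no two rules may overlap" constraint.
function hasOverlap(series) {
	var sorted = series.slice().sort(function (a, b) {
		return a.start < b.start ? -1 : (a.start > b.start ? 1 : 0);
	});
	for (var i = 1; i < sorted.length; i++) {
		if (sorted[i].start < sorted[i - 1].end) {
			return true;
		}
	}
	return false;
}
```

The active rule's percentage then feeds the `enabled` treatment for that date.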

But this is a common pattern. So we can simplify it:

config.voiceSearchEnabled = gradualRollout({
	user: user.email,
	rollout: 'roll-out-voice-search',
	start: '2016-06-10',
	finish: '2016-06-25',
	enabled: {value: true},
	disabled: {value: false}
});

And we can very easily interpret that as a linear rollout over the course of fifteen days, with each user's place in line determined by their email address.
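A minimal sketch of what `gradualRollout` might do internally, assuming a 32-bit variant of `toHash` and an extra, hypothetical `now` option so the clock can be pinned when testing.

```javascript
// Hash from earlier, bounded to 32 bits with `>>> 0` (my assumption).
function toHash(str) {
	var hash = 1;
	for (var i = 0; i < str.length; i++) {
		hash = (hash * 33 + str.charCodeAt(i)) >>> 0;
	}
	return hash;
}

// A sketch, not a definitive implementation. `options.now` is a hypothetical
// addition; when omitted, the current time is used.
function gradualRollout(options) {
	// Date-only ISO strings are parsed as midnight UTC.
	var start = Date.parse(options.start);
	var finish = Date.parse(options.finish);
	var now = options.now === undefined ? Date.now() : Date.parse(options.now);
	// Fraction of the rollout window that has elapsed, clamped to [0, 1].
	var elapsed = Math.min(1, Math.max(0, (now - start) / (finish - start)));
	// The bucket is fixed per user and rollout, so once a user is enabled
	// they stay enabled as the threshold rises.
	var bucket = toHash(options.user + options.rollout) % 100;
	return bucket < elapsed * 100 ? options.enabled.value : options.disabled.value;
}

var onDayOne = gradualRollout({
	user: 'alice@example.com',
	rollout: 'roll-out-voice-search',
	start: '2016-06-10',
	finish: '2016-06-25',
	enabled: {value: true},
	disabled: {value: false},
	now: '2016-06-10'
});
// onDayOne is false: at the very start of the window 0% are enabled.
```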

Metrics

You don't just assign experiment treatments to people and forget about them. You want to track things. And that means the client needs to know what it was assigned. The entire configuration would tell it, but the entire configuration is sometimes unwieldy to work with. So you want to see experimental treatments directly, by name, not as a bunch of configuration values that you have to trace back to a treatment.

Separately, you need a system to record client events, and you submit the experiment treatments to it as tags. Then you can correlate treatments to behavior.
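A rough sketch of the tagging side. `tagEvent`, `countByTreatment`, and the event shape are hypothetical, since the event recording system isn't specified here. The only thing taken from above is that treatments live by name under `config.experiments`.

```javascript
// Build an event record tagged with the treatments the user is in.
// The event shape is an assumption for illustration.
function tagEvent(name, config) {
	var tags = {};
	var experiments = config.experiments || {};
	Object.keys(experiments).forEach(function (experiment) {
		tags[experiment] = experiments[experiment];
	});
	return {name: name, tags: tags};
}

// Count how often an event occurred under each treatment of one experiment.
function countByTreatment(events, name, experiment) {
	var counts = {};
	events.forEach(function (event) {
		var treatment = event.tags[experiment];
		if (event.name !== name || treatment === undefined) {
			return;
		}
		counts[treatment] = (counts[treatment] || 0) + 1;
	});
	return counts;
}

var titlebar = 'home screen titlebar style';
var events = [
	tagEvent('search', {experiments: {'home screen titlebar style': 't2'}}),
	tagEvent('search', {experiments: {'home screen titlebar style': 't1'}}),
	tagEvent('search', {experiments: {'home screen titlebar style': 't2'}}),
	tagEvent('logout', {experiments: {'home screen titlebar style': 't2'}})
];
var searches = countByTreatment(events, 'search', titlebar);
// searches.t2 is 2 and searches.t1 is 1
```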

Speed

One complaint you might have is that this approach always fires every rule in sequence, and that's slow. The Rete algorithm is used in a wide variety of rule engines and is faster than naive reevaluation, so we should use that here, right?

Wrong. The Rete algorithm is complex and requires us to build up a large data structure. That data structure is used when a small portion of the input changes, letting me avoid recalculating the whole result.

In my case, I'm getting a series of new configurations, and each one is unrelated to the last. I might get a call for one collection of rules and then not get a call for it in the next hour. Or a rule might throw an error and leave the Rete data structure in an invalid state. Or I might have to abort processing, again leaving the data structure in an invalid state. There are no small incremental changes for Rete to exploit, so we'd pay for the data structure and get nothing back. Firing every rule in sequence is fine.
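For contrast, stateless evaluation is hard to get into a bad state. A sketch, assuming each rule is a function that takes the config being built and the user. That rule shape and the `evaluateRules` name are my assumptions for illustration.

```javascript
// Fires every rule in order against a fresh config. Nothing survives between
// calls, so a failed or aborted run cannot corrupt the next one.
function evaluateRules(rules, user) {
	var config = {};
	var errors = [];
	for (var i = 0; i < rules.length; i++) {
		try {
			rules[i](config, user);
		} catch (err) {
			// Record the failure and keep going with the remaining rules.
			errors.push({rule: i, message: String((err && err.message) || err)});
		}
	}
	return {config: config, errors: errors};
}

// Made-up rules for illustration; the second one fails.
var exampleRules = [
	function (config) { config.voiceSearchEnabled = false; },
	function () { throw new Error('boom'); },
	function (config, user) { config.greeting = 'Hello, ' + user.email; }
];
var result = evaluateRules(exampleRules, {email: 'alice@example.com'});
```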

Future directions

The main target here is to look at what people are doing and try to provide more convenient ways of doing it.

We also want to provide the ability to mark portions of metadata as private information, to be redacted from our logs.

IP geolocation would be handy, allowing us to tell people where the client is located rather than relying on the client to self-report. We can grab a country-level GeoIP database for $25/month, city-level for $100/month. This would be strictly opt-in, possibly with an additional fee.

Finally, we have to turn this into a proper service. Slap a REST API in front of it, add in HMAC authentication and API usage reporting, service health metrics, and load balancers.

That concludes our short on creating an experiment system.
