## DogDogFish

### Data Science, amongst other things.

#### Category: AB Testing

Hi all,

I’m on a train to the Fringe festival and I’ve managed to bag myself first class tickets! What this means is that I’ve got about 3 hours to kill, unlimited free wine (who knew that was thing!?) and a fairly flaky internet connection. All of that goes together to make now the perfect time to share with you a bit of work I’ve been up to recently.

Firstly, massive props to this particular book: Probabilistic Programming & Bayesian Methods for Hackers

It’s really really good and this testing is just a minor reworking of one of the examples in the book – they say imitation is the sincerest form of flattery no?

Anyway, enough of the wine-induced babbling, on with the Bayesian testing…

The setting of the scene

What I’ll be looking to do here is identify changes in conversion rate (can be any kind of conversion, I’ll use orders/visitors but it doesn’t really matter) in historical data. The reason I say historical data is thusly: this is a good technique for identifying a visualizing changes but it isn’t as good as running a legitimate A/B test. Ideally, we’d implement the change (whatever it is) for one group of customers and not for the rest of the customers and measure relative performance. However, let’s assume that, for whatever reason, you’re looking over a bunch of (conversion) data trying to identify a change.

I’ll artificially generate data so you can run similar examples and get an idea of what our data source looks like:

```#!/usr/bin/python

import random
import numpy as np

total_points = 50

## Generate a list containing the number of trials
trials = [random.randint(20,100) for _ in range(total_points)]
results = [np.random.binomial(value, 0.4) if total_point/2 else np.random.binomial(value, 0.3) for index, value in enumerate(trials)]

for trial, result in zip(trials, results):
print "%dt%d" % (trial, result)
```

where in our example, trials is going to be the number of visitors on successive days and results is going to be the number of orders.

Pipe that into a file (`python generate_conversions.py > conversion_data.txt `) and we’ve got ourself a nice list of conversion data with a change in the rate at some point (halfway in this example) through the data.

The Bayesian Bit

So the idea behind Bayesian statistics revolves around priors and posteriors – your prior is going to be a distribution that represents your (shocker) prior ideas about the result. You’re going to update this prior hypothesis with data as you get it and when you do so the resulting distribution is called the posterior. This is great for a number of reasons – my favourite two are:

1. You get to set a prior that influences the final outcome. If I’ve got fairly strong ideas about what the conversion rate is (say I think it very likely lies between 80% and 90%) I can reflect that in my prior. When we start any frequentist (non-Bayesian) calculations we assume every probability between 0% and 100% is equally likely – generally that’s not the case.
2. The final result is a distribution. Distributions are great for visualization, allow for easy comparisons against other distributions and are really easy to show uncertainty on. Do away with all the talk of p-values and show a graph with two ‘conversion distributions’ on and you’re on to a winner.

Shut up and Calculate

Code first – then commentary:

```#!/usr/bin/python

import pymc as pm
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import datetime

conversion_list = []

with open('conversion_data.txt', 'rb') as f:
for line in f:
conversion_list.append(int(conversions))

uniform_one_samples = []
uniform_two_samples = []
tau_samples = []
uniform_one = pm.Uniform('uniform_one', 0, 1)
uniform_two = pm.Uniform('uniform_two', 0, 1)

tau = pm.DiscreteUniform('tau', lower=0, upper=n_percent_list)

@pm.deterministic
def lambda_(tau = tau, uniform_one = uniform_one, uniform_two = uniform_two):
out = np.zeros(n_percent_list)
out[:tau] = uniform_one
out[tau:] = uniform_two
return out

observations = pm.Binomial('obs', n=basket_list, p=lambda_, value=conversion_list, observed=True)

model = pm.Model([observations, uniform_one, uniform_two, tau])
mcmc = pm.MCMC(model)
mcmc.sample(10000, 2500, 1)

uniform_one_samples = mcmc.trace('uniform_one')[:]
uniform_two_samples = mcmc.trace('uniform_two')[:]
tau_samples = mcmc.trace('tau')[:]

N = tau_samples.shape[0]

conversion_rate = np.zeros(n_percent_list)
for day in range(0, n_percent_list):
ix = day < tau_samples
conversion_rate[day] = (uniform_one_samples[ix].sum() + uniform_two_samples[~ix].sum()) / N

plt.subplot(411)
plt.tight_layout()

plt.plot(range(n_percent_list), 100.*conversion_rate, lw=4, color='#E24A33', label='Expected conversion rate')
plt.xlim([0, n_percent_list])
plt.ylim([0,100])
plt.xlabel('Day')
plt.title("Changes in the probability of conversion")
plt.ylabel('Expected conversion rate')
plt.legend(loc='upper right')

ax = plt.subplot(412)

#
plt.hist(uniform_one_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of initial conversion probability", color='#A60628', normed=True)
plt.legend(loc='upper left')

plt.xlim([min(uniform_one_samples),max(uniform_one_samples)])
plt.xlabel("Probability of Conversion")
#
ax = plt.subplot(413)
#
plt.hist(uniform_two_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of later conversion probability", color='#7A68A6', normed=True)
plt.legend(loc='upper left')
plt.xlim([min(uniform_two_samples),max(uniform_two_samples)])
plt.xlabel("Probability of Conversion")
#
#
plt.subplot(414)
w = 1.0/tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_percent_list, alpha=1, label="posterior of conversion change date", color='#467821', weights=w, rwidth=2.)
plt.xticks(np.arange(n_percent_list))
#
plt.legend(loc='upper left')
plt.xlim([0, n_percent_list])
plt.xlabel("Day of change")
plt.ylabel('P(change occurred)')
#

plt.show()
```

That’s a fair bit of code – what I’m doing isn’t that complicated. After importing, defining and grabbing the conversion data that we generated in the first script we declare `uniform_1` and `uniform_2`.

We’re going to say that our data ultimately comes from a Binomial distribution where people convert with probability p. However, we are going to say that at some time (given in this example as `tau`) the value of p changed. p is the conversion of a Binomial distribution and so saying that p changes at some point means we’re saying our conversion rate changed.

Firstly, I declare my priors – I say that `uniform_1` and `uniform_two`, or the conversion before and after the change we’re trying to identify, are both drawn from a uniform distribution that runs between 0 and 1. I’m saying I think the conversion is equally likely to everywhere between 0 and 100%. If you’re doing this properly I’m sure you can improve on this (I’d advise looking into the Beta distribution) but with sufficient data the uniform distribution should work fine.

So, at this point we’ve got the distributions from which our p-values are drawn described by uniform distributions – our first pair of priors.

Next, I declare `tau` – the time at which our conversion rate changed. I’m using a discrete uniform distribution and saying that it could have happened with equal probability at any time between the first day and the last day. If you’re looking to identify when the biggest change in conversion occurred I’d advise using a discrete uniform distribution. If you’re trying to identify the effect a change on a particular day had then you can be more creative with this prior.

Now we declare `_lambda`. For the eagle-eyes pseudo-coders amongst you, I’m sure it’s clear but this represents our complete prior belief about the conversion rate (p of a Binomial). We say we think the value of p follows `uniform_1` up until `tau` and then switches to `uniform_2`.

Then we let pymc take over – we specify our distribution is a binomial with as many trials as we generated in the first script, and with the probabilities given in `_lambda`. We tell pymc that we obtained the values we’re passing to it and created our distribution. I’m not going to go in what wizardy follows but check out chapter 3 of the book I mentioned at the start if you’re interested. What we end up with (when we grab the traces) is 10,000 numbers drawn from posterior distributions we’ve generated from `uniform_1, uniform_2` and `tau`.

The final calculation builds a vector of Booleans for each day of the data set specifying whether the day is less than each of the 10,000 samples drawn from the `tau` distribution, or greater than the `tau`. Then we go along that 10,000 element boolean vector and create a sum – if the day number was less than the `tau` we take whatever is in the `uniform_1` list at that point, otherwise taking what is in the `uniform_2` point. Finally dividing by the number of points we drew, we can get an idea of the expected conversion rate (expected value of `_lambda`).

I get that that section is a bit complicated but if you get it:
a.) I’ve not had enough wine
b.) You’ve got the idea behind the testing so far.

Finally, there’s a lot of Matplotlib plotting stuff all leading us towards the following beautiful picture:

Basic Bayesian Conversion Testing/Change Recognition

I hope that shows more clearly what I’ve been trying to explain. Firstly, the top graph shows our expected and our observed conversions. The bars show our ‘actual conversion rates’ on given days – these were the values we artificially generated. The red line is the output of our model discussed in the final point above. The second and third graph show the distributions of the conversion rate from before the change and after the change. If you’re looking to paint a compelling picture, I’d advise putting them both on the same axis, drawing them as densities (not histograms) and adding lines at the 95th percentiles. Finally, the bottom graph shows us when we change in conversion rate likely changed. As you can see, it’s fairly heavily concentrated around the middle (good) but there’s quite a wide range of days around there where it could be.

I’ve found that in practice, it makes sense to run these simulations multiple times (they don’t take very long) and concatenate the results together when working out when a change occurred. I’ll leave that as an exercise given how long this post already is – there’s  a picture of what it looks like at the bottom – the posteriors of the conversion probabilities get a bit messy.

So there we have it – it’s been a bit of a slog and the woman who brings the wine round the train has started laughing when she pours me a new glass (surely a troubling sign). However, we’re now able to run Bayesian conversion tests like a boss and visualize them in a really funky kind of way.

As always, comments are welcome, questions too.

The Lannisters send their regards.

p.s.

Picture when run with multiple iterations…

Mutliple Iterations of Bayesian Conversion Testing.

Hi all,

A reasonably big part of the job I do involves running AB tests and I think that’s usually something that falls under the remit of data scientists. Now I could go on and on about the difficulties around tagging tests, choosing test groups, choosing goals and the like. I won’t, but I’ll make at least one point. All of the aforementioned points are very important and are very non-trivial. If you don’t absolutely understand all of the steps a user might take, and how your test will handle these, work a bit harder on that. And outsourcing this problem to your AB testing framework provider…well, I’d not advise it.

Anyway, I recently read this blog post by the engineers over at Lyst and thought it was a really interesting idea. I’m really not a fan of p-values and think that a probability distribution is a much nicer way of communicating a result. With that in mind, the basic logic behind Lyst’s post (if you’ve not got the time/inclination to read it):

1.) You presumably have a reasonable idea about the distribution of whatever metric you’re plotting – for instance, you’re reasonably confident conversion % lies somewhere between 0 and 25% with the most likely value to be around 15% (as an example). You assume both test groups follow this distribution until you start getting data in to corroborate/contradict that.

Beta Distribution with alpha parameter 8 and beta parameter 40.

2.) On getting data in, you update your distribution to take into account these new data points – the distributions evolve with the new data.
3.) Your end result (see below) gives you a probability distribution for the conversion of both test groups.

The distribution of conversion for two (randomly generated) test datasets

I’d argue the above picture is much clearer than even the best explained set of p-values. It also really clearly lends itself to calculations along the lines of ‘What is the probability that test group A is better than test group B?’ or ‘how sure are you that test group A is 2% better than test group B?’

Enamoured with this new way of reporting test results, I figured I mayerswell build something to do so. Instead of writing something where the test set-up is tied in with the result plotter I wrote my plotter to take input from stdin.

First things first then, I want something to generate a stream of conversion events:

```import random
import time

for _ in range(2000):
group = 'test' if random.random() > 0.3 else 'control'
if group == 'test':
convert = 1 if random.random() < 0.16 else 0
else:
convert = 1 if random.random() < 0.10 else 0
print '%s:%d' % (group, convert)
```

Easy does it – we’ll look at varying those numbers later. For the uninitiated, that’ll give us fairly uneven test groups with reasonably different conversion percentages.

Now for the plotter. I’ll confess, this is still a work in progress. It currently doesn’t assume anything about the test groups, including the names (taking those from the input stream). However, in future I’m hoping to extend the program to be able to perform multivariate Bayesian AB tests. If it appears messy in places, that’s either because I’m expecting the poor coding practices to lead to me having an easier time extending the code to allow multivariate testing, or because I’m a messy coder.

At this point, massive props to this book: it forms the basis of almost all of this code.

```import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
import sys

results_dictionary = {}

## Store all our test results in memory - doesn't allow real-time updating and could get a bit serious if we've got a big set of results
for line in sys.stdin:
if line == '':
break
group, conversion = line.strip().split(':')
try:
results_dictionary[group].append(int(conversion))
except:
results_dictionary[group] = [int(conversion)]

test_group_a, test_group_b = results_dictionary.keys()

## We'll run this twice, once with uniform prior and once with a beta prior
prior_dict = dict((group, pm.Uniform(group, 0, 1)) for group in results_dictionary.keys())
prior_dict_beta = dict((group, pm.Beta(group, 3, 50)) for group in results_dictionary.keys())

@pm.deterministic
def delta(p_A = prior_dict[test_group_a], p_B = prior_dict[test_group_b]):
return p_A - p_B

@pm.deterministic
def beta_delta(p_A = prior_dict_beta[test_group_a], p_B = prior_dict_beta[test_group_b]):
return p_A - p_B

## Bernoulli distribution with the events we've got
observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict[group], value=events, observed=True)) for group, events in results_dictionary.items())
beta_observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict_beta[group], value=events, observed=True)) for group, events in results_dictionary.items())

## Markov chain Monte-Carlo methods - returning samples from our updated distributions
mcmc = pm.MCMC([prior_dict[test_group_a], prior_dict[test_group_b], delta, observations[test_group_a], observations[test_group_b]])
mcmc_beta = pm.MCMC([prior_dict_beta[test_group_a], prior_dict_beta[test_group_b], beta_delta, observations[test_group_a], observations[test_group_b]])
mcmc.sample(20000,1000)
mcmc_beta.sample(20000,1000)

## Grab all the samples we need
samples = dict((key, mcmc.trace(key)[:]) for key in results_dictionary.keys())
delta_samples = mcmc.trace('delta')[:]
beta_samples = dict((key, mcmc_beta.trace(key)[:]) for key in results_dictionary.keys())
beta_delta_samples = mcmc_beta.trace('beta_delta')[:]

## It's this easy to work out probabilities of success
prob_a_better = (delta_samples < 0).mean()
prob_a_better_beta = (beta_delta_samples < 0).mean()

### Plotting
ax = plt.subplot(321)
plt.hist(samples[test_group_a], histtype='stepfilled', bins=50, alpha=0.85, label='Uniform posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.suptitle('Posterior distributions of %s, %s, and \$Delta\$ unknowns' % (test_group_a, test_group_b))
plt.title('Uniform posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(323)
plt.hist(samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Uniform posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(325)
plt.hist(delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of \$Delta\$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.title('Uniform posterior of \$Delta\$')
plt.autoscale()
plt.annotate('Probability %s nis greater nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better), (0,30))
#
ax = plt.subplot(322)
plt.hist(beta_samples[test_group_a], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(324)
plt.hist(beta_samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(326)
plt.hist(beta_delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of \$Delta\$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.autoscale()
plt.annotate('Probability %s nis greater nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better_beta), (0,30))
plt.title('Beta posterior of \$Delta\$')
#
plt.tight_layout()
plt.show()
```

Giving us:

Graphs comparing the conversion of one test group against another, using a Beta distribution and a Uniform distribution as a prior.

First things first, why are there 6 graphs? Well, realistically, given that this is designed to model website conversion, I know what the distribution is likely to look like. Therefore, I say that my initial priors are beta distributions with parameters alpha = 10 and beta = 30. However, I’m well aware that naysayers might quibble with the idea of making such strong assumptions before running the test. Alongside that prior, I’ve included the completely uninformative uniform prior. That basically says that the conversion is equally likely to fall anywhere between 0 and 100%. Another reason for doing this is to show what difference it makes – when we’re looking at < 10 data points, you’ll see fairly big differences between the different prior assumptions. Increase the number of events up past 1000 and the two prior assumptions converge to the same value.

Additionally, we’ve fed in information about the test and control group – where has this delta come from and what’s it all about? That’s simply the difference between the test groups, as a probability distribution. How good is that? Really, that’s what people are interested in and, instead of making them compare two distributions and look at overlaps, we’ve done that for them and presented it as a probability distribution. Well done us.

Move along there.