Statistics – DogDogFish

Hi all,

A reasonably big part of the job I do involves running AB tests and I think that’s usually something that falls under the remit of data scientists. Now I could go on and on about the difficulties around tagging tests, choosing test groups, choosing goals and the like. I won’t, but I’ll make at least one point. All of the aforementioned points are very important and are very non-trivial. If you don’t absolutely understand all of the steps a user might take, and how your test will handle these, work a bit harder on that. And outsourcing this problem to your AB testing framework provider…well, I’d not advise it.

Anyway, I recently read this blog post by the engineers over at Lyst and thought it was a really interesting idea. I’m really not a fan of p-values and think that a probability distribution is a much nicer way of communicating a result. With that in mind, the basic logic behind Lyst’s post (if you’ve not got the time/inclination to read it):

1.) You presumably have a reasonable idea about the distribution of whatever metric you’re plotting – for instance, you’re reasonably confident conversion % lies somewhere between 0 and 25% with the most likely value to be around 15% (as an example). You assume both test groups follow this distribution until you start getting data in to corroborate/contradict that.

Beta Distribution with alpha parameter 8 and beta parameter 40.

2.) On getting data in, you update your distribution to take into account these new data points – the distributions evolve with the new data.
3.) Your end result (see below) gives you a probability distribution for the conversion of both test groups.

The distribution of conversion for two (randomly generated) test datasets

I’d argue the above picture is much clearer than even the best explained set of p-values. It also really clearly lends itself to calculations along the lines of ‘What is the probability that test group A is better than test group B?’ or ‘how sure are you that test group A is 2% better than test group B?’

Enamoured with this new way of reporting test results, I figured I mayerswell build something to do so. Instead of writing something where the test set-up is tied in with the result plotter I wrote my plotter to take input from stdin.

First things first then, I want something to generate a stream of conversion events:

import random
import time

for _ in range(2000):
    group = 'test' if random.random() > 0.3 else 'control'
    if group == 'test':
        convert = 1 if random.random() < 0.16 else 0
    else:
        convert = 1 if random.random() < 0.10 else 0
    print '%s:%d' % (group, convert)

Easy does it – we’ll look at varying those numbers later. For the uninitiated, that’ll give us fairly uneven test groups with reasonably different conversion percentages.

Now for the plotter. I’ll confess, this is still a work in progress. It currently doesn’t assume anything about the test groups, including the names (taking those from the input stream). However, in future I’m hoping to extend the program to be able to perform multivariate Bayesian AB tests. If it appears messy in places, that’s either because I’m expecting the poor coding practices to lead to me having an easier time extending the code to allow multivariate testing, or because I’m a messy coder.

At this point, massive props to this book: it forms the basis of almost all of this code.

import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
import sys

results_dictionary = {}

## Store all our test results in memory - doesn't allow real-time updating and could get a bit serious if we've got a big set of results
for line in sys.stdin:
    if line == '':
        break
    group, conversion = line.strip().split(':')
    try:
        results_dictionary[group].append(int(conversion))
    except:
        results_dictionary[group] = [int(conversion)]

test_group_a, test_group_b = results_dictionary.keys()

## We'll run this twice, once with uniform prior and once with a beta prior
prior_dict = dict((group, pm.Uniform(group, 0, 1)) for group in results_dictionary.keys())
prior_dict_beta = dict((group, pm.Beta(group, 3, 50)) for group in results_dictionary.keys())

@pm.deterministic
def delta(p_A = prior_dict[test_group_a], p_B = prior_dict[test_group_b]):
    return p_A - p_B

@pm.deterministic
def beta_delta(p_A = prior_dict_beta[test_group_a], p_B = prior_dict_beta[test_group_b]):
    return p_A - p_B

## Bernoulli distribution with the events we've got
observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict[group], value=events, observed=True)) for group, events in results_dictionary.items())
beta_observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict_beta[group], value=events, observed=True)) for group, events in results_dictionary.items())

## Markov chain Monte-Carlo methods - returning samples from our updated distributions
mcmc = pm.MCMC([prior_dict[test_group_a], prior_dict[test_group_b], delta, observations[test_group_a], observations[test_group_b]])
mcmc_beta = pm.MCMC([prior_dict_beta[test_group_a], prior_dict_beta[test_group_b], beta_delta, observations[test_group_a], observations[test_group_b]])
mcmc.sample(20000,1000)
mcmc_beta.sample(20000,1000)

## Grab all the samples we need
samples = dict((key, mcmc.trace(key)[:]) for key in results_dictionary.keys())
delta_samples = mcmc.trace('delta')[:]
beta_samples = dict((key, mcmc_beta.trace(key)[:]) for key in results_dictionary.keys())
beta_delta_samples = mcmc_beta.trace('beta_delta')[:]

## It's this easy to work out probabilities of success
prob_a_better = (delta_samples < 0).mean()
prob_a_better_beta = (beta_delta_samples < 0).mean()

### Plotting
ax = plt.subplot(321)
plt.hist(samples[test_group_a], histtype='stepfilled', bins=50, alpha=0.85, label='Uniform posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.suptitle('Posterior distributions of %s, %s, and $Delta$ unknowns' % (test_group_a, test_group_b))
plt.title('Uniform posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(323)
plt.hist(samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Uniform posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(325)
plt.hist(delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of $Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.title('Uniform posterior of $Delta$')
plt.autoscale()
plt.annotate('Probability %s nis greater nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better), (0,30))
#
ax = plt.subplot(322)
plt.hist(beta_samples[test_group_a], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(324)
plt.hist(beta_samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(326)
plt.hist(beta_delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of $Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.autoscale()
plt.annotate('Probability %s nis greater nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better_beta), (0,30))
plt.title('Beta posterior of $Delta$')
#
plt.tight_layout()
plt.show()

Giving us:

Graphs comparing the conversion of one test group against another, using a Beta distribution and a Uniform distribution as a prior.

First things first, why are there 6 graphs? Well, realistically, given that this is designed to model website conversion, I know what the distribution is likely to look like. Therefore, I say that my initial priors are beta distributions with parameters alpha = 10 and beta = 30. However, I’m well aware that naysayers might quibble with the idea of making such strong assumptions before running the test. Alongside that prior, I’ve included the completely uninformative uniform prior. That basically says that the conversion is equally likely to fall anywhere between 0 and 100%. Another reason for doing this is to show what difference it makes – when we’re looking at < 10 data points, you’ll see fairly big differences between the different prior assumptions. Increase the number of events up past 1000 and the two prior assumptions converge to the same value.

Additionally, we’ve fed in information about the test and control group – where has this delta come from and what’s it all about? That’s simply the difference between the test groups, as a probability distribution. How good is that? Really, that’s what people are interested in and, instead of making them compare two distributions and look at overlaps, we’ve done that for them and presented it as a probability distribution. Well done us.

Move along there.

So we’ve got our data set, we’ve had a cursory investigation and now we’re ready to see if we can find anything interesting. I’m going to proceed in a fairly methodical way and be precise in the way I do things – let’s do this scientific like.

So let’s start with a null hypothesis: “The month alone has no impact on the average selling price of houses“.

At this point, I don’t really know whether or not that’s true but it seems likely that it’s not. I can imagine that the housing market is, to some extent, cyclical. The first thing I’ll do is plot the data – there are a number of reasons why this is a good idea and I’d advise plotting your data wherever possible as a first step.

I’ll run this example investigation in R – it’s great for exploratory analysis and it allows me to produce graphics that I can share really easily.

housingData <- read.csv('pp-2013.csv', header=TRUE)
housingData$date <- as.Date(housingData$date, "%Y-%m-%d %H:%M")
# We're only interested in the month at this point
housingData$month <- strftime(housingData$date, "%m")
housingDataSummary <- data.frame(aggregate(housingData$price, by=list(housingData$month), FUN=mean)
colnames(housingDataSummary) <- c("Month", "Price")
# It's nice to have a look at the data before we perform our test on it, just to get an idea of how it looks and to check we think what we've done up to this point is reasonable.
# Let's take advantage of a very commonly used R library - ggplot2
library(ggplot2)
ggplot(housingDataSummary, aes(x=housingDataSummary$Month, y=housingDataSummary$Price, fill=housingDataSummary$Price)) + geom_bar(stat="identity", width=0.4, position=position_dodge(width=0.5)) + guides(fill=FALSE) + xlab("Month") + ylab("Average price") + ggtitle("Average GB house sale price in 2013")+ scale_y_continuous(labels=comma) + coord_fixed(ratio=0.000035)

Giving us:

Average house price by month in the UK in 2013

O.K – I don’t know about you but looking at the graph I’d say it looks like there might be something to the theory that the month is important in house sales. In this year, we can see a dip in the early months and then a peak when we get to summer.

Let’s shore up the mathematics behind this – I’m going to imagine the situation where I’ve got twelve test groups (one for each month) where my values are the sold house prices. There’s a whole lot of statistical tests designed for this situation. To compare the means of these groups with each other I’d perform a one-way ANOVA (analysis of variance) – a multivariate extension of the t-test. Technically, the assumptions made in performing a one-way are independence of measurements (I’m happy that the sale price of one house is independent of the sale of another), continuity of the dependent variable (house price is continuous) and the dependent variables come from a normal distribution. A simple density plot shows us that the house prices aren’t normally distributed:

2013 house price density plot

However, fear not. The t-test is still a good choice of test under violations of normality, especially so when there are lots of data points (we’ve got over 700,000) . As a little check, let’s also have a go at comparing the medians of the test groups. To do this we can use the Mann-Whitney U test and its multivariate brother, the Kruskal-Willis test. These are non-parametric tests (don’t require normally distributed data) and so if these say the medians of the groups are significantly different and the one-way ANOVA has shown the means of the groups are significantly different, we can be fairly confident they are!

month_aov <- aov(price ~ month, data=housingData)
summary(month_aov)
print(model.tables(month_aov, "means"), digits=2)
kruskal.test(price ~ as.factor(month), data=housingData)

Giving us:

Df Sum Sq Mean Sq F value Pr(>F) month 11 4.463e+13 4.057e+12 46.38 <2e-16 *** Residuals 781167 6.834e+16 8.748e+10 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Tables of means Grand mean


246834.8

month 01 02 03 04 05 06 07 08 09 10 11 12 245200 234879 235136 242679 237509 243942 257421 256289 256988 249722 245077 246392 rep 43404 45238 54692 51025 66357 66175 73526 79119 69314 76042 80986 75301
Kruskal-Wallis rank sum test

data: price by as.factor(month) Kruskal-Wallis chi-squared = 1642.505, df = 11, p-value < 2.2e-16
There’s an awful lot of that that we’re not interested in – the key bits for us to pick out are the p-values on the ANOVA and the Kruskal-Wallis tests – both of which are < 2e-16. We can fairly conclusively say that there are statistically significant differences between the mean and median of the average house prices by month.

However, let’s have a think what assumptions we’re making and whether we’re comfortable making them. Firstly, a previous investigation has hinted that the region the house is in makes a difference to the sale price. This could impact our data in any number of ways – it could be that each region sells their houses at different times and as average house value per region is different then the average house value per month ends up being different. Obviously, region isn’t the only concern – I singled that out as I’ve previously looked at it and know it’s a contributing factor to house price.

Additionally, we’re only looking in one year. That means that any overall change in house prices (ala the property market crash) will completely throw our results off and give us the (possibly false) impression of a cyclical housing market.

There’s an awful lot more we’ll have to consider if we’re to answer our question satisfactorily: does the month impact the selling price of a house? In the next post in this series I’ll be looking at ways of unpicking dependencies (MANOVA and MANCOVA) and will likely have to do bits of the analysis in a distributed way as there’s no way the entire dataset (since 1995) is going to fit in my RAM!

Tag: Statistics

Bayesian AB Testing using Python

Monthly House Price Variation – an adventure in R

Bloggers

Recent Posts

Categories