DogDogFish

Data Science, amongst other things.

Bayesian Testing of Conversion Rate

Hi all,

I’m on a train to the Fringe festival and I’ve managed to bag myself first class tickets! What this means is that I’ve got about 3 hours to kill, unlimited free wine (who knew that was a thing!?) and a fairly flaky internet connection. All of that goes together to make now the perfect time to share with you a bit of work I’ve been up to recently.

Firstly, massive props to this particular book: Probabilistic Programming & Bayesian Methods for Hackers

It’s really really good and this testing is just a minor reworking of one of the examples in the book – they say imitation is the sincerest form of flattery no?

Anyway, enough of the wine-induced babbling, on with the Bayesian testing…

The setting of the scene

What I’ll be looking to do here is identify changes in conversion rate (can be any kind of conversion, I’ll use orders/visitors but it doesn’t really matter) in historical data. The reason I say historical data is this: it’s a good technique for identifying and visualizing changes, but it isn’t as good as running a legitimate A/B test. Ideally, we’d implement the change (whatever it is) for one group of customers and not for the rest of the customers and measure relative performance. However, let’s assume that, for whatever reason, you’re looking over a bunch of (conversion) data trying to identify a change.

I’ll artificially generate data so you can run similar examples and get an idea of what our data source looks like:

#!/usr/bin/python

import random
import numpy as np

total_points = 50

## Generate a list containing the number of trials
trials = [random.randint(20,100) for _ in range(total_points)]
results = [np.random.binomial(value, 0.4) if index > total_points/2 else np.random.binomial(value, 0.3) for index, value in enumerate(trials)]

for trial, result in zip(trials, results):
    print "%d\t%d" % (trial, result)

where in our example, trials is going to be the number of visitors on successive days and results is going to be the number of orders.

Pipe that into a file (python generate_conversions.py > conversion_data.txt ) and we’ve got ourself a nice list of conversion data with a change in the rate at some point (halfway in this example) through the data.

The Bayesian Bit

So the idea behind Bayesian statistics revolves around priors and posteriors – your prior is going to be a distribution that represents your (shocker) prior ideas about the result. You’re going to update this prior hypothesis with data as you get it and when you do so the resulting distribution is called the posterior. This is great for a number of reasons – my favourite two are:

  1. You get to set a prior that influences the final outcome. If I’ve got fairly strong ideas about what the conversion rate is (say I think it very likely lies between 80% and 90%) I can reflect that in my prior – there’s a quick sketch of what that might look like just after this list. A frequentist (non-Bayesian) calculation gives you no way to build that belief in; effectively, every value between 0% and 100% gets treated the same.
  2. The final result is a distribution. Distributions are great for visualization, allow for easy comparisons against other distributions and are really easy to show uncertainty on. Do away with all the talk of p-values and show a graph with two ‘conversion distributions’ on and you’re on to a winner.
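
To make the first point concrete, here’s a minimal sketch (mine, not from the post or the book) of what such a strong prior might look like next to the ‘know nothing’ uniform prior – the Beta parameters are just illustrative numbers chosen to put most of the mass between 80% and 90%:

#!/usr/bin/python

## Illustrative only: a strong prior belief next to a flat one
from scipy.stats import beta, uniform

strong_prior = beta(170, 30)  # mean ~0.85, most of the mass between ~0.80 and ~0.90
flat_prior = uniform(0, 1)    # every conversion rate between 0% and 100% equally likely

print "Strong prior 95%% interval: (%.3f, %.3f)" % strong_prior.interval(0.95)
print "Flat prior 95%% interval: (%.3f, %.3f)" % flat_prior.interval(0.95)

Update either of those with the same data and, given enough of it, they’ll end up in much the same place – with only a handful of points, though, the strong prior does most of the work.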

Shut up and Calculate

Code first – then commentary:

#!/usr/bin/python

import pymc as pm
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import datetime

basket_list = []
conversion_list = []

with open('conversion_data.txt', 'rb') as f:
    for line in f:
        baskets, conversions = line.strip().split()
        basket_list.append(int(baskets))
        conversion_list.append(int(conversions))

n_percent_list = len(basket_list)

uniform_one_samples = []
uniform_two_samples = []
tau_samples = []
uniform_one = pm.Uniform('uniform_one', 0, 1)
uniform_two = pm.Uniform('uniform_two', 0, 1)

tau = pm.DiscreteUniform('tau', lower=0, upper=n_percent_list)

@pm.deterministic
def lambda_(tau = tau, uniform_one = uniform_one, uniform_two = uniform_two):
    out = np.zeros(n_percent_list)
    out[:tau] = uniform_one
    out[tau:] = uniform_two
    return out

observations = pm.Binomial('obs', n=basket_list, p=lambda_, value=conversion_list, observed=True)

model = pm.Model([observations, uniform_one, uniform_two, tau])
mcmc = pm.MCMC(model)
mcmc.sample(10000, 2500, 1)

uniform_one_samples = mcmc.trace('uniform_one')[:]
uniform_two_samples = mcmc.trace('uniform_two')[:]
tau_samples = mcmc.trace('tau')[:]

N = tau_samples.shape[0]

conversion_rate = np.zeros(n_percent_list)
for day in range(0, n_percent_list):
    ix = day < tau_samples
    conversion_rate[day] = (uniform_one_samples[ix].sum() + uniform_two_samples[~ix].sum()) / N

plt.subplot(411)
plt.tight_layout()

plt.plot(range(n_percent_list), 100.*conversion_rate, lw=4, color='#E24A33', label='Expected conversion rate')
plt.xlim([0, n_percent_list])
plt.ylim([0,100])
plt.xlabel('Day')
plt.title("Changes in the probability of conversion")
plt.ylabel('Expected conversion rate')
plt.bar(np.arange(n_percent_list), [100.0*convert/basket for convert,basket in zip(conversion_list, basket_list)], color='#348ABD', alpha=0.65, label='Observed conversion rate')
plt.legend(loc='upper right')

ax = plt.subplot(412)

#
plt.hist(uniform_one_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of initial conversion probability", color='#A60628', normed=True)
plt.legend(loc='upper left')

plt.xlim([min(uniform_one_samples),max(uniform_one_samples)])
plt.xlabel("Probability of Conversion")
#
ax = plt.subplot(413)
#
plt.hist(uniform_two_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of later conversion probability", color='#7A68A6', normed=True)
plt.legend(loc='upper left')
plt.xlim([min(uniform_two_samples),max(uniform_two_samples)])
plt.xlabel("Probability of Conversion")
#
#
plt.subplot(414)
w = 1.0/tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_percent_list, alpha=1, label="posterior of conversion change date", color='#467821', weights=w, rwidth=2.)
plt.xticks(np.arange(n_percent_list))
#
plt.legend(loc='upper left')
plt.xlim([0, n_percent_list])
plt.xlabel("Day of change")
plt.ylabel('P(change occurred)')
#


plt.show()

That’s a fair bit of code but what I’m doing isn’t that complicated. After importing the libraries and grabbing the conversion data that we generated in the first script, we declare uniform_one and uniform_two.

We’re going to say that our data ultimately comes from a Binomial distribution where people convert with probability p. However, we are going to say that at some time (given in this example as tau) the value of p changed. p is the conversion probability of a Binomial distribution, so saying that p changes at some point is saying that our conversion rate changed.

Firstly, I declare my priors – I say that uniform_one and uniform_two, the conversion rates before and after the change we’re trying to identify, are both drawn from a uniform distribution that runs between 0 and 1. I’m saying I think the conversion rate is equally likely to lie anywhere between 0 and 100%. If you’re doing this properly I’m sure you can improve on this (I’d advise looking into the Beta distribution) but with sufficient data the uniform distribution should work fine.

So, at this point we’ve got the distributions from which our values of p are drawn described by uniform distributions – our first pair of priors.

Next, I declare tau – the time at which our conversion rate changed. I’m using a discrete uniform distribution and saying that it could have happened with equal probability at any time between the first day and the last day. If you’re looking to identify when the biggest change in conversion occurred I’d advise using a discrete uniform distribution. If you’re trying to identify the effect a change on a particular day had then you can be more creative with this prior.

Now we declare lambda_. For the eagle-eyed pseudo-coders amongst you, I’m sure it’s clear, but this represents our complete prior belief about the conversion rate (the p of a Binomial). We say we think the value of p follows uniform_one up until tau and then switches to uniform_two.

Then we let pymc take over – we specify that our distribution is a Binomial with as many trials as we generated in the first script, and with the probabilities given in lambda_. By passing in our conversion counts with observed=True we tell pymc that these are values we actually measured rather than ones to be sampled. I’m not going to go into the wizardry that follows but check out chapter 3 of the book I mentioned at the start if you’re interested. What we end up with (when we grab the traces) is 7,500 numbers (10,000 iterations minus the 2,500 burn-in) drawn from the posterior distributions of uniform_one, uniform_two and tau.

The final calculation builds, for each day of the data set, a vector of Booleans specifying whether that day falls before each of the 7,500 samples drawn from the tau distribution. Then we go along that Boolean vector and build a sum – where the day falls before the sampled tau we take the corresponding sample of uniform_one, otherwise we take the corresponding sample of uniform_two. Finally, dividing by the number of samples we drew gives an idea of the expected conversion rate for that day (the expected value of lambda_).

I get that that section is a bit complicated but if you get it:
a.) I’ve not had enough wine
b.) You’ve got the idea behind the testing so far.

Finally, there’s a lot of Matplotlib plotting stuff all leading us towards the following beautiful picture:

https://dogdogfish.com/wp-content/uploads/2014/08/simple_bayesian_testing.png

Basic Bayesian Conversion Testing/Change Recognition

I hope that shows more clearly what I’ve been trying to explain. Firstly, the top graph shows our expected and our observed conversions. The bars show our ‘actual conversion rates’ on given days – these were the values we artificially generated. The red line is the output of our model discussed in the final point above. The second and third graphs show the distributions of the conversion rate from before the change and after the change. If you’re looking to paint a compelling picture, I’d advise putting them both on the same axis, drawing them as densities (not histograms) and adding lines at the 95th percentiles. Finally, the bottom graph shows us when the conversion rate likely changed. As you can see, it’s fairly heavily concentrated around the middle (good) but there’s quite a wide range of days around there where it could be.

I’ve found that in practice, it makes sense to run these simulations multiple times (they don’t take very long) and concatenate the results together when working out when a change occurred. I’ll leave that as an exercise given how long this post already is – there’s a picture of what it looks like at the bottom – the posteriors of the conversion probabilities get a bit messy.

So there we have it – it’s been a bit of a slog and the woman who brings the wine round the train has started laughing when she pours me a new glass (surely a troubling sign). However, we’re now able to run Bayesian conversion tests like a boss and visualize them in a really funky kind of way.

As always, comments are welcome, questions too.

The Lannisters send their regards.

p.s.

Picture when run with multiple iterations…

Multiple Iteration Bayesian Conversion Plots

Multiple Iterations of Bayesian Conversion Testing.

Drugs prescribed by the NHS

Hi All,

As always, apologies for the length of time between posts – think this is a record. I was working on stuff I’m not allowed to share (work stuff, then Kaggle stuff) and got minorly derailed by Game of Thrones. Finished all the TV series, and one of the books. And Life of Pi. And I’m working my way through the Book Thief. So yeah, a tad derailed.

However, I’ve got a lazy Sunday and I saw an advert on TV that really annoyed me. It was basically along the lines of ‘did your doctor mess up? Why not sue them?’ and I think that it takes us towards where the Americans are. Which, when it comes to healthcare, is not where we want to go. Given that in the UK, if you’re being treated by a doctor it’s almost certainly an NHS doctor, it seems pretty sucky that people are being encouraged to sue them. Puts the price up for everybody and makes people less likely to become doctors and whatnot. Anyway, this isn’t a political rant blog – I thought I’d have a look and see what data is available on the NHS to see if I could show the effects of an increase in litigation on the standard of medical care provided/its costs (and ideally, contrast with America). In short, I couldn’t. The data that I wanted just wasn’t there. However, there was data on the drugs that NHS GPs prescribed (at least between January and June 2012) by practice, with cost data. That seemed pretty interesting and so here we are.

I’ve been playing around with Google’s coLaboratory (check it out here) and would have loved to use this to do this particular bit of analysis. However, after playing round with it for a bit and struggling with external documents, finding my Google docs and various libraries I wanted I’ve decided to leave it for a while until coLaboratory becomes a tad more mature. Lots of promise there and with good Google Analytics API integration we could transform analytics practices at my company. Certainly one to watch.

Anyway, without that I’ll think of the questions I want to answer first and then pick my tool. Firstly, the data…

The data

As ever, data.gov.uk to the rescue – head here and download yourself a nice copy of the data:

http://data.gov.uk/dataset/gp-practice-prescribing-data

Once you’ve got a copy of all the data and the list of practices in England (I used the most recent one) we’re ready to start asking some questions of the data…

The Analysis

First off, let’s pick something simple – which drug costs the NHS most in each of the months of our test set, and overall.

Total Drug Cost

This actually seems to lend itself to the mapreduce paradigm pretty nicely – the mapper seems pretty unnecessary and as I’ve not got a cluster to hand (and this doesn’t warrant me spinning one up with AWS) I’ll just write a quick reducer in Python and use the Unix sort. For what it’s worth, I think I might write something about spinning up a quick cluster on AWS in the near future. It’s a fairly useful skill to have and given the increasing reluctance of my computer to perform the most basic of tasks, I think a fair bit of my future data analysis might have to happen in the cloud. Anyway, this is what my command will look like once I’m done:

awk 'FNR>1{print}' T201202PDP IEXT.CSV | sort -t , -k5 | python spending_reducer.py

Nothing too complicated there. I’m ignoring the first line (that’s the awk command), then sorting the whole file based on the 5th column (the drug name) and piping the whole thing into the following reducer:

#!/usr/bin/python

import sys

current_drug = None
current_cost = 0.0

for line in sys.stdin:
    authority, trust, practice, drug_code, drug_name, number_bought, ni_cost, act_cost, period = line.strip().split(',')
    act_cost = float(act_cost.lstrip('0'))
    if drug_name == current_drug:
        current_cost += act_cost
    else:
        if current_drug:
            print current_drug + "\t" + str(current_cost)
        current_drug = drug_name
        current_cost = act_cost

print current_drug + "\t" + str(current_cost)

Nothing too difficult there – we’re just keeping track of the drug we’re on and adding up as we go. If I had no fear for the amount of RAM I had, we could’ve accomplished the same thing without the laborious sorting step using associative arrays in awk – a quick sketch of that idea follows.
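
Here’s a minimal sketch of that sort-free idea (not part of the original pipeline) – I’ve used a Python dict as the analogue of awk’s associative arrays, and the column positions match the reducer above:

#!/usr/bin/python

## Sketch only: total up actual cost per drug without any sorting,
## at the price of holding every distinct drug name in memory
import sys

costs = {}

for line in sys.stdin:
    fields = line.strip().split(',')
    drug_name, act_cost = fields[4], fields[7]
    costs[drug_name] = costs.get(drug_name, 0.0) + float(act_cost)

for drug_name, total in sorted(costs.items(), key=lambda item: -item[1]):
    print "%s\t%.2f" % (drug_name, total)

You’d pipe the awk output straight into that (with some hypothetical dict_reducer.py: awk 'FNR>1{print}' T20120[1-6]P*.CSV | python dict_reducer.py) and skip the sort entirely. For the next stage, though – the total across all 6 months – I am very afraid (RAM-wise), so we’ll stick with the sort and run the same query with a bit of wildcarding: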

awk 'FNR>1{print}' T20120[1-6]P*.CSV | sort -t , -k5 | python spending_reducer.py > drugs_by_spend.txt

This runs the same calculation over every file matching that wildcard pattern (all the data between 2012/01 and 2012/06). Note that this’ll take a little time – that sort is reasonably expensive over the 4 or so gigabytes of data we’ve got. Now we’ll pull together a few graphs and for this I think we’ll use R…

my_frame <- read.csv('drugs_by_spend.txt', header=F, sep="\t", col.names=c("Drug", "Spend"))
my_frame <- my_frame[order(-my_frame$Spend),]
head(my_frame)

The top 5 drugs, by spend, in the first half of 2012 were:

  1. Fluticasone Propionate (Inh)
  2. Atorvastatin
  3. Enteral Nutrition
  4. Pregabalin
  5. Budesonide

Between them these cost: £738,620,789

Wow – that’s a hell of a lot. In 6 months, the actual cost of these drugs alone was more than £700 million!

The total cost of drugs prescribed in that time period: £2,785,540,256

So I think we can surmise that lots of money is spent by the NHS – OK, I suppose that’s no surprise. For the non-doctors amongst us (that includes me), that list features two anti-asthmatic treatments, i.e. the inhalers that I’m sure a lot of you have (also includes me). A quick Wikipedia search shows that Pfizer holds the patent to at least a couple of those drugs (or at least did; Atorvastatin’s has expired) – it might be interesting to stick the patent holder next to these drugs. Maybe later…

Right, there’s lots that we could do here but I’m going to call it a day for now. In the future I think I’ll try and get more months of data and then start to look at evolving trends. To do that, I’ll use an AWS cluster and so will write something about using that.

Until then.

Kaggle Digit Recognizer Problem with Adaboost

Hi everybody,

Hi Dr Nick. But enough of that – today I’m going to be working through a Kaggle problem. For those of you who don’t know Kaggle, I can’t advise in favour of it strongly enough. It’s a great place to have a go at using real data sets to apply various machine learning techniques. There’s a leaderboard, discussions on methods and some none-too-shabby prizes. I’ll come clean at this point – I’m not a natural salesperson.

I believe there’s something of a taboo against posting solutions/methods for Kaggle – however, I think I’m good to write about a method of solving this particular problem. The digit recognizer problem seems to be a rolling competition with a bunch of already published results and a few training classes on how to solve it. Let me know if you think this is overstepping the mark.

So, the problem:
Given a big set (42,000 examples) of labelled training data – 28 x 28 black and white images of handwritten digits (0-9) – can we correctly identify other (identically dimensioned) handwritten digits?

There are a whole bunch of ways of doing this and the method I’ve had best success with is Support Vector Machines (using LibSVM). I may post an example of how to run that for this particular example but today I’d like to look at Adaboost (using the MultiBoost package)…

Until fairly recently I was entirely ignorant of Adaboost – I came across it on a different Kaggle problem (the Higgs one). There, a number of ‘out of the box’ methods were showcased – the most successful of which was Adaboost. A bit of reading on Adaboost suggests that it’s a fairly well-regarded, and successful method of performing a range of machine learning tasks. It’s also sometimes cited as being the best ‘out of the box’ (not specifically designed for the task at hand) algorithm in machine learning.

My current intuition on Adaboost is that it’s basically a ‘rule of thumb’ algorithm. It takes a lot of very simple decision boundaries and uses them to create a more complicated decision space. I say rule of thumb because I imagine a car mechanic or a doctor trying to diagnose a fault. The patient presents with symptom x, that makes a whole bunch of things less likely. However, if the patient falls into this age bucket and this ethnic group, some of the previously discounted things become more likely. I don’t know if that sort of explanation helps you but I quite like it. Basically, you create a simple rule that’s more often right than wrong. However, you can then update it with as many exceptions as you’ve got other bits of data. I think that’s a lot how the human decision-making process goes.
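
If you want to see that ‘stack of simple rules’ idea in action before we get to MultiBoost, here’s a minimal sketch using scikit-learn’s AdaBoostClassifier over depth-1 decision stumps. To be clear, this is my stand-in, not part of the pipeline below, and it uses sklearn’s small built-in 8x8 digits set rather than the Kaggle 28x28 images:

#!/usr/bin/python

## Illustrative only: boosting a pile of one-split 'rules of thumb'
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

digits = load_digits()
train_x, test_x, train_y, test_y = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)

## Each weak learner is a single decision stump; boosting reweights the points
## each stump gets wrong so that the next stump concentrates on the exceptions
booster = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=200)
booster.fit(train_x, train_y)

print "Held-out accuracy: %.3f" % booster.score(test_x, test_y)

MultiBoost is doing the same sort of thing below, just driven by config files rather than in code.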

Anyway, all this talking isn’t getting us closer to a juicy set of predictions. Mad props to whoever first generated this particular procedure for the Higgs problem – I’ve shamelessly ripped it off, only making changes where necessary for this problem.

#!/usr/bin/python

import random
import csv
import subprocess
import numpy as np

def DataToArff(dataset, labels, header, title, filename):
    """
    With this data structure we're able to turn an arbitrary string of data into a .arff file
    These files allow us to import the data into Multiboost or Weka (amongst other machine learning libraries)
    """
    with open(filename + ".arff", 'w') as f:
        f.write('@RELATION ' + title + '\n\n')
        for feature in header:
            f.write('@ATTRIBUTE ' + feature + ' NUMERIC\n')
        f.write('@ATTRIBUTE class {0,1,2,3,4,5,6,7,8,9}\n')
        f.write('\n@DATA\n')
        ## We could do this using all_data - however, we need the labels for further work
        ## Additionally, if the labels were numeric variables we'd be able to leave the rest of our work unchanged and handle them here
        for datarow, label in zip(dataset, labels):
            for value in datarow:
                f.write(str(value) + ',')
            f.write(str(label) + '\n')

all_data = list(csv.reader(open('train.csv', 'rb'), delimiter=','))
header = np.array(all_data[0][1:])
dataset = np.array([map(float, row[1:]) for row in all_data[1:]])
(numpoints, numfeatures) = dataset.shape

# Labels on the first column of the line
labels = np.array([row[0] for row in all_data[1:]])

randomPermutation = random.sample(range(len(dataset)), len(dataset))
## If this breaks halfway through, we'll be glad to be able to load our random permutation
np.savetxt('randomPermutation.csv', randomPermutation, fmt='%d', delimiter=',')

## I'll change the proportion of the train set and see how we get on.
numpointsTrain = int(numpoints*0.75)
numpointsValidation = numpoints - numpointsTrain

## Because we've got a random permutation there's no problem taking slices of the total set to sort into train and validation
datasetTrain = dataset[randomPermutation[:numpointsTrain]]
datasetValidation = dataset[randomPermutation[numpointsTrain:]]

labelsTrain = labels[randomPermutation[:numpointsTrain]]
labelsValidation = labels[randomPermutation[numpointsTrain:]]

DataToArff(datasetTrain, labelsTrain, header, 'DigitsML_challenge_train', 'training')
DataToArff(datasetValidation, labelsValidation, header, 'DigitsML_challenge_validation', 'validation')

## Our Adaboost parameters are wholly contained in the relevant config files
p1 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'config.txt'])
p1.wait()
p2 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresValidation.txt'])
p2.wait()

testText = list(csv.reader(open('test.csv', 'rb'), delimiter=','))
datasetTest = np.array([map(float, row[:]) for row in testText[1:]])
labelsTest = np.repeat('0', len(testText) - 1)

DataToArff(datasetTest, labelsTest, header, 'Digit_challenge_test', 'test')

p3 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresTest.txt'])
p3.wait()

testScoresText = list(csv.reader(open('scoresTest.txt', 'rb'), delimiter=','))
with open('submission.csv', 'w') as f:
    f.write('ImageId,Label\n')
    for index, entry in enumerate(testScoresText):
        ## Take the index of the maximum value for a given row - this is the most likely value
        f.write(str(index + 1) + "," + str(np.argmax(entry)) + '\n')

I’d like to think that that bit of code is reasonably transparent and clear on what it’s doing. If I’m wrong, a basic explanation:
1.) Randomly split the data into a training and validation set
2.) Create .arff files for both of these sets
3.) Run Multiboost (our Adaboost implementation) on the training set and validation set
4.) Using the files created from our train/validation Adaboost, get the test set and generate predictions (again using Multiboost)
5.) Generate a submission file in the required format

Easy does it. Now for the configuration files that I’m using:

config.txt

fileformat arff
verbose 2
learnertype TreeLearner
constant
seed 50
weightpolicy balanced
baselearnertype SingleStumpLearner 8
outputinfo results.dta e01w01auc
traintest training.arff validation.arff 5000
shypname shyp.xml

configScoresValidation.txt

posteriors validation.arff shyp.xml scoresValidation.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

configScoresTest.txt

posteriors test.arff shyp.xml scoresTest.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

I’ll be honest and say I’ve not really found the ideal set-up for this problem here. I’m able to get a score of around 0.965 using the ones here but if you look at the leader board you’ll see that’s not all that good. Certainly the LibSVM method performed much better (something like 0.99). Not to worry, it’s doing the right thing, generating good results and is another tool in our arsenal.

The World Cup may stymie my blog posts for a bit – then again, supporting England, it might only be 3 games.

Football’s coming home.

Bayesian AB Testing using Python

Hi all,

A reasonably big part of the job I do involves running AB tests and I think that’s usually something that falls under the remit of data scientists. Now I could go on and on about the difficulties around tagging tests, choosing test groups, choosing goals and the like. I won’t, but I’ll make at least one point. All of the aforementioned points are very important and are very non-trivial. If you don’t absolutely understand all of the steps a user might take, and how your test will handle these, work a bit harder on that. And outsourcing this problem to your AB testing framework provider…well, I’d not advise it.

Anyway, I recently read this blog post by the engineers over at Lyst and thought it was a really interesting idea. I’m really not a fan of p-values and think that a probability distribution is a much nicer way of communicating a result. With that in mind, the basic logic behind Lyst’s post (if you’ve not got the time/inclination to read it):

1.) You presumably have a reasonable idea about the distribution of whatever metric you’re plotting – for instance, you’re reasonably confident conversion % lies somewhere between 0 and 25%, with the most likely value being around 15% (as an example). You assume both test groups follow this distribution until you start getting data in to corroborate/contradict that.

Beta Distribution

Beta Distribution with alpha parameter 8 and beta parameter 40.

2.) On getting data in, you update your distribution to take into account these new data points – the distributions evolve with the new data.
3.) Your end result (see below) gives you a probability distribution for the conversion of both test groups.

Bayesian Test Result

The distribution of conversion for two (randomly generated) test datasets

I’d argue the above picture is much clearer than even the best explained set of p-values. It also really clearly lends itself to calculations along the lines of ‘What is the probability that test group A is better than test group B?’ or ‘how sure are you that test group A is 2% better than test group B?’
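
As a taste of how that calculation works, here’s a minimal sketch (mine, with made-up counts and the Beta(8, 40)-style prior from the caption above). With a Beta prior and binomial data the posterior has a closed form, so the ‘probability A beats B’ numbers drop straight out of a quick simulation, no MCMC required:

#!/usr/bin/python

## Illustrative numbers only - conversions out of visitors for two groups
import numpy as np

prior_alpha, prior_beta = 8.0, 40.0

conversions_a, visitors_a = 180, 1000
conversions_b, visitors_b = 150, 1000

## Conjugate update: posterior is Beta(alpha + successes, beta + failures)
posterior_a = np.random.beta(prior_alpha + conversions_a, prior_beta + visitors_a - conversions_a, 100000)
posterior_b = np.random.beta(prior_alpha + conversions_b, prior_beta + visitors_b - conversions_b, 100000)

print "P(A converts better than B): %.3f" % (posterior_a > posterior_b).mean()
print "P(A at least 2%% better than B): %.3f" % (posterior_a - posterior_b > 0.02).mean()

The code further down does the same job with pymc, which generalises to priors that aren’t conjugate, but for a plain conversion test this is all the machinery you need.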

Enamoured with this new way of reporting test results, I figured I may as well build something to do so. Instead of writing something where the test set-up is tied in with the result plotter, I wrote my plotter to take input from stdin.

First things first then, I want something to generate a stream of conversion events:

import random
import time

for _ in range(2000):
    group = 'test' if random.random() > 0.3 else 'control'
    if group == 'test':
        convert = 1 if random.random() < 0.16 else 0
    else:
        convert = 1 if random.random() < 0.10 else 0
    print '%s:%d' % (group, convert)

Easy does it – we’ll look at varying those numbers later. For the uninitiated, that’ll give us fairly uneven test groups with reasonably different conversion percentages.

Now for the plotter. I’ll confess, this is still a work in progress. It currently doesn’t assume anything about the test groups, including the names (taking those from the input stream). However, in future I’m hoping to extend the program to be able to perform multivariate Bayesian AB tests. If it appears messy in places, that’s either because I’m expecting the poor coding practices to lead to me having an easier time extending the code to allow multivariate testing, or because I’m a messy coder.

At this point, massive props to this book: it forms the basis of almost all of this code.

import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
import sys

results_dictionary = {}

## Store all our test results in memory - doesn't allow real-time updating and could get a bit serious if we've got a big set of results
for line in sys.stdin:
    if line == '':
        break
    group, conversion = line.strip().split(':')
    try:
        results_dictionary[group].append(int(conversion))
    except:
        results_dictionary[group] = [int(conversion)]

test_group_a, test_group_b = results_dictionary.keys()

## We'll run this twice, once with uniform prior and once with a beta prior
prior_dict = dict((group, pm.Uniform(group, 0, 1)) for group in results_dictionary.keys())
prior_dict_beta = dict((group, pm.Beta(group, 3, 50)) for group in results_dictionary.keys())

@pm.deterministic
def delta(p_A = prior_dict[test_group_a], p_B = prior_dict[test_group_b]):
    return p_A - p_B

@pm.deterministic
def beta_delta(p_A = prior_dict_beta[test_group_a], p_B = prior_dict_beta[test_group_b]):
    return p_A - p_B

## Bernoulli distribution with the events we've got
observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict[group], value=events, observed=True)) for group, events in results_dictionary.items())
beta_observations = dict((group, pm.Bernoulli('beta_obs_%s' % str(group), prior_dict_beta[group], value=events, observed=True)) for group, events in results_dictionary.items())

## Markov chain Monte-Carlo methods - returning samples from our updated distributions
mcmc = pm.MCMC([prior_dict[test_group_a], prior_dict[test_group_b], delta, observations[test_group_a], observations[test_group_b]])
mcmc_beta = pm.MCMC([prior_dict_beta[test_group_a], prior_dict_beta[test_group_b], beta_delta, beta_observations[test_group_a], beta_observations[test_group_b]])
mcmc.sample(20000,1000)
mcmc_beta.sample(20000,1000)

## Grab all the samples we need
samples = dict((key, mcmc.trace(key)[:]) for key in results_dictionary.keys())
delta_samples = mcmc.trace('delta')[:]
beta_samples = dict((key, mcmc_beta.trace(key)[:]) for key in results_dictionary.keys())
beta_delta_samples = mcmc_beta.trace('beta_delta')[:]

## It's this easy to work out probabilities of success
prob_a_better = (delta_samples < 0).mean()
prob_a_better_beta = (beta_delta_samples < 0).mean()

### Plotting
ax = plt.subplot(321)
plt.hist(samples[test_group_a], histtype='stepfilled', bins=50, alpha=0.85, label='Uniform posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.suptitle('Posterior distributions of %s, %s, and $\Delta$ unknowns' % (test_group_a, test_group_b))
plt.title('Uniform posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(323)
plt.hist(samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Uniform posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(325)
plt.hist(delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of $\Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.title('Uniform posterior of $\Delta$')
plt.autoscale()
plt.annotate('Probability %s \nis greater \nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better), (0,30))
#
ax = plt.subplot(322)
plt.hist(beta_samples[test_group_a], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(324)
plt.hist(beta_samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(326)
plt.hist(beta_delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of $\Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.autoscale()
plt.annotate('Probability %s \nis greater \nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better_beta), (0,30))
plt.title('Beta posterior of $\Delta$')
#
plt.tight_layout()
plt.show()

Giving us:

Bayesian AB Testing Graphs

Graphs comparing the conversion of one test group against another, using a Beta distribution and a Uniform distribution as a prior.

First things first, why are there 6 graphs? Well, realistically, given that this is designed to model website conversion, I know what the distribution is likely to look like. Therefore, I say that my initial priors are Beta distributions with parameters alpha = 3 and beta = 50 (as in the code above). However, I’m well aware that naysayers might quibble with the idea of making such strong assumptions before running the test. Alongside that prior, I’ve included the completely uninformative uniform prior. That basically says that the conversion is equally likely to fall anywhere between 0 and 100%. Another reason for doing this is to show what difference it makes – when we’re looking at < 10 data points, you’ll see fairly big differences between the two prior assumptions. Increase the number of events up past 1000 and the two converge to much the same value.
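
If you want to see that convergence without running the whole thing, here’s a minimal sketch (illustrative numbers only) comparing posterior means under the Beta(3, 50) prior from the code and a flat Beta(1, 1) (i.e. uniform) prior, for a group converting at roughly 15%:

#!/usr/bin/python

## Illustrative only: the informative prior matters at 10 events, not at 1000
for n in (10, 1000):
    conversions = int(round(0.15 * n))
    beta_mean = (3.0 + conversions) / (3.0 + 50.0 + n)
    flat_mean = (1.0 + conversions) / (1.0 + 1.0 + n)
    print "n = %4d: Beta(3, 50) posterior mean %.3f, uniform posterior mean %.3f" % (n, beta_mean, flat_mean)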

Additionally, we’ve fed in information about the test and control group – where has this delta come from and what’s it all about? That’s simply the difference between the test groups, as a probability distribution. How good is that? Really, that’s what people are interested in and, instead of making them compare two distributions and look at overlaps, we’ve done that for them and presented it as a probability distribution. Well done us.

Move along there.

Finding and Trading Volatile Stocks in Python

Hi all,

Can’t promise that this post won’t be bitty – I’m trying to simultaneously run an SVM and a random forest on a bunch of particle physics data for the Kaggle competition. Check it out, it’s pretty cool. Anyway, my computer is straining under the weight of those calculations and so while that was happening I decided to have a look at stock prices using Python/Pandas again.

After chatting with co-blogger Sean, and based on my (limited, and hilariously bad) experiences of stock trading, we decided it’d be interesting to identify volatile stocks that don’t seem to have greatly varying fundamental value. We’re basically looking for the position of a harmonic oscillator in a stock. I’m not graphing that – look it up yourself. The logic being, there’ll be a point at which it’s profitable to buy a stock like this on a down and sell again when it’s back up. Of course, this requires the assumption that the stock itself isn’t having a fundamental value shift – it’s just suffering from cyclicity. I don’t really know if/how this’ll work but that’s half the fun…

Right, back to it (I’ve caught up with Game of Thrones – get it watched). I’ve thought a reasonable amount about this and have decided our first job is to look at maximizing the following quantity:

\frac{Volatility}{Change_{daily}^n \, Change_{weekly}}

I might also throw in an additional concern – I’d like to be able to enter and exit the market whenever I want – I don’t see this being a big problem for me (I’m not going to be using a lot of money) but it’ll certainly be a concern for bigger players. Let’s cross that bridge if we need to.

So, to start off with, I’m going to define my daily volatility as:
\frac{\sum_{i=day_1}^{Today} \frac{HighPrice_i - LowPrice_i}{ClosePrice_i}}{NumberOfDays}

Hopefully nothing earth-shattering there, just want to see how much it varies over a day. Now while I want the stock price to vary a lot, I want it to head back to where it started. A rapidly increasing/decreasing stock is going to have wildly varying days. However, it’s also going to have a large overall trend. That’s no good for the purposes of finding stocks to buy/sell on a short time period.

Change_{daily} = \sqrt{\frac{\sum_{i=day_1}^{Today} \left(\frac{ClosePrice_i - OpenPrice_i}{OpenPrice_i}\right)^2}{NumberOfDays}}

Change_{weekly} = \sqrt{\frac{\sum_{i=week_1}^{Today} \left(\frac{ClosePrice_i - OpenPrice_i}{OpenPrice_i}\right)^2}{NumberOfWeeks}}

Easy does it – the reason I’ve squared the result is basically that I don’t care whether the stock is rising or falling. I’m trying to minimize the overall long-term variations from the mean.
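
Pulling those pieces together, the score I’m after is just the volatility divided by the change terms – a minimal sketch of the ranking function (mine, with an illustrative exponent and made-up numbers):

#!/usr/bin/python

def volatility_score(volatility, daily_change, weekly_change, n=1.0):
    """Bigger is better: lots of intra-day movement, little overall drift."""
    return volatility / ((daily_change ** n) * weekly_change)

## e.g. a stock that swings 3% a day but only drifts 0.5% daily and 1% weekly
print volatility_score(0.03, 0.005, 0.01)

The plots below look at the volatility and the change terms separately rather than collapsing them into this one number.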

So, how easy is this in Python? Remarkably so. Let’s start off by plotting a scatter graph of some of the more promising stocks.

import numpy as np
from pandas.io.data import DataReader
import pandas as pd
from datetime import datetime
from matplotlib import pyplot as plt
from pylab import savefig

## A list of American Stock Symbols
company_information = pd.read_csv('allcompany.csv')

volatility_measure = []
daily_change_measure = []
weekly_change_measure = []
labels = []

## Let's start out with the biggest 10 companies in my list

for company in company_information.sort(['MarketCap'], ascending=False).head(10)['Symbol']:
    try:
        company_frame = DataReader(company.strip(), 'yahoo', datetime(2013,1,1), datetime.now().date())
        company_frame['Volatility'] = (company_frame['High'] - company_frame['Low'])/company_frame['Close']
        volatility_measure.append(company_frame['Volatility'].mean())
        company_frame['Daily_Change'] = company_frame['Close'].diff()
        daily_change_measure.append(np.sqrt(np.mean(company_frame['Daily_Change']**2)))
        ## Take every 5th row
        weekly_company_frame = company_frame[::5]
        weekly_company_frame['Change'] = weekly_company_frame['Close'].diff()
        weekly_change_measure.append(np.sqrt(np.mean(weekly_company_frame['Change']**2)))
        labels.append(company.strip())
    except:
        print "Problem parsing %s" % company.strip()

for i in range(1,7):
    change_metric = [daily * (weekly ** (1./i)) for daily, weekly in zip(daily_change_measure, weekly_change_measure)]
    ax = plt.subplot(3,2,i)
    plt.xlabel('Log of overall change metric')
    plt.ylabel('Volatility metric')
    plt.title('Weekly power %.2f' % float(1./i))
    plt.scatter(change_metric, volatility_measure, c = volatility_measure, cmap=plt.get_cmap('Spectral'), label='Weekly power %.2f' % float(1./i))
    for label, x, y in zip(labels, change_metric, volatility_measure):
        plt.annotate(label, (x,y), xytext=(0,8), textcoords='offset points')

    plt.gca().set_xscale('log')
    plt.gca().legend_ = None

plt.suptitle('Daily Volatility of Stocks versus Overall Trend')
plt.tight_layout()
plt.show()
savefig('StockVolatility.png')

OK – it’s not especially pretty but it gives us the following picture:

Stock Volatility

The 10 biggest US stocks – their daily & weekly change versus their daily volatility

You could also make a fair point that I’ve formatted it poorly. Open it up as big as your browser will let you and you’ll be able to see it nicely. Or, just run the code and create your own picture. It’s dead easy. I promise.

So what can we infer from that picture? I’m going to go ahead and say not a huge deal. Apple & Google have made some crazy ups and downs over the last year or two (mostly ups) and hence I’ve been forced to use a log plot. Other than that, we can see a cluster of the remaining companies with GE seeming the most stable all round. One point I’d like to make now: by defining my metrics in such a way that they don’t really match to anything in reality, I’ve lost the ability to understand exactly what I’ve plotted. What I’m trying to say, is that the log of an overall change metric isn’t an intuitive quantity. Usually, it’s a good idea to pick metrics that have a fairly firm grounding in the real world unless you’ve got a really good reason not to. In my case, my reason is that all I’m trying to do here is identify stocks in the upper left most corner – I don’t care what their values are yet.

I’d also like to make the point here that for this data set, the change of power associated with the weekly metric seems to make no difference. I put it there to express the idea that we’re likely to want a different weighting on the daily and weekly variability depending on how often we want to trade the stock. As I’m hoping to trade multiple times daily, the daily variability is more important to me than the weekly variability (hence my choice of fractional powers of the weekly variable). If you’re looking at trading less regularly, change your parameters accordingly.

Now I’m going to go out on a limb and say that, when looking for daily volatility, the biggest companies in America aren’t the place to go looking. I’m sure that the algorithmic trading people are all over this kind of trade with fancy-pants C++ code designed to execute multiple trades/second. To do this at a reasonably large scale (and to overcome transaction/infrastructure costs) I’m going to say those guys will play with these big companies where a purchase of £1 million+ of shares isn’t going to be such a big deal. Playing in those markets must be the equivalent of going barracuda fishing with a thumb tack and a tie. I think we should start our search towards the lower market caps and work our way up until we’ve got a few hopefuls.

volatility_measure = []
daily_change_measure = []
weekly_change_measure = []
labels = []

for company in company_information[company_information['MarketCap'] > 10000000].sort(['MarketCap']).head(25)['Symbol']:
    try:
        company_frame = DataReader(company.strip(), 'yahoo', datetime(2013,1,1), datetime.now().date())
        company_frame['Volatility'] = (company_frame['High'] - company_frame['Low'])/company_frame['Close']
        volatility_measure.append(company_frame['Volatility'].mean())
        company_frame['Daily_Change'] = company_frame['Close'].diff()
        daily_change_measure.append(np.sqrt(np.mean(company_frame['Daily_Change']**2)))
        ## Take every 5th row
        weekly_company_frame = company_frame[::5]
        weekly_company_frame['Change'] = weekly_company_frame['Close'].diff()
        weekly_change_measure.append(np.sqrt(np.mean(weekly_company_frame['Change']**2)))
        labels.append(company.strip())
    except:
        print "Problem parsing %s" % company.strip()

for i in range(1,7):
    change_metric = [daily * (weekly ** (1./i)) for daily, weekly in zip(daily_change_measure, weekly_change_measure)]
    ax = plt.subplot(3,2,i)
    plt.xlabel('Log of overall change metric')
    plt.ylabel('Volatility metric')
    plt.title('Weekly power %.2f' % float(1./i))
    plt.scatter(change_metric, volatility_measure, c = volatility_measure, cmap=plt.get_cmap('Spectral'), label='Weekly power %.2f' % float(1./i))
    for label, x, y in zip(labels, change_metric, volatility_measure):
        plt.annotate(label, (x,y), xytext=(0,8), textcoords='offset points')

    plt.gca().set_xscale('log')
    plt.gca().legend_ = None
    plt.autoscale(tight=True)

plt.suptitle('Daily Volatility of Stocks versus Overall Trend')
plt.tight_layout()
plt.show()
savefig('SmallerCompanies.png')
Small Companies Volatility

Volatility versus overall change for American companies with Market Caps > $10,000,000

Well bugger me. I don’t know about you but that looks pretty cool to me. Ignore all the gumph in the middle and look at the outliers – AMCO, GRVY, PRLS, DGLY and EEME. These are great examples of companies that are going to be either maximums or minimums for our given metric.

OK – I’m going to call it a night for now but just think of the possibilities open to us now! We can change our date ranges, play around with our metrics and loop through as many stocks as we can find symbols for (harder than you’d think!) until we’ve got a reasonable amount of stocks that we think are great candidates for regularly buying and selling.

Next time, I’ll finalize my list of stocks and hopefully start to gain an idea of when one of these stocks becomes a buy, and when it becomes a sell. That sounds fairly difficult actually. Ah well, that’s the fun.

Winter is coming.

Hadoop wordcount in Python

Hi all,

There’ll be a follow up post to this detailing how to run a mapreduce using Eclipse and Java but, as I’ve found myself in permissions hell in running that, I’ll go with the easy one first. Hadoop comes with a streaming jar that allows you to write your mappers and reducers in any language you like – just take input from stdin and output to stdout and you’re laughing. I’ll show you how to achieve this using Python.

Cluster Set-up

I’m going to assume you’ve followed a tutorial and have got Hadoop installed and working – if you haven’t, follow one (maybe even mine) and then come back here. Make sure you’ve got HDFS and Yarn running by executing the following commands:

su - hduser ## Only need this if you created a user called hduser to interface with Hadoop
cd /usr/local/hadoop ## If you followed the tutorial - otherwise, wherever your Hadoop home directory is
sbin/start-all.sh

Let’s see about putting a text file into HDFS for us to perform a word count on – I’m going to use The Count of Monte Cristo because it’s amazing. Honestly, get it read if you haven’t. It’s really really good. Anywho, enough fandom – this little command will download the whole book and stick it into whichever directory you happen to be in when you run the command.

 cd ~
wget -O 'count_of_monte_cristo.txt' http://www.gutenberg.org/cache/epub/1184/pg1184.txt

Now we’ve got the file in our home directory (really, it was that easy, check it out if you don’t believe me – then read the book). However, that’s not in HDFS – we need to explicitly put it there. I’m going to create a directory in HDFS called input and then put the file in there:

/usr/local/hadoop/bin/hadoop fs -mkdir /input
/usr/local/hadoop/bin/hadoop fs -put ~/count_of_monte_cristo.txt /input

Has it worked?

Run this command:

 /usr/local/hadoop/bin/hadoop fs -ls /input | grep count_of_monte_cristo | awk -F '/' '{print $3}' | cut -d '.' -f1 

If it returns a warning followed by ‘count_of_monte_cristo’ then you’re in the money. If you don’t understand the commands above, don’t worry. But do find out about them.

Otherwise, drop me a comment and I’ll see what can be done.

The Mapper

With this bit of code we’re going to go over every line in the text file and output the word and the number of instances of that word (one, for now) – easy does it:

#!/usr/bin/python

import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)

Save that file as something sensible at a sensible location – I’m going to use /home/hduser/word_mapper.py.
Also, make sure it’s executable:

chmod +x /home/hduser/word_mapper.py

Has it worked?
Run this little beaut’ of a command:

 /usr/local/hadoop/bin/hadoop fs -cat /input/count_of_monte_cristo.txt | /home/hduser/word_mapper.py 

If you’ve gone maverick and used a different filename or file location then that’s fine – just substitute that in where I’ve used

/home/hduser/word_mapper.py

. If you’ve gone maverick but don’t really know what you’re doing and don’t know what I’ve just said, that’s basically on you. Keep trooping on, you’ll get there.

Either way, don’t stop until that code outputs a stream of words followed by the number 1. Don’t worry – you can stop it by pressing Ctrl and C together.

The Reducer

We’ve got ourselves a nice stream of words. The Hadoop streaming jar will take care of the sorting for us (though we can override the default behaviour should we choose) so we just need to decide what to do with that stream of words. I’m going to propose this:

#!/usr/bin/python

import sys

current_word = None
current_count = 1

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if current_word:
        if word == current_word:
            current_count += int(count)
        else:
            print "%s\t%d" % (current_word, current_count)
            current_count = 1

    current_word = word

if current_word:
    print "%s\t%d" % (current_word, current_count)

Follow the code through and try to think of the different cases it’s trying to catch. The first and last lines are tricky but play around with it – what happens if I just feed a file containing one word? What about a file with no duplicate words? Think about all the different cases and hopefully – the above code handles them all as you’d expect. If not, please let me know. That’d be real embarrassing.

Has it worked?

Make sure that file is executable:

 chmod +x /home/hduser/word_reducer.py 

Run this:

 /usr/local/hadoop/bin/hadoop fs -cat /input/count_of_monte_cristo.txt | /home/hduser/word_mapper.py | head -n 100 | sort | /home/hduser/word_reducer.py 

If everything’s gone to plan you should see a bunch of lines and associated counts – some of them should be non-one.

Super.

Run the Mapreduce

This is what you’ve been waiting for. Well – it’s what I’ve been waiting for at least. Run this command and you’ll basically be a Hadoop hero:

 cd /usr/local/hadoop
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -files /home/hduser/word_mapper.py,/home/hduser/word_reducer.py -mapper /home/hduser/word_mapper.py -reducer /home/hduser/word_reducer.py -input /input/count_of_monte_cristo.txt -output /output

And off it goes – enjoy watching your mapreduce race through at what I’m sure is a barely tolerable crawl.

Has it worked?

Run this beauty:

 /usr/local/hadoop/bin/hadoop fs -cat /output/part-00000 

If you see a stream of likely looking results – you’re golden. If you want to get the file out of HDFS for inspection run something like this:

 /usr/local/hadoop/bin/hadoop fs -get /output/part-00000 /home/hduser/monte_cristo_counted.txt
less /home/hduser/monte_cristo_counted.txt 

Hope that’s worked well for you – it’s not the most glamorous of Hadoop jobs but it’s a good stepping stone. In a post coming to you soon I should be able to show you how to get Eclipse set up to run Hadoop jobs and give you an example or two in Java.

(Pseudo) Distributed Wishes

Installing Eclipse 4.3.2 (Kepler) on Ubuntu 14.04

Hi all,

If you’ve followed any of my other posts you’ll know I recently wiped my OS (something I do alarmingly regularly) – as such, I’m reinstalling a bunch of stuff. Given that I do this so often, it makes sense for me to have a repository of tutorials for doing so!

Today I’m on Eclipse – I’m by no means an Eclipse regular. I do most of my coding in Python and find Vim works well enough for most of what I need to do. RStudio for R, Vim/IPython for Python, but when I’m doing anything in Java (inc. Android stuff) I’ll go with Eclipse. Now installing Eclipse on Ubuntu is really easy – there’s a version in the software centre that’ll work just fine. However, if you want a more up to date version then there’s a degree of hacking about required. Let’s go:

Check you’ve got Java installed

Give this command a cheeky little run (from terminal – press Ctrl + shift + t to get terminal on the screen):

which java
java -version

If they both worked you should have seen the path to where your Java is installed (mine was /usr/bin/java) and the version of Java you have installed (1.7 OpenJDK for me). If both commands worked (gave you a path and a version) then excellent, carry on wayward son. It doesn’t matter if your results are different to mine, just keep trooping on. If they didn’t work, you’ll want to install Java before continuing. I’ll not deal with that here but I mention how to do so in my post on installing Hadoop on Ubuntu.

Download Eclipse

As with installing Hadoop, there are a couple of ways to do this. I’m going to advise heading on over to the Eclipse download link:

Eclipse

Pick the top version (Eclipse standard) and decide which bit version you want (32 or 64). If you don’t know which you should be using, I’d advise running the following command:

 uname -a | awk '{print $12}' 

If the output is

x86_64

you’ll likely want 64 bit,

i386

tells me you’re after 32 bit. If it says something else entirely then I’m flummoxed – have a Google around to find out what bit your Ubuntu installation is.

If you’ve not got a GUI then I’m going to decide you shouldn’t be installing Eclipse. Wget the .tar.gz file if that’s you but really, what are you doing? Actually, maybe you’re setting up a bunch of computers over SSH which will have monitors in the future but don’t now. OK – wget this link if you’re on 64 bit:

http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz&r=1

and this link if you’re not:

http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-standard-kepler-SR2-linux-gtk.tar.gz&r=1

Installing Eclipse

At this point I’m assuming you’ve got a tarred Eclipse in your downloads folder and you know where Java is installed.

Open up a terminal, head over to your downloads directory and untar the Eclipse file:

cd ~/Downloads
tar -xzf eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz

Next we’re going to put it into a more sensible directory and make it easily launchable from terminal:

sudo mv eclipse /usr/local/eclipse
sudo ln -s /usr/local/eclipse/eclipse /usr/bin/eclipse

At this point, if you could kindly run the following command from terminal:

 eclipse 

I’d expect you to see Eclipse pop up and for you to be able to start developing.

Estimating Pi using the Monte Carlo Method in Python

Hi all,

If you were especially upset then I’m sorry it’s been a while since I posted – I discovered Game of Thrones. In a later post I’ll chart the effect of Game of Thrones on overall productivity. I think there’ll be some unsurprising results. Anyway, I spend a reasonable amount of time on the train with my (oft abused) laptop each morning/evening; I don’t have the internet and I don’t have any textbooks so it’s basically a question of what I can work on given only the documentation on my computer and whatever I can remember about programming/maths/stuff.

I was having a think and remembered that you could estimate Pi using a Monte Carlo method and thought that sounded like the sort of thing I should do. The logic is basically as follows:

Let’s draw a square of side length 2r and a circle centred exactly in the middle of the square with radius r. A well organised blogger would show you a diagram of this set-up; screw it – this is the code to do it and this is what it looks like:

import matplotlib.pyplot as plt
fig = plt.figure()
axis = fig.add_subplot(1,1,1)
circle = plt.Circle((0,0), 1)
axis.add_patch(circle)
axis.set_xlim([-1,1])
axis.set_ylim([-1,1])
axis.set_title('A Circle in a Square')
plt.show()
A Circle in a Square

A Circle in a Square

Brilliant – was it worth it? Probably not. But there you have it – with that set up we can now start the Monte Carlo bit. We’ll throw darts at that picture randomly; you’d expect the number of darts in the circle to be proportional to the area of the circle and the number of darts in the square to be proportional to the area of the square. Using that fact and the formulae for the areas of a circle and a square you can estimate Pi using the ratio of darts in the circle and in the square.
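
Before the full object-oriented version below, here’s a minimal sketch of the same estimate in a few lines of numpy (mine, not from the post) – the fraction of random darts landing inside the unit circle approximates pi/4:

#!/usr/bin/python

import numpy as np

n_darts = 1000000
x, y = np.random.uniform(-1, 1, (2, n_darts))
in_circle = (x**2 + y**2 < 1).sum()

print "Estimate of pi: %.5f" % (4.0 * in_circle / n_darts)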

Sound good? It’s fairly easy to run this in Python and graph the output using Matplotlib. You’ll see I’ve used Object Oriented Python for this particular exercise, I don’t really know why. Especially because I had a chance to use inheritance and didn’t. Well done me. I’ve let everybody down. Anyway – this is the code I came up with and the graph below shows what I ended up with:

#!/usr/bin/python

import numpy as np
import math
import matplotlib.pyplot as plt

"""
Calculate pi using Monte-Carlo Simulation
"""

"""
First - the maths:
A circle has area Pi*r^2
A square wholly enclosing above circle has area 4r^2
If we randomly generate points in that square we'd expect the ratio of points in the square/points in the circle to equal the area of the square divided by the area of the circle.
By that logic n_in_sq/n_in_cir = 4/Pi and so Pi = (4 * n_in_cir)/n_in_sq
"""

class pi_calculator(object):

    def __init__(self, radius, iterations):
        self.radius = radius
        self.iterations = iterations
        self.square = square(radius)
        self.circle = circle(radius)

    def scatter_points(self):
        for _ in range(self.iterations):
            point_x, point_y = ((2*self.radius) * np.random.random_sample(2)) - self.radius
            self.square.increment_point_count(point_x, point_y)
            self.circle.increment_point_count(point_x, point_y)

    def return_pi(self):
        return (4.0*self.circle.return_point_count())/self.square.return_point_count()

    def calculate_accuracy(self, calc_pi):
        absolute_error = math.pi - calc_pi
        percent_error = 100*(math.pi - calc_pi)/math.pi
        return (absolute_error, percent_error)

    def return_iterations(self):
        return self.iterations

class square(object):

    def __init__(self, radius):
        self.radius = radius
        self.lower_x = -radius
        self.lower_y = -radius
        self.upper_x = radius
        self.upper_y = radius
        self.point_count = 0


    def in_square(self, point_x, point_y):
        return (self.upper_x > point_x > self.lower_x) and (self.upper_y > point_y > self.lower_y)


    def increment_point_count(self, point_x, point_y, increment = 1):
        if self.in_square(point_x, point_y):
            self.point_count += increment

    def return_point_count(self):
        return self.point_count

class circle(object):

    def __init__(self, radius):
        self.radius = radius
        self.point_count = 0

    def in_circle(self, point_x, point_y):
        return point_x**2 + point_y**2 < self.radius**2

    def increment_point_count(self, point_x, point_y, increment=1):
        if self.in_circle(point_x, point_y):
            self.point_count += increment

    def return_point_count(self):
        return self.point_count


if __name__ == '__main__':
    axis_values = []
    pi_values = []
    absolute_error_values = []
    percent_error_values = []
    for _ in range(1,3000,30):
        pi_calc = pi_calculator(1, _)
        pi_calc.scatter_points()
        print "Number of iterations: %d    Accuracy: %.5f" % (pi_calc.return_iterations(), math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())[0]))
        axis_values.append(_)
        pi_values.append(pi_calc.return_pi())
        absolute_error_values.append(math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())[0]))
        percent_error_values.append(math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())[1]))

    improvement_per_iteration = [absolute_error_values[index] - absolute_error_values[index-1] for index, value in enumerate(absolute_error_values) if index > 0]
    fig = plt.figure()
    fig.suptitle('Calculating Pi - Monte Carlo Method')
    ax1 = fig.add_subplot(2,2,1)
    ax2 = fig.add_subplot(2,2,2)
    ax3 = fig.add_subplot(2,2,3)
    ax4 = fig.add_subplot(2,2,4)
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    ax1.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax1.set_xlabel('Iterations')
    ax1.set_ylabel('Calculated value of Pi')
    ax1.plot(pi_values, 'k')
    ax1.plot([math.pi for entry in axis_values], 'r')
    ax2.set_ylabel('Absolute error')
    ax2.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax2.set_xlabel('Iterations')
    ax2.plot(absolute_error_values, 'k', label="Total Error")
    ax3.set_ylabel('Absolute percentage error (%)')
    ax3.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax3.set_xlabel('Iterations')
    ax3.plot(percent_error_values, 'k', label="Percent Error")
    ax4.set_ylabel('Absolute improvement per iteration')
    ax4.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax4.set_xlabel('Iterations')
    ax4.plot(improvement_per_iteration, 'k', label="Absolute change")
    plt.savefig('pi_calculation.png')
    plt.show()


giving us:

Monte Carlo estimation of Pi – an Investigation

I can only apologise for any dodgy code in there – in my defence, it was early in the morning. As you can see, it only takes around 100 ‘darts thrown at the board’ to start to see a reasonable value for Pi. I ran it up to about 10,000 iterations without hitting any significant calculation time. The fourth graph doesn’t really show anything interesting – I just couldn’t think of anything to put there.

That’ll do for now – I built something that’ll stream tweets on the Scottish Independence Referendum but don’t know what to do with it yet; there’ll likely be some sort of blog post. There’s a chance I’ll do some sentiment analysis but I’m not sure yet.

When you play the Game of Thrones, you win or you die.

UK Housing Data – Data Munging for Machine Learning

Hey all,

I’ve been tentatively threatening to write this post for a while now and I’ve been itching to do a bit of machine learning – I’m going to be walking through the steps required to run linear regression and an SVM on our housing sale data to try to predict future house sale prices. I’m not overly confident that it’s going to give us a great deal of predictive power – any example I see online uses useful things like area of house or number of bedrooms. All I’ve got is when you sold it, where you sold it and the type of house (new or old, detached or terraced, freehold or leasehold), etc. Not to worry – it’ll be a blast all the same.

First things first, which machine learning technique should we use? We’ll be using a supervised algorithm as we have a labelled training set (we can tell our classifier what the right answer is). We’re looking at a regression problem, not a classification problem (we’re trying to predict a continuous variable, not a discrete one). All in all, that’s screaming linear regression to me. As an additional bonus though, I also happen to know that SVMs work very well for this kind of problem and have used Libsvm for similar things in the past.

In terms of the bits of kit I’ll be using – let’s start off, as I always seem to, in bash.

cut -d ',' -f3- pp-all.csv | cut -d ',' -f1-13 | tr -d '"' > pp_all_for_weka.txt 

Just a bit of formatting – if you’ve followed the posts through then you’ll have pp-all.csv on your computer. If you’ve not, see here for how to get it.

Now we’re here, you may notice the strange filename – Weka is a machine learning library for Java. We’ve been using it a bit at work recently and I fancied getting a bit more experience of it. I’ll be using Weka for the linear regression; I’ve not got the heart to do a Weka installation post just yet (it’s not difficult, I’m just tired) but will do one if there’s any demand. I’m going to be using LibSVM for my support vector machine calculations and, again, I’m not going to talk you through the install unless you fancy it. The reason I’ve told you all this is that I’m now going to convert my CSV file into an ARFF file and a libsvm formatted file. While I’m at it, I’m going to convert all of my values (postcodes, dates, etc.) into numbers. Doing this lets me feed the data set into the above programs and get an answer out relatively easily.

Could we write the algorithms ourselves? Sure – but not nearly as well as they’ve already been written. Sidebar: if you’re interested in understanding how all these algorithms work I’d encourage you to check out Andrew Ng’s lecture series on Coursera. It’s excellent.

Anyway, to run the conversion and to output the two different files I wrote the following Python script:

#!/usr/bin/python

class mapping_dictionary(object):

    def __init__(self, output_file, svm_file):
        self.mapping = {}
        self.mapping_count = {}
        self.column_names = ['date', 'postcode', 'f_map', 'n_map', 'l_map', 'addy', 'addy1', 'addy2', 'addy3', 'addy4', 'addy5', 'addy6', 'price']
        self.writer_file = open(output_file, 'w')
        self.svm_file = open(svm_file, 'w')

    def shut_file(self):
        self.writer_file.close()
        self.svm_file.close()

    def add_to_dictionary(self, column, key):
        if column not in self.mapping:
            self.mapping[column] = {}
            self.mapping_count[column] = 0
        if key not in self.mapping[column]:
            self.mapping[column][key] = self.mapping_count[column]
            self.mapping_count[column] += 1

    def interpret_list(self, listy):
        if len(listy) != len(self.column_names):
            print "Error - unexpected number of columns in the line: %d" % len(listy)
            return None
        for index, value in enumerate(listy):
            if index == 12:
                try:
                    value = int(value.strip().strip('\n'))
                except:
                    break
            self.add_to_dictionary(self.column_names[index], value)
        self.write_to_file(listy)
        self.write_libsvm_file(listy)

    def write_to_file(self, listy):
        string_to_write = ','.join([str(self.mapping[self.column_names[index]][entry]) for index, entry in enumerate(listy) if index != 12])
        string_to_write += ",%s\n" % str(listy[-1])
        self.writer_file.write(string_to_write)

    def write_libsvm_file(self, listy):
        string_to_write = ' '.join([str(index + 1) + ":" + str(self.mapping[self.column_names[index]][entry]) for index, entry in enumerate(listy) if index != 12])
        string_to_write = str(listy[-1]) + " " + string_to_write + "\n"
        self.svm_file.write(string_to_write)

mapping = mapping_dictionary('nice_weka_output.txt', 'nice_libsvm_output.txt')
with open('pp_all_for_weka.txt', 'rb') as f:
    for line in f:
        mapping.interpret_list(line.strip().split(','))
mapping.shut_file()

A reasonable amount of Python happening there – if you were worrying about it, I really wouldn’t. All we’re doing is replacing every field (apart from price) with an integer and outputting that in two different formats as we work our way through the file. If I was being thorough I’d remove the hardcoded list at the top, require a header row, take the filenames as command line arguments and then it’d work as a general tool for formatting CSVs as libsvm and arff files. Actually, that doesn’t sound like a bad idea at all.
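
A rough sketch of what that general tool might look like (not battle-tested – the filenames, the --target flag and the assumption of a header row are all mine):

#!/usr/bin/python

import argparse
import csv

def encode_file(input_file, csv_out, libsvm_out, target_column):
    ## Map every value in every non-target column to an integer, writing CSV and libsvm formats as we go
    mapping = {}
    with open(input_file, 'rb') as f_in, open(csv_out, 'w') as f_csv, open(libsvm_out, 'w') as f_svm:
        reader = csv.reader(f_in)
        header = reader.next()
        target_index = header.index(target_column)
        for row in reader:
            encoded = []
            for index, value in enumerate(row):
                if index == target_index:
                    continue
                column_map = mapping.setdefault(header[index], {})
                encoded.append(column_map.setdefault(value, len(column_map)))
            target = row[target_index].strip()
            f_csv.write(','.join(str(v) for v in encoded) + ',' + target + '\n')
            f_svm.write(target + ' ' + ' '.join('%d:%d' % (i + 1, v) for i, v in enumerate(encoded)) + '\n')

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Encode a CSV with a header row for Weka/libsvm')
    parser.add_argument('input_file')
    parser.add_argument('--target', default='price')
    args = parser.parse_args()
    encode_file(args.input_file, args.input_file + '.encoded.csv', args.input_file + '.libsvm', args.target)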

Now you’ve got your input files, it’s child’s play to run our algorithms to create models that we can pass new data to. I’ll create a separate post detailing the output of the two models above but it looks like I’ll have to leave them running through the night!
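
If you fancy kicking the SVM off from Python rather than the command line, libsvm ships a svmutil module that makes it fairly painless – something like this (a sketch; ‘-s 3’ selects epsilon-SVR and I’m only predicting back on the training data as a sanity check):

#!/usr/bin/python

from svmutil import *

## Load the libsvm-formatted file we generated above
y, x = svm_read_problem('nice_libsvm_output.txt')

## Train an epsilon-SVR (-s 3) with an RBF kernel (-t 2)
model = svm_train(y, x, '-s 3 -t 2')

## Predict against the training data - a sanity check, not a real evaluation
p_labels, p_acc, p_vals = svm_predict(y, x, model)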

Until then.

Best Housing Investments of the last 20 years

Hey all,

So from that long list I posted I’ve decided I’m most interested in the fastest growing and falling towns in the UK, as measured by the average house selling price. Once we’ve got a bunch of hot/not towns we might even be able to have a look at what these towns looked like before their boom. The hope is then that we’ll be able to find towns in that situation right now and boom, we’re housing moguls.

Given my poor excuse for hardware, we’ll have to start in bash:

 awk -F, '{print $3"-"$4"-"($(NF-3))}' pp-all.csv | tr -d '"' | cut -d '-' -f1,2,3,5 | tr '-' '\t' | awk '{summary_array[$2"-"$3"-"$4] += $1; count_array[$2"-"$3"-"$4]++} END {for (region in summary_array) print region"\t"summary_array[region]/count_array[region]}' | tr '-' '\t' | sort -nk1 -nk2 > average_sale_price_by_town_and_month.txt 

where pp-all.csv is all the UK housing data as downloaded from data.gov.uk. The format of the data and what the above code does are inferrable from the R code a little further down.
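
If the bash one-liner is hard on the eyes, the same aggregation looks roughly like this in pandas (a sketch – it assumes you’ve already pulled Price, Year, Month and Town columns out of pp-all.csv into a file, which isn’t something I’ve actually done here):

#!/usr/bin/python

import pandas as pd

## Assumes a CSV with Price, Year, Month and Town columns extracted from pp-all.csv
sales = pd.read_csv('sales_subset.csv')
## Average sale price per town per month
averages = sales.groupby(['Year', 'Month', 'Town'])['Price'].mean().reset_index()
averages.to_csv('average_sale_price_by_town_and_month.txt', sep='\t', index=False, header=False)

And now the R: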

library(reshape2)
library(plyr)
library(ggplot2)
myData <- read.csv('average_sale_price_by_town_and_month.txt', header=F, sep='\t')
colnames(myData) <- c("Year", "Month", "Town", "Average_Price")
## Let's not make the same mistakes as we did last time - which data should we remove pre-analysis?
summary(count(myData, c('Town'))$freq)
## From that, I'm going to say let's remove any town without 229 points.
myData <- myData[!(myData$Town %in% levels(myData$Town)[(count(myData, c('Town'))$freq < 229)]),]

I’m going to break out of code mode to explain that last line because it is horrible. I’m first grouping my data by town and counting the number of entries – in SQL this’d be SELECT COUNT(date) FROM myData GROUP BY Town. I’m then comparing every entry in that list to 229 (the maximum number of points each town can have) and producing a logical list, one entry per town, indicating whether the town at that index is missing points. levels(myData$Town) gives me a list of all the towns, which is placed against that logical list – only the towns matching a TRUE survive, so at that point we’ve got a list of the towns we want to get rid of. myData$Town %in% then compares the Town column of myData against this list and acts like a SQL WHERE clause, picking out the rows belonging to those towns. Finally, in confusing developments, I’ve inverted all of that logic with a ! – so we now only keep rows where the town has more than 228 entries. The comma before the closing square bracket says to include all columns (we could add filters there if we liked). We assign all of this back to itself, in effect filtering the original data frame by removing any lines belonging to towns which don’t have a full complement of points. All in that one line.
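
For what it’s worth, the same filter is a bit easier on the eye in pandas terms – a sketch, assuming the same tab-separated file as above:

#!/usr/bin/python

import pandas as pd

myData = pd.read_csv('average_sale_price_by_town_and_month.txt', sep='\t', header=None,
                     names=['Year', 'Month', 'Town', 'Average_Price'])
## Keep only the towns with the full complement of 229 monthly points
counts = myData.groupby('Town').size()
myData = myData[myData['Town'].isin(counts[counts >= 229].index)]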

As practical advice for the novice R coder (which I’d class myself as): start with the smallest bit of code you can and then add bits on when you’re comfortable with what they’re doing. Actually, that’s not just true of R – the bash one-liner above would likely be best constructed in exactly the same way.

Anyway, where were we?

myData$Datey <- as.Date(paste(myData$Year, myData$Month, 1, sep='-'), '%Y-%m-%d')
myData <- myData[,!(names(myData) %in% c("Year", "Month"))]
ts_frame <- dcast(myData, Datey ~ Town, value.var="Average_Price")
row.names(ts_frame) <- ts_frame$Datey
ts_frame <- ts_frame[,!(names(ts_frame) %in% c("Datey"))]
growth_frame <- data.frame(matrix(NA, ncol=ncol(ts_frame), nrow=1))
colnames(growth_frame) <- names(ts_frame)
for (i in 1:ncol(ts_frame)) {
  total_growth <- 100*((ts_frame[length(ts_frame[,i]),i] - ts_frame[1,i])/ts_frame[1,i])
  growth_frame[[names(ts_frame)[i]]] = total_growth
}
long_growth_frame <- melt(growth_frame)
colnames(long_growth_frame) <- c("Town", "Percentage_growth")
long_growth_frame <- long_growth_frame[long_growth_frame$Percentage_growth > 0.1,]
ggplot(long_growth_frame, aes(x=Percentage_growth)) + geom_density() + ggtitle("Percent Uplift in UK housing prices between 1995 and 2014")

giving us:

Density of the Percentage Growth of Average House Price (by town) in the UK between 1995 and 2014

As we can see, the bulk of towns experienced between 100% and 400% growth in that time – if anybody can point me towards UK salary figures over that time period I think that’d be a nice set to join this with.

Anyway, let’s take what we’ve done in the previous post on house prices and plot the data on a UK map. There are too many points for me to reasonably plot all of them – let’s go with the top 20 (red diamonds) and bottom 20 (black crosses):

ordered_growth_frame <- long_growth_frame[order(long_growth_frame$Percentage_growth),]
library(maps)
library(mapdata)
library(RCurl)
library(RJSONIO)
## A couple of functions allowing us to dynamically get the longitude and latitude of regions
construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

gGeoCode <- function(address,verbose=FALSE) {
  if(verbose) cat(address, "\n")
  u <- construct.geocode.url(address)
  doc <- getURL(u)
  x <- fromJSON(doc,simplify = FALSE)
  if(x$status=="OK") {
    lat <- x$results[[1]]$geometry$location$lat
    lng <- x$results[[1]]$geometry$location$lng
    return(c(lat, lng))
  } else {
    return(c(NA,NA))
  }
}

map('worldHires',  c('UK', 'Ireland', 'Isle of Man','Isle of Wight'), xlim=c(-7,2), ylim=c(50.1,58.7))  
long_and_lat <- data.frame(sapply(paste(head(ordered_growth_frame, n=20)$Town, ", UK", sep=''), function(x) gGeoCode(x)))
row.names(long_and_lat) <- c("Latitude", "Longitude")
long_and_lat <- data.frame(t(long_and_lat))
points(long_and_lat$Longitude, long_and_lat$Latitude, col=1, pch=4)
long_and_lat <- data.frame(sapply(paste(tail(ordered_growth_frame, n=20)$Town, ", UK", sep=''), function(x) gGeoCode(x)))
row.names(long_and_lat) <- c("Latitude", "Longitude")
long_and_lat <- data.frame(t(long_and_lat))
points(long_and_lat$Longitude, long_and_lat$Latitude, col=2, pch=5)
title('Fastest/Slowest Growing House Prices - UK (1995-2014)')
legend("topright", legend=c("Fastest", "Slowest"), title="Legend", bty="n", pch=c(5,4), col=c("red", "black"), inset=c(-0.05,0))

giving us:

The towns with the fastest and slowest growth in average house price since 1995.

So it looks like the South Coast has been the place to buy houses in the last 20 or so years. And the North East/North West were the places to be avoided.

OK – that’s all well and good but it doesn’t really tell us anything about the area we should be buying houses in now. Hold your horses. I’m getting to that. Obviously we’re not really going to be able to learn anything looking at the price difference between the end of our ‘test’ period and the start of our ‘test’ period. We need to build our model over a subset of this data, and test it against the remaining data.
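
To make that concrete, the split itself is trivial once the data is loaded – a sketch in pandas, with an arbitrary cutoff year purely for illustration:

#!/usr/bin/python

import pandas as pd

prices = pd.read_csv('average_sale_price_by_town_and_month.txt', sep='\t', header=None,
                     names=['Year', 'Month', 'Town', 'Average_Price'])
## Build the model on everything up to the cutoff and test it on what's left
cutoff_year = 2009
train = prices[prices['Year'] <= cutoff_year]
test = prices[prices['Year'] > cutoff_year]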

For my next trick (blog post) I’ll look at predicting the fastest growing regions. As a sneak peek, to do that we’ll use growth % as the metric we’re trying to predict (a continuous variable) and we’ll create features out of the input data set. I don’t know which method we’ll use yet but it’ll be one of linear regression, SVM regression or neural networks. Likely whichever is best supported by the language I choose to use. I’ve used Libsvm before and found it very good so maybe that.

What we do in life echoes in eternity.
