Data Science, amongst other things.

Author: Matthew Sharpe (Page 1 of 3)

Centre of the Edinburgh Fringe Festival 2017

Centre of the Fringe Festival 2017

See a few posts below for the full details of the map – basically it’s the weighted centre of the Fringe festival.


  • That looks obviously wrong. Are you sure it’s right?
    • Honestly, no. Seems pretty fishy to me too. I’ll investigate if I get time but in all honesty, I just ran the same scripts as I wrote last year and it all seemed to work and output answers. This was honestly not much effort at all.
  • Did anybody get you a drink last year as a result of this?
    • No.
  • So it looks like the overall centre of the Fringe moved. Where has it moved to?
    • Basically directly over my work. If you’re in the area and fancy a coffee give me a shout.
  • Can you really call them FAQs if nobody has ever asked any of them?
    • You wouldn’t have thought so, and yet here we are.
  • How do you calculate this again?
    • See the post from last year. Any more questions and let me know.
  • Can I have the scripts so I can run it and do a better job than you?
    • Yes, but let me know. It kind of involves scraping the Fringe website and I don’t want to put that out in the public domain.
  • Given it seems obviously wrong, couldn’t you have checked before posting?
    • Follow-up questions aren’t encouraged. But, it turns out by chance that I posted last years version of this exactly a year ago and I thought I mayerswell try to get this out tonight.


So it goes.

UK Political Manifestos

Hey all,

Standard apologies about not posting for ages and all that. However, assuming you don’t care – let’s get onto the business of things. Firstly – this is a plot of the document similarities between all of the Tory/Labour manifestos from 1979 to 2017:

Document Similarity

Party manifesto cosine similarity

Is that interesting? Maybe. In general, a party’s manifesto is most similar to the same party’s manifesto of the previous election. The most similar manifestos are Labour’s of 2005 and 2010, closely followed by the Tory’s 1992 and 1997 efforts (if it works, it turns out you might still need to fix it!)

Most interestingly for me is the Tory shift in 2010 – the manifestos it was most similar to is Labour starting at 1997 and going all the way through to 2015. If you think that David Cameron and Tony Blair were political bedfellows then the manifestos might not disagree with that view.

So what about in 2017? Labour’s manifesto has managed something that no other manifesto has managed – it’s more similar to every other Labour manifesto than it is to any Tory manifesto. In contrast, Theresa May’s effort leans on Labour’s recent history as well as David Cameron’s time in charge of the Conservatives.

Similar Words

When we say a document is similar to another document, is there a way for us to see what that means and how it works? Well let’s look (for chosen pairs of manifestos) which words are shared. I’m interested in which words are most important to two documents but which aren’t important (or don’t feature) in any of the other documents. Based on that intuition, let’s look at some examples…

Labour 2017 – Tory 2017

– Brexit
– leave European Union
– protections
– devolved administrations

Labour 1983 – Labour 2017

– workers rights
– education service
– publicly owned

Labour 1997 – Tory 2010

– low carbon
– change society
– welfaretowork

I think that’s pretty cool! So we can see which themes are shared across manifestos! Could we build a predictive model to work out what makes a winning manifesto? Absolutely! Would it likely lead to massive overfitting and be practically useless? I’d have thought so.


For those interested in exactly how I did the above, see this repo: repo

As a general summary:

1.) Clean the manifesto text and remove any ‘words’ that are just numbers.
2.) Remove English stopwords and generate 3-grams (“I went to”, “went to the”, “to the shop”).
3.) Perform TF-IDF (work out which words are most important in a document, and across all documents).
4.) Calculate cosine similarity to work out how similar each document is to each other document
For the common words I’ve done something a bit different and written my own algorithm (because I couldn’t find one that did what I wanted it to). If you know of a better way of doing this (finding terms for maximising similarity between documents while minimising for all other documents) let me know!

1.) For each combination of manifesto pairs, generate a column vector with a 1 when the entry corresponds to one of the manifesto pairs and a -1 when the entry doesn’t.
e.g. (1,1,-1,-1,-1…) would be the vector for (Labour1979, Labour1983) (1,-1,-1,-1,1,-1,…) would be the vector for (Labour1979,Labour1997).
2.) Stick all of those columns together to form a matrix (A).
3.) Multiply the transpose of the TFIDF matrix (B) with the combinations matrix (A).
4.) Each column in the resultant matrix (C) now has a score for every single word present across all of the corpuses. The highest scoring words will be ones that have a large TF-IDF score in the two documents we’re interested in, but a small TF-IDF score in all the other documents.
5.) As such, pick to the top N for each column in the resultant matrix (C) and map them back to the feature names that generated the TFIDF matrix (A) in the first place.

Sorry if that explanation isn’t especially clear – have a look at the code and let me know if there’s a better way of doing things!

Choose love, Manchester.

Centre of the Edinburgh Fringe Festival

What is that map all about?

It shows the weighted centre of all of the shows at this year’s Edinburgh Fringe Festival. If you want to be at the centre of the action, staying near there isn’t a bad idea. Use the filters to show you where the centre is for different categories (Theatre, Comedy etc.).

Go On…


  • All of the Fringe shows (with date, time, category and subcategory)
  • All of the Fringe venues (with latitude and longitude)


  • Apply a weighting to each show, which is just the number of times it’s on.
  • Get the longitude and latitude of that show’s location.
  • For each subcategory, category, and overall, find the midpoint (according to the weightings calculated above).
  • Plot

Easy. No fuss.

If you’re still curious, ask away. If you’d like to give me beer because I put your bar near the centre of things, good. Do. If you’d like me to find the midpoint of something, let me know. If you’d like me to stop promising to write more and then not doing so, sorry, that’s just unrealistic.

My geekness is a-quivering.

p.s. if there’s sufficient interest I’ll do a tutorial on how to get the data, find the midpoints and then put together the visualization.

Generating B2B sales data in Python


Look, I’m sorry. Yet again, I’ve not written any blog posts for ages. Let’s all get over it and move on to something more important. Sales. Let’s imagine you’re an organisation selling B2B. You use Salesforce (or any other platform). You’ve got plenty of opportunities and a history of those opportunities. You’ve gone and built a sales pipeline.

Good work. That’s not an easy thing to do.

Now you want to use that pipeline to get better at sales. You want to use the data you’ve got to help forecast what you’ll do in the future. You want to know the value of what you’ve already got in the pipeline. You want to know what the most valuable activities you perform are. I’m not going to be able to fit all of that into one post so I’ll break things up into parts and (I’ve said this before only to underdeliver) FINISH THE SERIES.

However, for part 1 I’m actually only going to focus on generating some dummy data to play with. “What!? That’s none of the things you said you’d do!” No. It’s not. However, if you’re able to find me a B2B company with a small number of sales who are willing to publicly share all their data then fair play to you. Lacking that I’m going to have to create a dummy set of data and make it halfway believable. In doing this I’ve made a few assumptions (that I’m later going to try to show). It’s a bit circular but don’t be that guy. What I’m doing is broadly legit and if you look at the data and don’t think it’s reasonable then I’m providing the code so you can change whichever bit you find egregious. Even better, just use your actual company’s sales data (assuming you’re lucky enough to have it).

I’ll be building a dataframe that resembles a Salesforce pipeline – it’s going to have the following rows:

Stage – this is the ‘Salesforce/Hubspot/<don’t care>’ stage in the pipeline. Measures how far along an opportunity is.
Name – got to keep track of the opportunities using something
Value – how much money are we going to make from this opportunity. Daily, Monthly, Annually. Doesn’t matter.
Days – this is the date the opportunity entered the stage given. Going to be important later for time-dependence stuff.

So, let’s begin (all code also available here)

import numpy as np
import random
import matplotlib
from matplotlib import pyplot as plt
import datetime
from datetime import datetime as dt
from scipy import stats
import pandas as pd

def weighted_pick(weights, n_picks):
    t = np.cumsum(weights)
    s = np.sum(weights)
    return np.searchsorted(t, np.random.rand(n_picks)*s)

pre_stages = [('Contact initiated', 0.8, 10), ('Meeting booked', 0.6, 20), ('Trial booked', 0.4, 15), ('Proposal sent', 0.3, 25), ('Contract sent', 0.2, 10)]

closed_stages = ['Closed Won', 'Closed Lost']

success_stages = ['Closed Won']

Here I’m declaring a few things that are going to be useful to me later. I want all of the stages in the pipeline that I care about, the closed stages and the success stages. The code is probably a bit brittle regarding the random addition of closed and success stages but is fine for new ‘pre_stages’. The parameters are the probability that the opportunity will fall out of this stage (rather than move on successfully) and something else that we’ll talk about later.

WORDS = open('/usr/share/dict/words', 'rb').read().splitlines()


sales_opportunities = [(entry.title(), np.random.normal(AVERAGE_SALE_PRICE, SD_SALE_PRICE)) for entry in np.random.choice(WORDS, NUM_POINTS, replace=False)]

Here I’m generating a list of ‘company names’, picking words randomly from a dictionary. In all honesty, just looking through the list of company names is pretty fun in itself. I’m also assuming that the revenue I make from my product is a normal distribution with mean and standard deviation given as ‘AVERAGE_SALE_PRICE’ and ‘SD_SALE_PRICE’. Not rocket science. But it is an assumption I’m making – let’s chalk it down. First assumption: revenue/client is normally distributed. Then we build a list of sales opportunities and their value.

start_date = - datetime.timedelta(days = 365*2)
days_range = range(365*2)
y = [float(entry)/365. for entry in days_range]

indices = weighted_pick(np.exp(y), NUM_POINTS)

Second assumption I’m going to make in generating this data – you’re working for the right kind of start-up/business. Basically, the number of opportunities created are going to broadly follow an exponential distribution. That is, you specify how many opportunities enter the pipeline with ‘NUM_POINTS’ and we’re going to distribute those according to an exponential distribution. I’m saying that the company starts 2 years ago – again, change if you don’t like it.

sales_data = [[pre_stages[0][0], name_value_pair[0], name_value_pair[1], start_date + datetime.timedelta(days = index)] for name_value_pair, index in zip(sales_opportunities, indices)]

remaining_opportunities_frame = pd.DataFrame(sales_data)
remaining_opportunities_frame.columns = ['Stage', 'Name', 'Value', 'Days']

sales_data_frame = pd.DataFrame(sales_data)
sales_data_frame.columns = ['Stage', 'Name', 'Value', 'Days']

finished_list = set([])

OK. Now I’ve got the first set of entries that’ll make up my final dataframe – it’s all of the opportunities with the value (generated from a normal distribution) and the time the opportunity entered the pipeline (generated via an exponential distribution). I’m going to create a few things for later, namely a dataframe containing all of the live opportunities and our final dataframe containing all the rows we’re going to care about.

for stage_index, stage in enumerate(pre_stages[1:]):

    next_stage = pd.DataFrame([(sales_opp[1], index, np.argmax(entry)) for sales_opp in sales_data for index, entry in enumerate(np.random.multinomial(1, [0.99, (1. - stage[1])/100., stage[1]/100.0], ( - sales_opp[3]).days)) if entry[0] != 1 and sales_opp[1] not in finished_list])

The above line is where it’s all at. Let me explain slowly and then again, even slower. My intuition is this – I think that the probability that an opportunity converts (moves from its current stage to the next stage) is proportional to the negative exponential of the time spent in that stage. Let’s be clearer. I’m going to make the third assumption – that the probability of moving to the next stage broadly follows a negative exponential. What’s more, I think that each stage will have its own characteristic drop off rate (or half-life, for those of you thinking this looks mightily like radioactive decay). You know how before I said I’d added a parameter to ‘pre_stages’ and I’d explain it. That’s what ‘pre_stages[x][2]’ is. So, for a given stage in the sales pipeline, for each opportunity left in the previous stage, for every day between when the opportunity entered the stage and now I run the multinomial line. The multinomial line is going to return a binary array of three elements where exactly one of the elements is filled. The first element will be filled in 99% of cases – I’ve chosen to set this and if you don’t like it then change it to something else. It means that, for every day between the opportunity entering the state and today there’s a 99% the opportunity will still be in that state at the end of the day. If the second element is filled then that means that the opportunity succeeded on that particular day (with probability given by the stage parameter). Finally, if the third element is filled then the opportunity died on that particular day. ‘Index’ gives us the number of days that’ve happened since the opportunity entered the stage and the argmax gives us whether we succeeded or failed (you’ll see we’re ignoring days when we neither succeeded or failed).

    next_stage.columns = ['Name', 'Days', 'Status']

    meh = next_stage.ix[next_stage.groupby('Name').Days.idxmin()]
    tempy_frame = meh.merge(remaining_opportunities_frame[['Name', 'Value', 'Days']], how='inner', on='Name')

    tempy_frame['new_date'] = tempy_frame.apply(lambda x: x.Days_y + datetime.timedelta(days = x.Days_x), axis=1)

    tempy_frame = tempy_frame[['Name', 'Value', 'new_date', 'Status']]
    tempy_frame.columns = ['Name', 'Value', 'Days', 'Status']

    success_frame = tempy_frame[tempy_frame.Status == 1]
    success_frame = success_frame.drop('Status', 1)
    success_frame.insert(0, 'Stage', pre_stages[stage_index + 1][0] if stage_index + 1 < len(pre_stages) - 1 else success_stages[0])

    failure_frame = tempy_frame[tempy_frame.Status == 2]
    failure_frame = failure_frame.drop('Status', 1)
    failure_frame.insert(0, 'Stage', closed_stages[1])

    sales_data_frame = sales_data_frame.append(success_frame).append(failure_frame)

That was a crazy line – but it contained most of the interesting stuff we do. From here on in we grab the first of the days that the opportunity moved (we actually kept all of the days in the above line but we’re only allowing each opportunity to move out of each stage once!), add the number of days to the original date we entered the stage to find the day we move into the next stage and then create the rows that we need.

    finished_frame = sales_data_frame.groupby('Name').apply(lambda x: x.Stage.isin(closed_stages).any())
    finished_list = set(finished_list).union(set(finished_frame[finished_frame == True].index.values))
    remaining_opportunities = remaining_opportunities_frame[~remaining_opportunities_frame.Name.isin(finished_list)]

Finally, there’s a bit of tidying up to make sure that we don’t calculate anything for any of the opportunities that have died

dates = matplotlib.dates.date2num(sales_data_frame[sales_data_frame.Stage == success_stages[0]].sort('Days').Days.astype(dt))
revenue = sales_data_frame[sales_data_frame.Stage == success_stages[0]].sort('Days').Value.cumsum().values

plt.plot_date(dates, revenue, 'b-')
plt.title('Company revenue over time')

sales_data_frame.to_csv('generated_data.csv', index=False)

Quite a lot of work, really, just to generate some ‘likely looking’ sales data. Again, if you’ve got your own then use it! However, up till now I’ve just asserted that it’s likely looking. If you play around with it you can actually see some pretty interesting stuff. Firstly, with lots and lots of data point (N = 8000) you see that the company revenue growth looks very exponential:


However, it’s unlikely that you’ve got 8000 B2B transactions in your sales pipeline (if you do, kudos!). Let’s examine the situation where you’ve got 150:


And a once more with 150:


I think it’s interesting that, even though we’ve literally built this whole pipeline using exponential growth – we still look flat in a lot of places. Hopefully that might provide some solace if you’re struggling with sales and think you’re not hitting your exponential growth. Play around with the parameters and you can see what sort of effect increasing your conversion at various stages has on your overall revenue etc. Or just read the company names – they’re also pretty good.

Right, I’m counting that as broadly done. We’ve got sales data that nobody will mind me analysing in a public forum. Stay tuned/subscribe/email me to keep in touch for part 2. We’ll imagine that we’ve started with this data and we’ll try to assign a total value to our pipeline, and maybe even get onto predicting how many opportunities will progress in the next N days.

Finding Topics in Harry Potter using K-Means Clustering.

I’ll open up with the money-shot – these are all of the clusters that I was able to find using the whole Harry Potter and grouping by chapter:


Every cluster plotted separately.

That’s far too messy to be of any practical use so let’s have a look at a couple of those clusters in more detail:

Privet Drive Cluster

One of the clusters – a Dursley/Privet Drive heavy cluster!



This is a pretty Griphook/Goblin heavy cluster featured on the storyline in book 7.

Hopefully that’s piqued your interest enough to continue on scrolling and see how we got these clusters – and see the words that tie them all together! The code for generating these is on my Github ( and all the graphs and documents are contained there.

I went to a really interesting talk at PyData that was about Latent Dirichlet Allocation, a topic entirely new to me. I thought I’d love to apply it to my favourite book series – Harry Potter. However, that didn’t happen…instead you get this. A heavy rip off of an excellent post ( that walks through how to cluster documents using a bunch of techniques including K-Means.

Step 1

Get plain text copies of all the Harry Potter books and make sure they’re all formatted in roughly the same way. As is often the case, this step took bloody ages.

Step 2

I want a few different documents – more than 7 (the number of Harry Potter books for the heathens out there) but substantially less than the number of sentences in all 7 books. Treating each chapter as a separate document seemed to make most sense and so here I initialise everything I need and split my books into chapter.

So at this point we’ve nicely got ourselves a list of chapter titles and a list of the associated text (as a string).

Step 3

Now we’ve got our chapters we’re going to want to tokenise the text in them. Basically this means converting a string into discrete tokens – what we’d think of as words. The reason it’s got a fancy name and I’m being a bit careful about my terminology is because tokenisation also takes care of things like punctuation and isn’t as simple as just splitting a string into separate words. Having said that, I’ve basically taken the path of least resistance and so have gone with a very simple tokenisation scheme. I’m also going to skip over the pain of utf-8 encoding/decoding/recoding. I’ve basically just dropped any character that I’ve found in the least bit complicated.

Step 4

Next I’m going to perform TF-IDF on my chapters. Here, I convert each token into a number and look at how many times that token appears in a given document, and how many documents that token appears in overall. So in this instance I’m looking to see how often a given word appears in this particular chapter and in how many chapters throughout all 7 books the token appears in. This gives us an idea of how important/prominent a word is in a given chapter, taking into account how common the word is throughout all the chapters. As an example ‘Harry’ is likely to feature a lot in a given chapter but also is likely to feature in every chapter and so probably isn’t especially important to any given chapter’s classification. ‘Nicolas Flamel’, however, is going to appear a reasonable amount in a few chapters but not at all in all the rest. We therefore know that Nicolas Flamel is important in the chapters he does appear in.

Luckily, I don’t have to worry about the implementation of TF-IDF – sklearn has got it.

This gives me a large sparse matrix with the TF-IDF score for each word in each document as the entries. If you pay close attention to the parameters I’ve passed along to TfidfVectorizer as they are ripe for the changing. Firstly, `max_df=0.75` is saying that I don’t care about words that appear in more than 75% of the chapters. `min_df=0.05` is saying that I don’t care about words that appear in fewer than 5% of chapters. You can see I passed along my tokeniser and that I’m using English stop-words (that is, I’m removing the most common English words). Finally, I’m generating n-grams between 1 and 4.

For those uninitiated with n-grams they’re basically a way of splitting text up into handy little chunks. As an example, 3-grams of the following sentence:

“This is not the greatest song in the world”

would be:

“This is not”, “is not the”, “not the greatest”, “the greatest song”, e.t.c.

This allows me to pick out common phrases such as “Snape said” and “wizarding world”. Again, that’s a setting that is begging to be played about with.

Step 5

Performing K-means clustering we get output like so:

Cluster 18 words:
Top words: maxime,madame maxime,karkaroff,madame,hagrid,cedric,moody,krum,champions,tournament

Chapter: the hungarian horntail, Book: gobletOfFire
Chapter: the goblet of fire, Book: gobletOfFire
Chapter: the four champions, Book: gobletOfFire
Chapter: beauxbatons and durmstrang, Book: gobletOfFire
Chapter: the beginning, Book: gobletOfFire
Chapter: the yule ball, Book: gobletOfFire

which is obviously quite a handy little cluster. It’s successfully managed to only take chapters from one book (given that we’ve not allowed K-means access to the book information that is a bit of a triumph). For those aware of Harry Potter you’ll see this is a Triwizard Tournament heavy cluster. Another example:

Cluster 10 words:
Top words: wormtail,cold voice,voldemort,lord,cauldron,riddle,cedric,man,master,faithful

Chapter: the riddle house, Book: gobletOfFire
Chapter: flesh, blood, and bone, Book: gobletOfFire
Chapter: the death eaters, Book: gobletOfFire

Again, all the clusters are from one book – I’m counting that as a result. Added bonus, they’re not sequential chapters! I happen to know (by being a massive Harry Potter geek) that these chapters are towards the start and end of book 4 and focus heavily on Voldemort and Peter Pettigrew.

So far so good and in fact I could stop here but wouldn’t it be nice to visualise those clusters so we can see the topics we’ve picked out graphically? Even if you said no then it doesn’t really matter. I’m still going to do it.

Step 6

First things first, let’s plot all of the chapters on one graph and colour code them with the book from which they came. It’s not a great way of visualising clusters but it is a great way of seeing how everything is laid out:


A messy picture of all the chapters projected into 2d space.

Again, this section is shamelessly copied from the aforementioned blog but ultimately we’re projecting the cosine differences between the tf-idf matrix terms into 2-dimensional space. I declare a colour dictionary for each of the books and then rattle through the chapters plotting them.
I’m sure you’ll agree that’s far too messy for anybody to really do anything with. If you’re following along with the code you’ll see that next I generate the subplot figure shown at the top.

Finally, I create plots for each of the clusters – a few examples of which are:

Cluster 3 words:
Top words: umbridge,professor,dont,professor umbridge,snape,sirius,im,said hermione,harrys,youre

Chapter: the muggle born registration commission, Book: deathlyHallows
Chapter: the hogwarts high inquisitor, Book: orderOfThePhoenix
Chapter: o.w.l.s, Book: orderOfThePhoenix
Chapter: the second war begins, Book: orderOfThePhoenix
Chapter: educational decree number twenty four, Book: orderOfThePhoenix
Chapter: the centaur and the sneak, Book: orderOfThePhoenix
Chapter: percy and padfoot, Book: orderOfThePhoenix
Chapter: detention with dolores, Book: orderOfThePhoenix
Chapter: occlumency, Book: orderOfThePhoenix
Chapter: in the hogs head, Book: orderOfThePhoenix
Chapter: out of the fire, Book: orderOfThePhoenix
Chapter: professor umbridge, Book: orderOfThePhoenix
Chapter: fight and flight, Book: orderOfThePhoenix
Chapter: seen and unforeseen, Book: orderOfThePhoenix
Chapter: career advice, Book: orderOfThePhoenix
Chapter: snapes worst memory, Book: orderOfThePhoenix


Umbridge's cluster

Umbridge’s cluster

Everybody’s favourite professor – Dolores Umbridge!

And another:

Top words: hagrid,yeh,ter,said hagrid,professor,said hermione,malfoy,o,professor trelawney,trelawney

Chapter: professor trelawney's prediction, Book: prisonerOfAzkaban
Chapter: talons and tea leaves, Book: prisonerOfAzkaban
Chapter: the firebolt, Book: prisonerOfAzkaban
Chapter: diagon alley, Book: philosophersStone
Chapter: the foribidden forest, Book: philosophersStone
Chapter: the keeper of the keys, Book: philosophersStone
Chapter: norbert the norwegian ridgeback, Book: philosophersStone
Chapter: hagrids tale, Book: orderOfThePhoenix
Chapter: the eye of the snake, Book: orderOfThePhoenix
Chapter: grawp, Book: orderOfThePhoenix
Chapter: hermione's helping hand, Book: halfBloodPrince
Chapter: rita skeeter's scoop, Book: gobletOfFire

and visually:

Grawp's brother

Keeper of the Keys

The anti-Umbridge – it’s Hagrid’s cluster!

I’ll stop on the random copy/pasting of the clusters and stick them all on my Github – I think you’ve got the idea! All in all I’m pretty happy with how this has worked but I am very dependent on individual character names. I tried just looking at 2-grams but it usually just gave me ” said” with a few exceptions (‘wizarding world’, ‘said softly’, ‘death eater’, ‘godrics hollow’). I’ve also put almost zero effort into formatting the images – you know roughly what it’s meant to look like: having the pictures look good is an exercise I’m leaving to the reader’s imagination.

There’s loads more stuff I could do but I’m going to eat a chicken and go swimming so it’ll have to wait.

All was well.

Emailing multiple inline images in Python

Hi all,

Yet again, sorry for the lack of blog posts. In my defence, I’ve been keeping busy what with the Mining of Massive Data Sets course on Coursera and a few Kaggle problems amongst other distractions. I legit have at least double-digit subscriber numbers so to those people, thanks! Ignoring all that, today I’m going to share with you something I had to puzzle out myself due to lack of information online about it. In sharing, hopefully the next poor schmuck who has to do this will just be able to copy what I’ve done.

We’ll look at building a fairly robust email; one which will feature multiple inline images and multiple attachments. We’ll do the whole thing using Gmail because it’s good and easy to use Gmail and everybody can create one. The original version I wrote used Outlook and integrated Windows security but not everybody has that so Gmail it is. As a bit of background, I used this as part of a reporting pipeline that queries the Google Analytics API. I grab a whole bunch of data, filter and arrange it, combine it with another data source and produce some tables/graphs. These get saved down into a directory every day and my emailer will come along, pick up all the relevant images and hey presto – you’ve got a fairly cool reporting pipeline.

First things first, this is how you send lots of basic plain-text emails.

import smtplib

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('', 587)
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg)

for i in range(100):
    send_email('Hello everybody', gmail_user, gmail_pwd, [''])

Would the below code send 100 of the same email to the same person? Yes. Does that have the potential to be abused for perhaps-hilarious purposes. Of course. I’ll leave that up to you.

Next, let’s take a fairly easy step and attach a few files.

from email.mime.multipart import MIMEMultipart
from email.header         import Header
import smtplib
from email.MIMEBase import MIMEBase
from email import Encoders
import os

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def attach_file(filename):
    part = MIMEBase('application', 'octect-stream')
    part.set_payload(open(filename, 'rb').read())
    part.add_header('Content-Disposition', 'attachment; filename=%s' % os.path.basename(filename))
    return part

def generate_email(gmail_user, to_list, data_path):
    msg = MIMEMultipart('related')
    msg['Subject'] = Header(u'Test Attachment Email', 'utf-8')
    msg['From'] = gmail_user
    msg['To'] = ','.join(to_list)
    return msg

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('', 587)
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg.as_string())

email_msg = generate_email(gmail_user, [''], 'test_data.txt')
send_email(email_msg, gmail_user, gmail_pwd, [''])

All you need to do to allow multiple attachments is call the msg.attach(attach_file()) method a few times. Well, as many times as you’d like attachments would be ideal.

Glorious – so we can email multiple people with multiple attachments. You could argue at this point that, if we can attach the images we’re looking to send then surely we don’t need to create inline images and so we can cease this madness. If you can argue that point successfully then you’re better at arguing than me. Something to do with Blackberry compatibility, looking nicer and “can’t you just not argue and do it for once – is it possible or is it not?”

So we’ll be embedding images to the above email – the way to do so is…

import cgi
import uuid
from email.mime.multipart import MIMEMultipart
from email.mime.text      import MIMEText
from email.mime.image     import MIMEImage
from email.header         import Header
import os
import smtplib
from email.MIMEBase import MIMEBase
from email import Encoders

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def attach_image(img_dict):
    with open(img_dict['path'], 'rb') as file:
    msg_image = MIMEImage(, name = os.path.basename(img_dict['path']))
    msg_image.add_header('Content-ID', '<{}>'.format(img_dict['cid']))
    return msg_image

def attach_file(filename):
    part = MIMEBase('application', 'octect-stream')
    part.set_payload(open(filename, 'rb').read())
    part.add_header('Content-Disposition', 'attachment; filename=%s' % os.path.basename(filename))
    return part

def generate_email(gmail_user, to_list, data_path_1, data_path_2,img1,img2):
    msg =MIMEMultipart('related')
    msg['Subject'] = Header(u'Images and Words', 'utf-8')
    msg['From'] = gmail_user
    msg['To'] = ','.join(to_list)
    msg_alternative = MIMEMultipart('alternative')
    msg_text = MIMEText(u'Image not working - maybe next time', 'plain', 'utf-8')
    msg_html = u'<h1>Some images coming up</h1>'
    msg_html += u'<h3>Image 1</h3><div dir="ltr">''<img src="cid:{cid}" alt="{alt}"><br></div>'.format(alt=cgi.escape(img1['title'], quote=True), **img1)
    msg_html += u'<h3>Image 2</h3><div dir="ltr">''<img src="cid:{cid}" alt="{alt}"><br></div>'.format(alt=cgi.escape(img2['title'], quote=True), **img2)
    msg_html = MIMEText(msg_html, 'html', 'utf-8')
    return msg

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('', 587)
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg.as_string())

img1 = dict(title = 'Image 1', path = 'test_image_1.png', cid = str(uuid.uuid4()))
img2 = dict(title = 'Image 2', path = 'test_image_2.png', cid = str(uuid.uuid4()))

email_msg = generate_email(gmail_user, [''], 'test_data.txt', 'test_data_2.txt', img1, img2)
send_email(email_msg, gmail_user, gmail_pwd, [''])

So all I need to do is make sure I’m downloading the right images and data files everyday and putting them in the correct directory. Ordinarily I’d do that using wget on Unix but as I’ve done the rest in Python I’ll include the download in the script. This way the script should work irrespective of OS – a handy little bonus.

Get that run and you’ll see a lovely set of inline images, all the attachments as they were and we’re done. Nothing too complicated I hope and we’ve made a nice little ’email image aggregation program’ thingy.

As a possible extension I think this could be better used to annoy/amuse and so you could create an ’email digest’ – make something that automatically goes on a few websites, downloads an image and then sticks them all inline in an email. It’ll also download a few data files and attach them. Let’s say I really like the NASA Astronomy picture of the day (I do), the XKCD comic (I do), the Google doodle (I do) and a CSV of the week’s weather forecast (I do not). I could easily set up a bash script (using wget) and a cronjob that would get the images/csv files each day and put them into a directory. Equally, if we wanted to play nicely with Windows as well as *nix, we could use Python to download the images. I’m not going to build this as it’s not really a problem that needs solving (as far as I see it).

Peace out yo.

Building a Recommendation Service on AWS with Mahout

Hi all,

Those (two) of my regular readers will know I frequently curse my fool-of-a-took laptop. With this in mind I’ve been meaning to write a tutorial on working with Amazon’s web services for when you don’t have the hardware you need. This isn’t that post but I will assume you’ve already got a cluster to hand with all the bits and bobs you need installed.

If you don’t, don’t worry about it (for now) – I’ll do a post on how to get one of those set up at a later date. For now I’d like to talk about collaborative filtering. This is one of the most prevalent techniques for generating recommendations and with good reason – it’s pretty good. It’s quite a common sense method:

I know that person A likes items 1, 2 and 3 but not item 4.
I know that person B likes item 4 and 3 but not 1 or 2.
I know that person C likes item 1  but not 2, 3 or 4.
I know that person D doesn’t like item 4 – which of the items 1, 2 or 3 should I recommend them?

I’m sure most people reading this will say we should recommend them item 1, then probably item 2 then 3? Certainly item 1 with highest preference. Now if you deconstruct your thought process in coming up with that recommendation I’d wager (not a lot) that it’d be akin to the collaborative filtering algorithm.

Based on the behaviour of people A, B and C we can start to get an idea about which items are rated favourably by the same people versus those that aren’t. User-based collaborative filtering is about finding similar users to person D and looking at what they liked. Item-based collaborative filtering is about finding similar items to the ones person D has rated positively and recommending those.

There are lots of nice articles on collaborative filtering – more often than not when it comes to machine learning I end up recommending Andrew Ng material and this is no exception – check out his machine learning course on Coursera. There’s a whole lecture on collaborative filtering and it’s boss. If my explanation wasn’t enough (if you’re serious about this sort of thing, it shouldn’t be) check out his lectures on the subject and read around on the mathematical details of it all. Make sure you polish up on your linear algebra – you’ll have very large sparse matrices heading your way!

Also, please note that this whole post is just a re-imagining of this post by Amazon: Recommendations using EMR – I liked it and thought I’d give it a go myself with a different data set and technologies I was more familiar with. Now on with the show…

The Data Set

I’m going to be using a publicly available data set containing user ratings of books gathered on Amazon – dataset available here: Book Data Set

There are three files in there – one with all the details of the books, one with all the details of the users and finally one with all the user ratings. Don’t worry about downloading them yet if you’re following this as a tutorial/example – we’ll stick that bit in a script later on.

The Cluster Set-Up

Given that I’m on the free tier and I don’t use AWS at work (we’ve got a nice cluster that handles all our data just fine thank you) I’m a bit ‘over-compute’ happy at the moment and so I’m going to use a 5 node cluster (1 master node and 4 slaves). This isn’t really necessary but then nor is this as a piece of work so I’m not that bothered by practicality at this point. The benefit of using Amazon’s Elastic Mapreduce is that the nodes all come with Hadoop (& Hive/Pig/Mahout/whatever) installed and set up so you don’t have to bother with any of that.

Building a Service

I want to demonstrate how easy it is to turn this sort of thing into a recommendations service (instead of a one-off bit of analysis). Now I’m going to stress that I’m setting this up as an offline batch machine learning method and that there are a few steps I’d change if I were actually setting this up as a real, ‘production’, service.

I’m going to use a Redis database because I’ve got a reasonable amount of memory on these boxes (Redis is an in-memory key-value store), because Redis is really really quick and because I’m used to using Redis with Python. Redis will store the output of my collaborative filtering and I’ll stick a lightweight web framework (Flask) on top of it. This will allow a client to retrieve recommendations for a given user as well as delete them or overwrite them. I’ve decided to add the latter two functions because we don’t live in a perfect world.

I’ll also write a script that’ll update the books, users and ratings files and stick it on a cronjob to run every day. That way, by copying this method you’ll end up with a reasonably scalable, reasonably fast and reasonably up-to-date recommendation service. It’s not the best one out there (heck, it might be the worst) but it goes to show how simple these things are to set up given all the tools that are already written…

The Code

Firstly, we’ll create the cluster using the Elastic Mapreduce CLI:

./elastic-mapreduce --create --alive --name recommendation-service --num-instances 5 --master-instance-type m1.large --slave-instance-type m2.xlarge --ami-version 3.1 --ssh

The --ssh means that once the cluster has been created and all the necessary bits and bobs installed we ssh into the master node

Extracting the data set

sed '1d;' BX-Book-Ratings.csv | tr ';' ',' | tr -d '"' > Book-Ratings.csv}

Unfortunately for us, we’ve got a bit of work to do – the ‘out of the box’ Mahout recommendation algorithm requires user and book ids to be integers and ISBNs are not. To get around this, we use this nifty little script:


import sys
import re

isbn_dictionary = {}

counter = 0

with open('Book-Ratings.csv', 'rb') as f:
    with open('New-Book-Ratings.csv', 'w') as g:
        with open('Book-Mappings.csv', 'w') as h:
            for line in f:
                    user, book, rating = line.split(',')
                    index = isbn_dictionary[book]
                    isbn_dictionary[book] = counter
                    index = counter
                    h.write("%s,%dn" % (book, counter))
                    counter += 1
                g.write("%d,%d,%dn" % (int(user), index, int(rating)))

Now we’ve got data in the format our Mahout recommender is expecting let’s put it in HDFS and run our algorithm

hadoop fs -put New-Book-Ratings.csv /New-Book-Ratings.csv
mahout recommenditembased --input /New-Book-Ratings.csv --output recommendations --numRecommendations 10 --similarityClassname SIMILARITY_COSINE

Now we play the waiting game…
Actually, while that’s running, open up another SSH session to that terminal and run these commands

sudo easy_install redis flask
tar xzf redis-2.8.13.tar.gz
cd redis-2.8.13
./src/redis-server &

Now Redis is installed and running and we’ve got the necessary Python libraries we can create the following server…

import redis
import os

isbn_dictionary = {}

with open('Book-Mappings.csv', 'rb') as f:
    for line in f:
        isbn, counter = line.split(',')
        isbn_dictionary[int(counter)] = isbn

r = redis.StrictRedis(host='localhost', port=6379, db=0)

p = os.popen('hadoop fs -cat recommendations/part*')

for recommendation in p:
    user, recommendations = recommendation.split('t')
    recommendations = [entry.split(':')[0] for entry in recommendations.replace('[', '').replace(']','').split(',')]
    recommendations = [isbn_dictionary[int(entry)] for entry in recommendations]
    r.set(int(user), recommendations)

Now we’ve populated our database with all the recommendations we generated with our Mahout job. Now we’ll set something up (called to return these items to a client (as well as allow modifications to be made).

import redis
from flask import Flask
r = redis.StrictRedis(host='localhost', port=6379, db=0)

app = Flask(__name__)

@app.route('/user/<user_id>', methods = ['GET', 'POST', 'PUT', 'DELETE'])
def restful_api(user_id):
    if request.method == 'GET':
        recommendations = r.get(int(user_id))
        return recommendations if recommendations else "User %d does not exist" % int(user_id)
    elif request.method == 'POST':
        if request.headers['Content-Type'] == 'application/json':
            r.set(int(user_id), request.json)
            return "Successfully set recommendations for user %d" % int(user_id)
            return False
    elif request.method == 'DELETE':
        if r.exists(int(user_id)):
            return "Successfully deleted recommendations for user %d" % int(user_id)
            return "User %d does not exist" % int(user_id)
    elif request.method == 'PUT':
        if request.headers['Content-Type'] == 'application/json':
            r.set(int(user_id), request.json)
            return "Successfully set recommendations for user %d" % int(user_id)
            return "Require JSON put"
        return "No idea what you've done here, but don't do it again"

if __name__ == '__main__':

Now we’ve built our client-facing recommendations service let’s ‘deploy’ it and give it a test

python &
curl -X GET localhost:5000/user/277965
curl -X DELETE localhost:5000/user/277965
curl -X GET localhost:5000/user/277965


We’ve not done anything that tricky but in a fairly short time we’ve managed to build something that’d be at least part way useful to a small business somewhere looking for recommendations for their users. It’s not a great leap from this and a genuine production ready system – you don’t need 5 Hadoop nodes running all the time for a 20 minute Mahout job and I’m uncomfortable using Flask and local files for a serious system. As such, you’d likely put together a couple of nodes, each with their own copy of the recommendations set in memory. One of them would create a cluster, get any new data a generate the data set as often as you saw fit – communicating it to the other node. Finally, you’d add a few more actions to the server and make sure it could be called by whichever clients need it (and block everybody else).

As I say, nothing earth-shattering but a solid base on which to build a recommendations system.

I am become death.


Bayesian Testing of Conversion Rate

Hi all,

I’m on a train to the Fringe festival and I’ve managed to bag myself first class tickets! What this means is that I’ve got about 3 hours to kill, unlimited free wine (who knew that was thing!?) and a fairly flaky internet connection. All of that goes together to make now the perfect time to share with you a bit of work I’ve been up to recently.

Firstly, massive props to this particular book: Probabilistic Programming & Bayesian Methods for Hackers

It’s really really good and this testing is just a minor reworking of one of the examples in the book – they say imitation is the sincerest form of flattery no?

Anyway, enough of the wine-induced babbling, on with the Bayesian testing…

The setting of the scene

What I’ll be looking to do here is identify changes in conversion rate (can be any kind of conversion, I’ll use orders/visitors but it doesn’t really matter) in historical data. The reason I say historical data is thusly: this is a good technique for identifying a visualizing changes but it isn’t as good as running a legitimate A/B test. Ideally, we’d implement the change (whatever it is) for one group of customers and not for the rest of the customers and measure relative performance. However, let’s assume that, for whatever reason, you’re looking over a bunch of (conversion) data trying to identify a change.

I’ll artificially generate data so you can run similar examples and get an idea of what our data source looks like:


import random
import numpy as np

total_points = 50

## Generate a list containing the number of trials
trials = [random.randint(20,100) for _ in range(total_points)]
results = [np.random.binomial(value, 0.4) if total_point/2 else np.random.binomial(value, 0.3) for index, value in enumerate(trials)]

for trial, result in zip(trials, results):
    print "%dt%d" % (trial, result)

where in our example, trials is going to be the number of visitors on successive days and results is going to be the number of orders.

Pipe that into a file (python > conversion_data.txt ) and we’ve got ourself a nice list of conversion data with a change in the rate at some point (halfway in this example) through the data.

The Bayesian Bit

So the idea behind Bayesian statistics revolves around priors and posteriors – your prior is going to be a distribution that represents your (shocker) prior ideas about the result. You’re going to update this prior hypothesis with data as you get it and when you do so the resulting distribution is called the posterior. This is great for a number of reasons – my favourite two are:

  1. You get to set a prior that influences the final outcome. If I’ve got fairly strong ideas about what the conversion rate is (say I think it very likely lies between 80% and 90%) I can reflect that in my prior. When we start any frequentist (non-Bayesian) calculations we assume every probability between 0% and 100% is equally likely – generally that’s not the case.
  2. The final result is a distribution. Distributions are great for visualization, allow for easy comparisons against other distributions and are really easy to show uncertainty on. Do away with all the talk of p-values and show a graph with two ‘conversion distributions’ on and you’re on to a winner.

Shut up and Calculate

Code first – then commentary:


import pymc as pm
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import datetime

basket_list = []
conversion_list = []

with open('conversion_data.txt', 'rb') as f:
    for line in f:
        baskets, conversions = line.strip().split()

n_percent_list = len(basket_list)

uniform_one_samples = []
uniform_two_samples = []
tau_samples = []
uniform_one = pm.Uniform('uniform_one', 0, 1)
uniform_two = pm.Uniform('uniform_two', 0, 1)

tau = pm.DiscreteUniform('tau', lower=0, upper=n_percent_list)

def lambda_(tau = tau, uniform_one = uniform_one, uniform_two = uniform_two):
    out = np.zeros(n_percent_list)
    out[:tau] = uniform_one
    out[tau:] = uniform_two
    return out

observations = pm.Binomial('obs', n=basket_list, p=lambda_, value=conversion_list, observed=True)

model = pm.Model([observations, uniform_one, uniform_two, tau])
mcmc = pm.MCMC(model)
mcmc.sample(10000, 2500, 1)

uniform_one_samples = mcmc.trace('uniform_one')[:]
uniform_two_samples = mcmc.trace('uniform_two')[:]
tau_samples = mcmc.trace('tau')[:]

N = tau_samples.shape[0]

conversion_rate = np.zeros(n_percent_list)
for day in range(0, n_percent_list):
    ix = day < tau_samples
    conversion_rate[day] = (uniform_one_samples[ix].sum() + uniform_two_samples[~ix].sum()) / N


plt.plot(range(n_percent_list), 100.*conversion_rate, lw=4, color='#E24A33', label='Expected conversion rate')
plt.xlim([0, n_percent_list])
plt.title("Changes in the probability of conversion")
plt.ylabel('Expected conversion rate'), [100.0*convert/basket for convert,basket in zip(conversion_list, basket_list)], color='#348ABD', alpha=0.65, label='Observed conversion rate')
plt.legend(loc='upper right')

ax = plt.subplot(412)

plt.hist(uniform_one_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of initial conversion probability", color='#A60628', normed=True)
plt.legend(loc='upper left')

plt.xlabel("Probability of Conversion")
ax = plt.subplot(413)
plt.hist(uniform_two_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of later conversion probability", color='#7A68A6', normed=True)
plt.legend(loc='upper left')
plt.xlabel("Probability of Conversion")
w = 1.0/tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_percent_list, alpha=1, label="posterior of conversion change date", color='#467821', weights=w, rwidth=2.)
plt.legend(loc='upper left')
plt.xlim([0, n_percent_list])
plt.xlabel("Day of change")
plt.ylabel('P(change occurred)')

That’s a fair bit of code – what I’m doing isn’t that complicated. After importing, defining and grabbing the conversion data that we generated in the first script we declare uniform_1 and uniform_2.

We’re going to say that our data ultimately comes from a Binomial distribution where people convert with probability p. However, we are going to say that at some time (given in this example as tau) the value of p changed. p is the conversion of a Binomial distribution and so saying that p changes at some point means we’re saying our conversion rate changed.

Firstly, I declare my priors – I say that uniform_1 and uniform_two, or the conversion before and after the change we’re trying to identify, are both drawn from a uniform distribution that runs between 0 and 1. I’m saying I think the conversion is equally likely to everywhere between 0 and 100%. If you’re doing this properly I’m sure you can improve on this (I’d advise looking into the Beta distribution) but with sufficient data the uniform distribution should work fine.

So, at this point we’ve got the distributions from which our p-values are drawn described by uniform distributions – our first pair of priors.

Next, I declare tau – the time at which our conversion rate changed. I’m using a discrete uniform distribution and saying that it could have happened with equal probability at any time between the first day and the last day. If you’re looking to identify when the biggest change in conversion occurred I’d advise using a discrete uniform distribution. If you’re trying to identify the effect a change on a particular day had then you can be more creative with this prior.

Now we declare _lambda. For the eagle-eyes pseudo-coders amongst you, I’m sure it’s clear but this represents our complete prior belief about the conversion rate (p of a Binomial). We say we think the value of p follows uniform_1 up until tau and then switches to uniform_2.

Then we let pymc take over – we specify our distribution is a binomial with as many trials as we generated in the first script, and with the probabilities given in _lambda. We tell pymc that we obtained the values we’re passing to it and created our distribution. I’m not going to go in what wizardy follows but check out chapter 3 of the book I mentioned at the start if you’re interested. What we end up with (when we grab the traces) is 10,000 numbers drawn from posterior distributions we’ve generated from uniform_1, uniform_2 and tau.

The final calculation builds a vector of Booleans for each day of the data set specifying whether the day is less than each of the 10,000 samples drawn from the tau distribution, or greater than the tau. Then we go along that 10,000 element boolean vector and create a sum – if the day number was less than the tau we take whatever is in the uniform_1 list at that point, otherwise taking what is in the uniform_2 point. Finally dividing by the number of points we drew, we can get an idea of the expected conversion rate (expected value of _lambda).

I get that that section is a bit complicated but if you get it:
a.) I’ve not had enough wine
b.) You’ve got the idea behind the testing so far.

Finally, there’s a lot of Matplotlib plotting stuff all leading us towards the following beautiful picture:

Basic Bayesian Conversion Testing/Change Recognition

I hope that shows more clearly what I’ve been trying to explain. Firstly, the top graph shows our expected and our observed conversions. The bars show our ‘actual conversion rates’ on given days – these were the values we artificially generated. The red line is the output of our model discussed in the final point above. The second and third graph show the distributions of the conversion rate from before the change and after the change. If you’re looking to paint a compelling picture, I’d advise putting them both on the same axis, drawing them as densities (not histograms) and adding lines at the 95th percentiles. Finally, the bottom graph shows us when we change in conversion rate likely changed. As you can see, it’s fairly heavily concentrated around the middle (good) but there’s quite a wide range of days around there where it could be.

I’ve found that in practice, it makes sense to run these simulations multiple times (they don’t take very long) and concatenate the results together when working out when a change occurred. I’ll leave that as an exercise given how long this post already is – there’s  a picture of what it looks like at the bottom – the posteriors of the conversion probabilities get a bit messy.

So there we have it – it’s been a bit of a slog and the woman who brings the wine round the train has started laughing when she pours me a new glass (surely a troubling sign). However, we’re now able to run Bayesian conversion tests like a boss and visualize them in a really funky kind of way.

As always, comments are welcome, questions too.

The Lannisters send their regards.


Picture when run with multiple iterations…

Multiple Iteration Bayesian Conversion Plots

Mutliple Iterations of Bayesian Conversion Testing.

Drugs prescribed by the NHS

Hi All,

As always, apologies for the length of time between posts – think this is a record. I was working on stuff I’m not allowed to share (work stuff, then Kaggle stuff) and got minorly derailed by Game of Thrones. Finished all the TV series, and one of the books. And Life of Pi. And I’m working my way through the Book Thief. So yeah, a tad derailed.

However, I’ve got a lazy Sunday and I saw an advert on TV that really annoyed me. It was basically along the lines of ‘did your doctor mess up? Why not sue them?’ and I think that it takes us towards where the Americans are. Which, when it comes to healthcare, is not where we want to go. Given that in the UK, if you’re being treated by a doctor it’s almost certainly an NHS doctor it seems pretty sucky that people are being encouraged to sue them. Puts the price up for everybody and makes people less likely to become doctors and whatnot. Anyway, this isn’t a political rant blog – I thought I’d have a look and see what data is available on the NHS to see if I could show the effects of an increase in litigation on the standard on medical care provided/costs (and ideally, contrast with America). In short, I couldn’t. The data that I wanted just wasn’t there. However, there was data on the drugs that the GPs for the NHS prescribes (at least between Jan and June in 2012) by practice with cost data. That seemed pretty interesting and so here we are.

I’ve been playing around with Google’s coLaboratory (check it out here) and would have loved to use this to do this particular bit of analysis. However, after playing round with it for a bit and struggling with external documents, finding my Google docs and various libraries I wanted I’ve decided to leave it for a while until coLaboratory becomes a tad more mature. Lots of promise there and with good Google Analytics API integration we could transform analytics practices at my company. Certainly one to watch.

Anyway, without that I’ll think of the questions I want to answer first and then pick my tool. Firstly, the data…

The data

As ever, to the rescue – head here and download yourself a nice copy of the data:

Once you’ve got a copy of all the data and the list of practices in England (I used the most recent one) we’re ready to start asking some questions of the data…

The Analysis

First off, let’s pick something simple – which drug costs the NHS most in each of the months of our test set, and overall.

Total Drug Cost

This actually seems to lend itself to the mapreduce paradigm pretty nicely – the mapper seems pretty unnecessary and as I’ve not got a cluster to hand (and this doesn’t warrant me spinning one up with AWS) I’ll just write a quick reducer in Python and use the Unix sort. For what it’s worth, I think I might write something about spinning up a quick cluster on AWS in the near future. It’s a fairly useful skill to have and given the increasing reluctance of my computer to perform the most basic of tasks, I think a fair bit of my future data analysis might have to happen in the cloud. Anyway, this is what my command will look like once I’m done:

awk 'FNR>1{print}' T201202PDP IEXT.CSV | sort -t , -k5 | python

Nothing too complicated there. I’m ignoring the first line (the awk command), then sorting the whole file based on the 4th column (the drug id) and piping the whole thing into the following reducer:


import sys

current_drug = None
current_cost = 0.0

for line in sys.stdin:
    authority, trust, practice, drug_code, drug_name, number_bought, ni_cost, act_cost, period = line.strip().split(',')
    act_cost = float(act_cost.lstrip('0'))
    if drug_name == current_drug:
        current_cost += act_cost
        if current_drug:
            print current_drug + "t" + str(current_cost)
        current_drug = drug_name
        current_cost = act_cost

print current_drug + "t" + str(current_cost)

Nothing too difficult there – we’re just keeping track of the drug we’re on and adding up as we go. If I had no fear for the amount of RAM I had we could’ve accomplished the same thing without the laborious sorting step using associative arrays in awk. But for the next stage – the total across all 6 months, I am very afraid (RAM wise) and so we can run the same query with a bit of wildcarding:

awk 'FNR>1{print}' T20120[1-6]P*.CSV | sort -t , -k5 | python > drugs_by_spend.txt

This runs the same calculation over every file matching that wildcard pattern (all the data between 2012/01 and 2012/06. Note that this’ll take a little time – that sort is reasonably expensive over the 4 or so gigabytes of data we’ve got. Now we’ll pull together a few graphs and for this I think we’ll use R…

my_frame <- data.frame(read.csv('drugs_by_spend.txt', header=F, sep="t", colnames=c("Drug", "Spend")))
my_frame <- my_frame[order(-my_frame$Spend),]

The top 5 drugs, by spend, in the first half of 2012 were:

  1. Fluticasone Propionate (Inh)
  2. Atorvastatin
  3. Enteral Nutrition
  4. Pregabalin
  5. Budesonide

Between them these cost: £738,620,789

Wow – that’s a hell of a lot. In 6 months, the actual costs of these drugs alone was more than £700million!

The total cost of drugs prescribed in that time period: £2,785,540,256

So I think we can surmise that lots of money is spent by the NHS – OK, I suppose that’s no surprise. For the non-doctors amongst us (that includes me) that list features two anti-asthmatic treatments i.e. those inhalers that I’m sure a lot of you have (also includes me). A quick Wikipedia shows that Pfizer holds the patent to at least a couple of those drugs (or at least did, Atorvastatin has expired) – it might be interesting to stick the patent holder next to these drugs. Maybe later…

Right, there’s lots that we could do here but I’m going to call it a day for now. In the future I think I’ll try and get more months of data and then start to look at evolving trends. To do that, I’ll use an AWS cluster and so will write something and using that.

Until then.

Kaggle Digit Recognizer Problem with Adaboost

Hi everybody,

Hi Dr Nick. But enough of that – today I’m going to be working through a Kaggle problem. For those of you who don’t know Kaggle, I can’t advise in favour of it strongly enough. It’s a great place to have a go at using real data sets to apply various machine learning techniques. There’s a leaderboard, discussions on methods and some non-too shabby prizes. I’ll come clean at this point – I’m not a natural salesperson.

I believe there’s something of a taboo against posting solutions/methods for Kaggle – however, I think I’m good to write about a method of solving this particular problem. The digit recognizer problem seems to be a rolling competition with a bunch of already published results and a few training classes on how to solve it. Let me know if you think this is overstepping the mark.

So, the problem:
Given a big set (42,000) of labelled training data (28 x 28 black and white images) of handwritten images (0-9) are we able to correctly identify other (identically dimensioned) handwritten digits.

There are a whole bunch of ways of doing this and the method I’ve had best success with is Support Vector Machines (using LibSVM). I may post an example of how to run that for this particular example but today I’d like to look at Adaboost (using the MultiBoost package)…

Until fairly recently I was entirely ignorant of Adaboost – I came across it on a different Kaggle problem (the Higgs one). There, a number of ‘out of the box’ methods were showcased – the most successful of which was Adaboost. A bit of reading on Adaboost suggests that it’s a fairly well-regarded, and successful method of performing a range of machine learning tasks. It’s also sometimes cited as being the best ‘out of the box’ (not specifically designed for the task at hand) algorithm in machine learning.

My current intuition on Adaboost is that it’s basically a ‘rule of thumb’ algorithm. It takes a lot of very simple decision boundaries and uses them to create a more complicated decision space. I say rule of thumb because I imagine a car mechanic or a doctor trying to diagnose a fault. The patient presents with symptom x, that makes a whole bunch of things less likely. However, if the patient falls into this age bucket and this ethnic group, some of the previously discounted things become more likely. I don’t know if that sort of explanation helps you but I quite like it. Basically, you create a simple rule that’s more often right than wrong. However, you can then update it with as many exceptions as you’ve got other bits of data. I think that’s a lot how the human decision-making process goes.

Anyway, all this talking isn’t getting us closer to a juicy set of predictions. Mad props to whoever first generated this particular procedure for the Higgs problem – I’ve shamelessly ripped it off, only making changes where necessary for this problem.


import random
import csv
import subprocess
import numpy as np

def DataToArff(dataset, labels, header, title, filename):
    With this data structure we're able to turn an arbitrary string of data into a .arff file
    These files allow us to import the data into Multiboost or Weka (amongst other machine learning libraries
    with open(filename + ".arff", 'w') as f:
        f.write('@RELATION ' + title + 'nn')
        for feature in header:
            f.write('@ATTRIBUTE ' + feature + ' NUMERICn')
        f.write('@ATTRIBUTE class {0,1,2,3,4,5,6,7,8,9}n')
        ## We could do this using all_data - however, we need the labels for further work
        ## Additionally, if the labels were numeric variables we'd be able to leave the rest of our work unchanged and handle them here
        for datarow, label in zip(dataset, labels):
            for value in datarow:
                f.write(str(value) + ',')
            f.write(str(label) + 'n')

all_data = list(csv.reader(open('train.csv', 'rb'), delimiter=','))
header = np.array(all_data[0][1:])
dataset = np.array([map(float, row[1:]) for row in all_data[1:]])
(numpoints, numfeatures) = dataset.shape

# Labels on the first column of the line
labels = np.array([row[0] for row in all_data[1:]])

randomPermutation = random.sample(range(len(dataset)), len(dataset))
## If this breaks halfway through, we'll be glad to be able to load our random permutation
np.savetxt('randomPermutation.csv', randomPermutation, fmt='%d', delimiter=',')

## I'll change the proportion of the train set and see how we get on.
numpointsTrain = int(numpoints*0.75)
numpointsValidatin = numpoints - numpointsTrain

## Because we've got a random permutation there's no problem taking slices of the total set to sort into train and validation
datasetTrain = dataset[randomPermutation[:numpointsTrain]]
datasetValidation = dataset[randomPermutation[numpointsTrain:]]

labelsTrain = labels[randomPermutation[:numpointsTrain]]
labelsValidation = labels[randomPermutation[numpointsTrain:]]

DataToArff(datasetTrain, labelsTrain, header, 'DigitsML_challenge_train', 'training')
DataToArff(datasetValidation, labelsValidation, header, 'DigitsML_challenge_validation', 'validation')

## Our Adaboost parameters are wholly contained in the relevant config files
p1 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'config.txt'])
p2 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresValidation.txt'])

testText = list(csv.reader(open('test.csv', 'rb'), delimiter=','))
datasetTest = np.array([map(float, row[:]) for row in testText[1:]])
labelsTest = np.repeat('0', len(testText) - 1)

DataToArff(datasetTest, labelsTest, header, 'Digit_challenge_test', 'test')

p3 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresTest.txt'])

testScoresText = list(csv.reader(open('scoresTest.txt', 'rb'), delimiter=','))
with open('submission.csv', 'w') as f:
    for index, entry in enumerate(testScoresText):
        ## Take the index of the maximum value for a given row - this is the most likely value
        f.write(str(index + 1) + "," + str(np.argmax(entry)) + 'n')

I’d like to think that that bit of code is reasonably transparent and clear on what it’s doing. If I’m wrong, a basic explanation:
1.) Randomly split the data into a training and validation set
2.) Create .arff files for both of these sets
3.) Run Multiboost (our Adaboost implementation) on the training set and validation set
4.) Using the files created from our train/validation Adaboost, get the test set and generate predictions (again using Multiboost)
5.) Generate a submission file in the required format

Easy does it. Now for the configuration files that I’m using:


fileformat arff
verbose 2
learnertype TreeLearner
seed 50
weightpolicy balanced
baselearnertype SingleStumpLearner 8
outputinfo results.dta e01w01auc
traintest training.arff validation.arff 5000
shypname shyp.xml


posteriors validation.arff shyp.xml scoresValidation.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8


posteriors test.arff shyp.xml scoresTest.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

I’ll be honest and say I’ve not really found the ideal set-up for this problem here. I’m able to get a score of around 0.965 using the ones here but if you look at the leader board you’ll see that’s not all that good. Certainly the LibSVM method performed much better (something like 0.99). Not to worry, it’s doing the right thing, generating good results and is another tool in our arsenal.

The World Cup may stymie my blog posts for a bit – then again, supporting England, it might only be 3 games.

Football’s coming home.

« Older posts

© 2024 DogDogFish

Theme by Anders NorenUp ↑