Data Science, amongst other things.

Tag: machine learning

Finding Topics in Harry Potter using K-Means Clustering.

I’ll open up with the money-shot – these are all of the clusters that I was able to find using the whole Harry Potter series and grouping by chapter:


Every cluster plotted separately.

That’s far too messy to be of any practical use so let’s have a look at a couple of those clusters in more detail:

Privet Drive Cluster

One of the clusters – a Dursley/Privet Drive heavy cluster!



This is a pretty Griphook/Goblin heavy cluster, centred on their storyline in book 7.

Hopefully that’s piqued your interest enough to continue on scrolling and see how we got these clusters – and see the words that tie them all together! The code for generating these is on my Github, and all the graphs and documents are contained there.

I went to a really interesting talk at PyData that was about Latent Dirichlet Allocation, a topic entirely new to me. I thought I’d love to apply it to my favourite book series – Harry Potter. However, that didn’t happen…instead you get this. A heavy rip off of an excellent post that walks through how to cluster documents using a bunch of techniques including K-Means.

Step 1

Get plain text copies of all the Harry Potter books and make sure they’re all formatted in roughly the same way. As is often the case, this step took bloody ages.

Step 2

I want a few different documents – more than 7 (the number of Harry Potter books for the heathens out there) but substantially fewer than the number of sentences in all 7 books. Treating each chapter as a separate document seemed to make most sense and so here I initialise everything I need and split my books into chapters.
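The real splitting code lives in the repository, so what follows is only a minimal sketch of the idea. It assumes each chapter opens with a heading line such as `CHAPTER FOUR`, with the title on the next non-empty line; that layout, the `split_into_chapters` name and the placeholder text are assumptions for illustration, not the format of the files used here.

```python
import re

# Assumed layout: a heading line like "CHAPTER FOUR", then the title, then the body.
CHAPTER_HEADING = re.compile(r"^CHAPTER[ \t]+\w+.*$", re.MULTILINE)


def split_into_chapters(book_text):
    """Return (titles, texts): two parallel lists with one entry per chapter."""
    titles, texts = [], []
    # split() drops the heading lines and keeps the text between them;
    # element 0 is whatever came before the first heading (front matter).
    for chunk in CHAPTER_HEADING.split(book_text)[1:]:
        lines = [line.strip() for line in chunk.strip().splitlines()]
        lines = [line for line in lines if line]
        if not lines:
            continue
        titles.append(lines[0].lower())
        texts.append(" ".join(lines[1:]))
    return titles, texts


sample = """Some front matter
CHAPTER FOUR
The Keeper of the Keys

Placeholder text for the first chapter.
It runs over two lines.

CHAPTER FIVE
Diagon Alley

Placeholder text for the second chapter.
"""

titles, texts = split_into_chapters(sample)
print(titles)
```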

So at this point we’ve nicely got ourselves a list of chapter titles and a list of the associated text (as a string).

Step 3

Now we’ve got our chapters we’re going to want to tokenise the text in them. Basically this means converting a string into discrete tokens – what we’d think of as words. The reason it’s got a fancy name and I’m being a bit careful about my terminology is because tokenisation also takes care of things like punctuation and isn’t as simple as just splitting a string into separate words. Having said that, I’ve basically taken the path of least resistance and so have gone with a very simple tokenisation scheme. I’m also going to skip over the pain of utf-8 encoding/decoding/recoding. I’ve basically just dropped any character that I’ve found in the least bit complicated.
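As a rough illustration of how blunt that scheme is, here is a sketch of a tokeniser in the same spirit: lower-case everything, throw away any character that isn't a plain letter or whitespace, and split on whitespace. The `tokenise` name and the exact rules are assumptions rather than the code used for the results below.

```python
import re


def tokenise(text):
    """Lower-case, keep only a-z and whitespace, then split on whitespace."""
    # The encode/decode round trip discards non-ASCII characters
    # (curly quotes, accented letters and so on).
    ascii_text = text.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z\s]", "", ascii_text.lower())
    return cleaned.split()


tokens = tokenise("“I’m not going,” said Harry. Don't!")
print(tokens)
```

Dropping apostrophes like this would be consistent with tokens such as `dont` and `harrys` turning up in the cluster output further down.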

Step 4

Next I’m going to perform TF-IDF on my chapters. Here, I convert each token into a number and look at how many times that token appears in a given document, and how many documents that token appears in overall. So in this instance I’m looking to see how often a given word appears in a particular chapter and in how many chapters, across all 7 books, it appears. This gives us an idea of how important/prominent a word is in a given chapter, taking into account how common the word is throughout all the chapters. As an example, ‘Harry’ is likely to feature a lot in a given chapter but is also likely to feature in every chapter, and so probably isn’t especially important to any given chapter’s classification. ‘Nicolas Flamel’, however, is going to appear a reasonable amount in a few chapters but not at all in the rest. We therefore know that Nicolas Flamel is important in the chapters he does appear in.
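To make the arithmetic concrete, here is a hand-rolled sketch of one common TF-IDF variant: term frequency multiplied by the log of the inverse document frequency. The toy chapters are invented and the `tf_idf` function is hypothetical. sklearn's implementation smooths the IDF term and normalises each row, so its numbers will differ, but the intuition is the same.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {token: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in documents for token in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            # (share of this document) * log(total documents / documents containing token)
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores


chapters = [
    ["harry", "flamel", "flamel", "stone"],
    ["harry", "quidditch", "snitch"],
    ["harry", "hagrid", "dragon"],
]
scores = tf_idf(chapters)
print(scores[0])
```

In this toy set ‘harry’ appears in every chapter, so its inverse document frequency is log(1) and its score is zero, while ‘flamel’ scores highest in the one chapter that mentions it.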

Luckily, I don’t have to worry about the implementation of TF-IDF – sklearn has got it.

This gives me a large sparse matrix with the TF-IDF score for each word in each document as the entries. Pay close attention to the parameters I’ve passed along to TfidfVectorizer, as they are ripe for the changing. Firstly, `max_df=0.75` is saying that I don’t care about words that appear in more than 75% of the chapters. `min_df=0.05` is saying that I don’t care about words that appear in fewer than 5% of chapters. You can see I passed along my tokeniser and that I’m using English stop-words (that is, I’m removing the most common English words). Finally, I’m generating n-grams between 1 and 4 words long.
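For a feel of what `max_df` and `min_df` do, here is a small sketch that applies the same kind of filter by hand to invented token lists. The function name is hypothetical, and the thresholds in the example are chosen to suit four tiny documents rather than the values above.

```python
from collections import Counter


def filter_by_document_frequency(documents, min_df, max_df):
    """Keep tokens whose share of documents lies between min_df and max_df."""
    n_docs = len(documents)
    # Count each token once per document, however often it occurs there.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {
        token
        for token, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    }


toy_chapters = [
    ["harry", "flamel"],
    ["harry", "hagrid"],
    ["harry", "hagrid"],
    ["harry", "snape"],
]
vocabulary = filter_by_document_frequency(toy_chapters, min_df=0.3, max_df=0.75)
print(vocabulary)
```

Here ‘harry’ is dropped for being everywhere, ‘flamel’ and ‘snape’ for being too rare, and only ‘hagrid’ survives.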

For those uninitiated with n-grams they’re basically a way of splitting text up into handy little chunks. As an example, 3-grams of the following sentence:

“This is not the greatest song in the world”

would be:

“This is not”, “is not the”, “not the greatest”, “the greatest song”, etc.

This allows me to pick out common phrases such as “Snape said” and “wizarding world”. Again, that’s a setting that is begging to be played about with.
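Generating n-grams from a list of tokens takes only a few lines. The sketch below is a plain-Python illustration (the function names are made up for this example); the vectoriser does something equivalent internally when given an n-gram range.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_in_range(tokens, smallest, largest):
    """All n-grams for smallest <= n <= largest, e.g. 1 to 4 as used here."""
    return [gram for n in range(smallest, largest + 1) for gram in ngrams(tokens, n)]


words = "This is not the greatest song in the world".split()
trigrams = ngrams(words, 3)
print(trigrams)
```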

Step 5

Performing K-means clustering we get output like so:

Cluster 18 words:
Top words: maxime,madame maxime,karkaroff,madame,hagrid,cedric,moody,krum,champions,tournament

Chapter: the hungarian horntail, Book: gobletOfFire
Chapter: the goblet of fire, Book: gobletOfFire
Chapter: the four champions, Book: gobletOfFire
Chapter: beauxbatons and durmstrang, Book: gobletOfFire
Chapter: the beginning, Book: gobletOfFire
Chapter: the yule ball, Book: gobletOfFire

which is obviously quite a handy little cluster. It’s successfully managed to only take chapters from one book (given that we’ve not allowed K-means access to the book information that is a bit of a triumph). For those familiar with Harry Potter, you’ll see this is a Triwizard Tournament heavy cluster. Another example:

Cluster 10 words:
Top words: wormtail,cold voice,voldemort,lord,cauldron,riddle,cedric,man,master,faithful

Chapter: the riddle house, Book: gobletOfFire
Chapter: flesh, blood, and bone, Book: gobletOfFire
Chapter: the death eaters, Book: gobletOfFire

Again, all the chapters are from one book – I’m counting that as a result. Added bonus, they’re not sequential chapters! I happen to know (by being a massive Harry Potter geek) that these chapters are towards the start and end of book 4 and focus heavily on Voldemort and Peter Pettigrew.
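For anyone curious about what the K-means step is doing under the hood, here is a bare-bones sketch of the standard algorithm (Lloyd's iterations) on invented two-dimensional points. It is not the code behind the clusters above: the real run works on the TF-IDF matrix, and seeding the centroids with the first k points is a simplification so that the example is repeatable.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Lloyd's algorithm. Returns (assignments, centroids)."""
    # Simplification: start from the first k points so the run is repeatable.
    # Library implementations pick their starting centroids more carefully.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Toy "chapters" described by two made-up term weights: (tournament, umbridge)
points = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.0), (0.0, 0.8), (1.0, 0.2), (0.2, 1.0)]
assignments, centroids = k_means(points, k=2)
print(assignments)
```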

So far so good and in fact I could stop here but wouldn’t it be nice to visualise those clusters so we can see the topics we’ve picked out graphically? Even if you said no then it doesn’t really matter. I’m still going to do it.

Step 6

First things first, let’s plot all of the chapters on one graph and colour code them with the book from which they came. It’s not a great way of visualising clusters but it is a great way of seeing how everything is laid out:


A messy picture of all the chapters projected into 2d space.

Again, this section is shamelessly copied from the aforementioned blog but ultimately we’re projecting the cosine distances between the chapters’ tf-idf vectors into 2-dimensional space. I declare a colour dictionary for each of the books and then rattle through the chapters plotting them.

I’m sure you’ll agree that’s far too messy for anybody to really do anything with. If you’re following along with the code you’ll see that next I generate the subplot figure shown at the top.
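On the cosine distances mentioned above: two chapters are close when their TF-IDF vectors point in a similar direction, regardless of chapter length. Below is a minimal sketch of that distance on invented vectors, with hypothetical function names. The projection into two dimensions is a separate step and isn't shown.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for no shared terms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for an all-zero vector
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(rows):
    """Pairwise cosine distances between every pair of rows."""
    return [[cosine_distance(a, b) for b in rows] for a in rows]


tfidf_rows = [
    [0.5, 0.0, 0.2],  # chapter A
    [1.0, 0.0, 0.4],  # chapter B: same mix of terms as A, just "longer"
    [0.0, 0.7, 0.0],  # chapter C: shares no terms with A or B
]
distances = distance_matrix(tfidf_rows)
print(distances[0])
```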

Finally, I create plots for each of the clusters – a few examples of which are:

Cluster 3 words:
Top words: umbridge,professor,dont,professor umbridge,snape,sirius,im,said hermione,harrys,youre

Chapter: the muggle born registration commission, Book: deathlyHallows
Chapter: the hogwarts high inquisitor, Book: orderOfThePhoenix
Chapter: o.w.l.s, Book: orderOfThePhoenix
Chapter: the second war begins, Book: orderOfThePhoenix
Chapter: educational decree number twenty four, Book: orderOfThePhoenix
Chapter: the centaur and the sneak, Book: orderOfThePhoenix
Chapter: percy and padfoot, Book: orderOfThePhoenix
Chapter: detention with dolores, Book: orderOfThePhoenix
Chapter: occlumency, Book: orderOfThePhoenix
Chapter: in the hogs head, Book: orderOfThePhoenix
Chapter: out of the fire, Book: orderOfThePhoenix
Chapter: professor umbridge, Book: orderOfThePhoenix
Chapter: fight and flight, Book: orderOfThePhoenix
Chapter: seen and unforeseen, Book: orderOfThePhoenix
Chapter: career advice, Book: orderOfThePhoenix
Chapter: snapes worst memory, Book: orderOfThePhoenix


Umbridge’s cluster

Everybody’s favourite professor – Dolores Umbridge!

And another:

Top words: hagrid,yeh,ter,said hagrid,professor,said hermione,malfoy,o,professor trelawney,trelawney

Chapter: professor trelawney's prediction, Book: prisonerOfAzkaban
Chapter: talons and tea leaves, Book: prisonerOfAzkaban
Chapter: the firebolt, Book: prisonerOfAzkaban
Chapter: diagon alley, Book: philosophersStone
Chapter: the foribidden forest, Book: philosophersStone
Chapter: the keeper of the keys, Book: philosophersStone
Chapter: norbert the norwegian ridgeback, Book: philosophersStone
Chapter: hagrids tale, Book: orderOfThePhoenix
Chapter: the eye of the snake, Book: orderOfThePhoenix
Chapter: grawp, Book: orderOfThePhoenix
Chapter: hermione's helping hand, Book: halfBloodPrince
Chapter: rita skeeter's scoop, Book: gobletOfFire

and visually:

Grawp's brother

Keeper of the Keys

The anti-Umbridge – it’s Hagrid’s cluster!

I’ll stop with the random copy/pasting of the clusters and stick them all on my Github – I think you’ve got the idea! All in all I’m pretty happy with how this has worked but I am very dependent on individual character names. I tried just looking at 2-grams but it usually just gave me “[name] said”, with a few exceptions (‘wizarding world’, ‘said softly’, ‘death eater’, ‘godrics hollow’). I’ve also put almost zero effort into formatting the images – you know roughly what it’s meant to look like: having the pictures look good is an exercise I’m leaving to the reader’s imagination.

There’s loads more stuff I could do but I’m going to eat a chicken and go swimming so it’ll have to wait.

All was well.

Kaggle Digit Recognizer Problem with Adaboost

Hi everybody,

Hi Dr Nick. But enough of that – today I’m going to be working through a Kaggle problem. For those of you who don’t know Kaggle, I can’t recommend it strongly enough. It’s a great place to have a go at using real data sets to apply various machine learning techniques. There’s a leaderboard, discussions on methods and some none-too-shabby prizes. I’ll come clean at this point – I’m not a natural salesperson.

I believe there’s something of a taboo against posting solutions/methods for Kaggle – however, I think I’m good to write about a method of solving this particular problem. The digit recognizer problem seems to be a rolling competition with a bunch of already published results and a few training classes on how to solve it. Let me know if you think this is overstepping the mark.

So, the problem:
Given a big set (42,000) of labelled training data (28 x 28 black and white images of handwritten digits, 0-9), are we able to correctly identify other (identically dimensioned) handwritten digits?

There are a whole bunch of ways of doing this and the method I’ve had best success with is Support Vector Machines (using LibSVM). I may post an example of how to run that for this particular example but today I’d like to look at Adaboost (using the MultiBoost package)…

Until fairly recently I was entirely ignorant of Adaboost – I came across it on a different Kaggle problem (the Higgs one). There, a number of ‘out of the box’ methods were showcased – the most successful of which was Adaboost. A bit of reading on Adaboost suggests that it’s a fairly well-regarded, and successful method of performing a range of machine learning tasks. It’s also sometimes cited as being the best ‘out of the box’ (not specifically designed for the task at hand) algorithm in machine learning.

My current intuition on Adaboost is that it’s basically a ‘rule of thumb’ algorithm. It takes a lot of very simple decision boundaries and uses them to create a more complicated decision space. I say rule of thumb because I imagine a car mechanic or a doctor trying to diagnose a fault. The patient presents with symptom x, that makes a whole bunch of things less likely. However, if the patient falls into this age bucket and this ethnic group, some of the previously discounted things become more likely. I don’t know if that sort of explanation helps you but I quite like it. Basically, you create a simple rule that’s more often right than wrong. However, you can then update it with as many exceptions as you’ve got other bits of data. I think that’s a lot how the human decision-making process goes.
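To make that intuition a bit more concrete, here is a toy sketch of AdaBoost with one-threshold rules (decision stumps) on a one-dimensional, two-class problem. It is an illustration under heavy simplifications, not what MultiBoost does for the ten-class digit problem, and all names are invented for the example.

```python
import math


def best_stump(xs, ys, weights):
    """Find the threshold/polarity with the lowest weighted error.
    A stump predicts `polarity` when x > threshold, otherwise -polarity."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            predictions = [polarity if x > threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, predictions, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """ys are +1/-1. Returns a list of (alpha, threshold, polarity) rules."""
    weights = [1.0 / len(xs)] * len(xs)
    rules = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # better rules get a bigger say
        rules.append((alpha, threshold, polarity))
        # Up-weight the examples this rule got wrong, then renormalise.
        predictions = [polarity if x > threshold else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, predictions)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return rules


def predict(rules, x):
    """Weighted vote of all the rules."""
    score = sum(alpha * (polarity if x > threshold else -polarity)
                for alpha, threshold, polarity in rules)
    return 1 if score > 0 else -1


# No single threshold separates these labels; three stumps voting together do.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
rules = adaboost(xs, ys, rounds=3)
print([predict(rules, x) for x in xs])
```

Each round is one more ‘exception’ in the rule-of-thumb sense: the examples the previous rules got wrong carry more weight, so the next rule is chosen mainly to fix them.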

Anyway, all this talking isn’t getting us closer to a juicy set of predictions. Mad props to whoever first generated this particular procedure for the Higgs problem – I’ve shamelessly ripped it off, only making changes where necessary for this problem.


import csv
import random
import subprocess

import numpy as np

def DataToArff(dataset, labels, header, title, filename):
    """
    With this data structure we're able to turn an arbitrary string of data into a .arff file
    These files allow us to import the data into Multiboost or Weka (amongst other machine learning libraries)
    """
    with open(filename + ".arff", 'w') as f:
        f.write('@RELATION ' + title + '\n\n')
        for feature in header:
            f.write('@ATTRIBUTE ' + feature + ' NUMERIC\n')
        f.write('@ATTRIBUTE class {0,1,2,3,4,5,6,7,8,9}\n')
        ## The ARFF format needs an @DATA marker between the header and the rows
        f.write('\n@DATA\n')
        ## We could do this using all_data - however, we need the labels for further work
        ## Additionally, if the labels were numeric variables we'd be able to leave the rest of our work unchanged and handle them here
        for datarow, label in zip(dataset, labels):
            for value in datarow:
                f.write(str(value) + ',')
            f.write(str(label) + '\n')

with open('train.csv', newline='') as f:
    all_data = list(csv.reader(f, delimiter=','))
header = np.array(all_data[0][1:])
dataset = np.array([[float(value) for value in row[1:]] for row in all_data[1:]])
(numpoints, numfeatures) = dataset.shape

# Labels on the first column of the line
labels = np.array([row[0] for row in all_data[1:]])

randomPermutation = random.sample(range(len(dataset)), len(dataset))
## If this breaks halfway through, we'll be glad to be able to load our random permutation
np.savetxt('randomPermutation.csv', randomPermutation, fmt='%d', delimiter=',')

## I'll change the proportion of the train set and see how we get on.
numpointsTrain = int(numpoints*0.75)
numpointsValidation = numpoints - numpointsTrain

## Because we've got a random permutation there's no problem taking slices of the total set to sort into train and validation
datasetTrain = dataset[randomPermutation[:numpointsTrain]]
datasetValidation = dataset[randomPermutation[numpointsTrain:]]

labelsTrain = labels[randomPermutation[:numpointsTrain]]
labelsValidation = labels[randomPermutation[numpointsTrain:]]

DataToArff(datasetTrain, labelsTrain, header, 'DigitsML_challenge_train', 'training')
DataToArff(datasetValidation, labelsValidation, header, 'DigitsML_challenge_validation', 'validation')

## Our Adaboost parameters are wholly contained in the relevant config files
## subprocess.call waits for each run to finish, so shyp.xml exists before anything tries to use it
subprocess.call(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'config.txt'])
subprocess.call(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresValidation.txt'])

with open('test.csv', newline='') as f:
    testText = list(csv.reader(f, delimiter=','))
datasetTest = np.array([[float(value) for value in row] for row in testText[1:]])
## The test set has no labels, so we write a dummy class for every row
labelsTest = np.repeat('0', len(testText) - 1)

DataToArff(datasetTest, labelsTest, header, 'Digit_challenge_test', 'test')

subprocess.call(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresTest.txt'])

with open('scoresTest.txt', newline='') as f:
    testScoresText = list(csv.reader(f, delimiter=','))
with open('submission.csv', 'w') as f:
    for index, entry in enumerate(testScoresText):
        ## Take the index of the maximum value for a given row - this is the most likely value
        ## Convert to floats first so the scores are compared as numbers rather than as strings
        scores = [float(score) for score in entry]
        f.write(str(index + 1) + "," + str(np.argmax(scores)) + '\n')

I’d like to think that that bit of code is reasonably transparent and clear on what it’s doing. If I’m wrong, a basic explanation:
1.) Randomly split the data into a training and validation set
2.) Create .arff files for both of these sets
3.) Run Multiboost (our Adaboost implementation) on the training set and validation set
4.) Using the files created from our train/validation Adaboost, get the test set and generate predictions (again using Multiboost)
5.) Generate a submission file in the required format

Easy does it. Now for the configuration files that I’m using:


config.txt:
fileformat arff
verbose 2
learnertype TreeLearner
seed 50
weightpolicy balanced
baselearnertype SingleStumpLearner 8
outputinfo results.dta e01w01auc
traintest training.arff validation.arff 5000
shypname shyp.xml


configScoresValidation.txt:
posteriors validation.arff shyp.xml scoresValidation.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8


configScoresTest.txt:
posteriors test.arff shyp.xml scoresTest.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

I’ll be honest and say I’ve not really found the ideal set-up for this problem here. I’m able to get a score of around 0.965 using the settings above but if you look at the leaderboard you’ll see that’s not all that good. Certainly the LibSVM method performed much better (something like 0.99). Not to worry, it’s doing the right thing, generating good results and is another tool in our arsenal.

The World Cup may stymie my blog posts for a bit – then again, supporting England, it might only be 3 games.

Football’s coming home.

UK Housing Data – Data Munging for Machine Learning

Hey all,

I’ve been tentatively threatening to write this post for a while now and I’ve been itching to do a bit of machine learning – I’m going to be walking through the steps required to run linear regression and an SVM on our housing sale data to try to predict future house sale prices. I’m not overly confident that it’s going to give us a great deal of predictive power – any example I see online uses useful things like area of house or number of bedrooms. All I’ve got is when you sold it, where you sold it and the type of house (new or old, detached or terraced, freehold or leasehold, etc.). Not to worry – it’ll be a blast all the same.

First things first, which machine learning technique should we use? We’ll be using a supervised algorithm as we have a labelled training set (we can tell our classifier what the right answer is). We’re looking at a regression problem, not a classification problem (we’re trying to predict a continuous variable, not a discrete one). All in all, that’s screaming linear regression to me. As an additional bonus though, I also happen to know that SVMs work very well for this kind of problem and have used Libsvm for similar things in the past.

In terms of the bits of kit I’ll be using – let’s start off, as I always seem to, in bash.

cut -d ',' -f3- pp-all.csv | cut -d ',' -f1-13 | tr -d '"' > pp_all_for_weka.txt 

Just a bit of formatting but, if you’ve followed the posts through, then you’ll have pp-all.csv on your computer. If you’ve not, see here for how to get it.

Now we’re here, you may notice the strange filename – Weka is a machine learning library for Java. We’ve been using it a bit at work recently and I fancied getting a bit more experience of it. I’ll be using Weka for the linear regression; I’ve not got the heart to do a Weka installation post just yet (it’s not difficult, I’m just tired) but will do one if there’s any demand. I’m going to be using LibSVM for my support vector machine calculations and again, not going to talk you through the install unless you fancy it. The reason I’ve told you about that is because I’m now going to convert my CSV file into an ARFF file and a libsvm formatted file. While I’m at it, I’m going to convert all of my values (postcodes, dates, etc.) into numbers. Doing this allows me to feed this data set into the above programs and get an answer out relatively easily.

Could we write the algorithms ourselves? Sure – but not nearly as well as they’ve already been written. Sidebar: if you’re interested in understanding how all these algorithms work I’d encourage you to check out Andrew Ng’s lecture series on Coursera. It’s excellent.

Anyway, to run the conversion and to output the two different files I wrote the following Python script:


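class mapping_dictionary(object):

    def __init__(self, output_file, svm_file):
        self.mapping = {}
        self.mapping_count = {}
        self.column_names = ['date', 'postcode', 'f_map', 'n_map', 'l_map', 'addy', 'addy1', 'addy2', 'addy3', 'addy4', 'addy5', 'addy6', 'price']
        self.writer_file = open(output_file, 'w')
        self.svm_file = open(svm_file, 'w')

    def shut_file(self):
        # Body missing from the listing; closing both output files is the assumed intent
        self.writer_file.close()
        self.svm_file.close()

    def add_to_dictionary(self, column, key):
        # Give every new value in a column the next unused integer for that column
        if column not in self.mapping:
            self.mapping[column] = {}
            self.mapping_count[column] = 0
        if key not in self.mapping[column]:
            self.mapping[column][key] = self.mapping_count[column]
            self.mapping_count[column] += 1

    def interpret_list(self, listy):
        if len(listy) != len(self.column_names):
            print("Error - unexpected number of columns in the line: %d" % len(listy))
            return False
        for index, value in enumerate(listy):
            if index == 12:
                value = int(value.strip().strip('\n'))
            self.add_to_dictionary(self.column_names[index], value)
        return True

    def write_to_file(self, listy):
        string_to_write = ','.join([str(self.mapping[self.column_names[index]][entry]) for index, entry in enumerate(listy) if index != 12])
        string_to_write += ",%s\n" % str(listy[-1])
        # The write call is missing from the listing; assumed to go to the Weka output file
        self.writer_file.write(string_to_write)

    def write_libsvm_file(self, listy):
        string_to_write = ' '.join([str(index + 1) + ":" + str(self.mapping[self.column_names[index]][entry]) for index, entry in enumerate(listy) if index != 12])
        string_to_write = str(listy[-1]) + " " + string_to_write + "\n"
        # As above: assumed to go to the libsvm output file
        self.svm_file.write(string_to_write)

mapping = mapping_dictionary('nice_weka_output.txt', 'nice_libsvm_output.txt')
# Filename changed to match the output of the cut command above
with open('pp_all_for_weka.txt') as f:
    while True:
        # The listing is cut off after "line ="; everything from here down is a
        # reconstruction of the apparent intent, not the original code
        line = f.readline()
        if not line:
            break
        fields = line.strip().split(',')
        if mapping.interpret_list(fields):
            mapping.write_to_file(fields)
            mapping.write_libsvm_file(fields)
mapping.shut_file()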

A reasonable amount of Python happening there – if you were worrying about it, I really wouldn’t. All we’re doing is replacing every field (apart from price) with an integer and outputting that in two different formats as we work our way through the file. If I was being thorough I’d remove the hardcoded list at the top, require a header row, take the filenames as command line arguments and then it’d work as a general tool for formatting CSVs as libsvm and arff files. Actually, that doesn’t sound like a bad idea at all.
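Stripped of the file handling, the idea is small enough to sketch in a few lines. The rows below are invented and the function names are hypothetical; the sketch just shows the integer mapping and the two line formats side by side.

```python
def encode_rows(rows, price_index):
    """Replace every non-price field with an integer id, column by column."""
    mappings = {}  # column index -> {original value: integer id}
    encoded = []
    for row in rows:
        features = []
        for index, value in enumerate(row):
            if index == price_index:
                continue
            column = mappings.setdefault(index, {})
            # A value seen for the first time gets the next unused id in its column.
            features.append(column.setdefault(value, len(column)))
        encoded.append((features, int(row[price_index])))
    return encoded


def csv_line(features, price):
    """Comma-separated features with the price last."""
    return ",".join(str(f) for f in features) + "," + str(price)


def libsvm_line(features, price):
    """libsvm style: '<label> <index>:<value> ...' with 1-based feature indices."""
    pairs = " ".join("%d:%d" % (i + 1, f) for i, f in enumerate(features))
    return str(price) + " " + pairs


# Made-up rows: date, postcode, property type, new-build flag, price
rows = [
    ["2014-01-03", "AB1 2CD", "T", "N", "250000"],
    ["2014-01-03", "EF3 4GH", "D", "N", "410000"],
    ["2014-02-11", "AB1 2CD", "T", "Y", "255000"],
]
encoded = encode_rows(rows, price_index=4)
for features, price in encoded:
    print(csv_line(features, price), "|", libsvm_line(features, price))
```

One thing worth keeping in mind with this kind of encoding: the integers are arbitrary labels, so any ordering a model reads into them (postcode 7 being ‘bigger’ than postcode 3) is an artefact of the order in which values were first seen.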

Now you’ve got your input files, it’s child’s play to run our algorithms to create models that we can pass new data to. I’ll create a separate post detailing the output of the above two classifiers but it looks like I’ll have to leave them running through the night!

Until then.

© 2017 DogDogFish
