Building a Recommendation Service on AWS with Mahout

Hi all,

Those (two) of my regular readers will know I frequently curse my fool-of-a-Took laptop. With this in mind, I’ve been meaning to write a tutorial on working with Amazon’s web services for when you don’t have the hardware you need. This isn’t that post, but I will assume you’ve already got a cluster to hand with all the bits and bobs you need installed.

If you don’t, don’t worry about it (for now) – I’ll do a post on how to get one of those set up at a later date. For now I’d like to talk about collaborative filtering. This is one of the most prevalent techniques for generating recommendations and with good reason – it’s pretty good. It’s quite a common sense method:

I know that person A likes items 1, 2 and 3 but not item 4.
I know that person B likes item 4 and 3 but not 1 or 2.
I know that person C likes item 1 but not 2, 3 or 4.
I know that person D doesn’t like item 4 – which of the items 1, 2 or 3 should I recommend them?

I’m sure most people reading this would recommend item 1, then probably item 2, then item 3 – certainly item 1 with the highest preference. Now if you deconstruct your thought process in coming up with that recommendation, I’d wager (not a lot) that it’d be akin to the collaborative filtering algorithm.

Based on the behaviour of people A, B and C we can start to get an idea about which items are rated favourably by the same people versus those that aren’t. User-based collaborative filtering is about finding similar users to person D and looking at what they liked. Item-based collaborative filtering is about finding similar items to the ones person D has rated positively and recommending those.
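If you fancy seeing that intuition in numbers, here’s a minimal sketch of item-based similarity using cosine similarity – nothing to do with Mahout’s internals, just the A/B/C/D example above written out:

import numpy as np

# Rows are people A, B and C; columns are items 1-4.
# 1 = likes the item, -1 = doesn't like the item.
ratings = np.array([[ 1,  1,  1, -1],   # person A
                    [-1, -1,  1,  1],   # person B
                    [ 1, -1, -1, -1]])  # person C

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Item-based: compare each candidate item against item 4, the one
# item person D has expressed a (negative) preference about.
for item in range(3):
    similarity = cosine(ratings[:, item], ratings[:, 3])
    print "item %d vs item 4: %.2f" % (item + 1, similarity)

Item 1 comes out least similar to the disliked item 4 (-1.00), then item 2 (-0.33), then item 3 (0.33) – the same ordering we reasoned our way to above.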

There are lots of nice articles on collaborative filtering – more often than not when it comes to machine learning I end up recommending Andrew Ng material and this is no exception – check out his machine learning course on Coursera. There’s a whole lecture on collaborative filtering and it’s boss. If my explanation wasn’t enough (if you’re serious about this sort of thing, it shouldn’t be) check out his lectures on the subject and read around on the mathematical details of it all. Make sure you polish up on your linear algebra – you’ll have very large sparse matrices heading your way!

Also, please note that this whole post is just a re-imagining of this post by Amazon: Recommendations using EMR – I liked it and thought I’d give it a go myself with a different data set and technologies I was more familiar with. Now on with the show…

The Data Set

I’m going to be using a publicly available data set containing user ratings of books gathered on Amazon – dataset available here: Book Data Set

There are three files in there – one with all the details of the books, one with all the details of the users and finally one with all the user ratings. Don’t worry about downloading them yet if you’re following this as a tutorial/example – we’ll stick that bit in a script later on.

The Cluster Set-Up

Given that I’m on the free tier and I don’t use AWS at work (we’ve got a nice cluster that handles all our data just fine, thank you), I’m a bit ‘over-compute’ happy at the moment and so I’m going to use a 5-node cluster (1 master node and 4 slaves). This isn’t really necessary, but then neither is this piece of work, so I’m not that bothered about practicality at this point. The benefit of using Amazon’s Elastic MapReduce is that the nodes all come with Hadoop (& Hive/Pig/Mahout/whatever) installed and set up, so you don’t have to bother with any of that.

Building a Service

I want to demonstrate how easy it is to turn this sort of thing into a recommendations service (instead of a one-off bit of analysis). Now I’m going to stress that I’m setting this up as an offline batch machine learning method and that there are a few steps I’d change if I were actually setting this up as a real, ‘production’, service.

I’m going to use a Redis database because I’ve got a reasonable amount of memory on these boxes (Redis is an in-memory key-value store), because Redis is really really quick and because I’m used to using Redis with Python. Redis will store the output of my collaborative filtering and I’ll stick a lightweight web framework (Flask) on top of it. This will allow a client to retrieve recommendations for a given user as well as delete them or overwrite them. I’ve decided to add the latter two functions because we don’t live in a perfect world.

I’ll also write a script that’ll update the books, users and ratings files and stick it on a cronjob to run every day. That way, by copying this method you’ll end up with a reasonably scalable, reasonably fast and reasonably up-to-date recommendation service. It’s not the best one out there (heck, it might be the worst) but it goes to show how simple these things are to set up given all the tools that are already written…
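I won’t pretend the refresh script below is the one I actually run (the file names and schedule here are mine, for illustration only), but as a sketch it’d be something along these lines, kicked off by a daily crontab entry such as 0 3 * * * /usr/bin/python /home/hadoop/update_data.py:

#!/usr/bin/python
# update_data.py - hypothetical daily refresh, shown here only as a sketch.
# Re-downloads the Book-Crossing dump and re-cleans it; you'd then re-run
# the ISBN-mapping script, the Mahout job and the Redis load further down.
import os

os.system("wget -N http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip")
os.system("unzip -o BX-CSV-Dump.zip")
os.system("sed '1d;' BX-Book-Ratings.csv | tr ';' ',' | tr -d '\"' > Book-Ratings.csv")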

The Code

Firstly, we’ll create the cluster using the Elastic Mapreduce CLI:

./elastic-mapreduce --create --alive --name recommendation-service --num-instances 5 --master-instance-type m1.large --slave-instance-type m2.xlarge --ami-version 3.1 --ssh

The --ssh flag means that once the cluster has been created and all the necessary bits and bobs installed, we SSH straight into the master node.

Extracting the data set

 
wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
unzip BX-CSV-Dump.zip
# Drop the header row, swap the semicolon delimiters for commas and strip the quotes
sed '1d;' BX-Book-Ratings.csv | tr ';' ',' | tr -d '"' > Book-Ratings.csv

Unfortunately for us, we’ve got a bit of work to do – the ‘out of the box’ Mahout recommendation algorithm requires user and book IDs to be integers, and ISBNs are not. To get around this, we use this nifty little script:

#!/usr/bin/python

isbn_dictionary = {}

counter = 0

# Map each ISBN to a unique integer - Mahout needs integer IDs - writing out
# both the re-keyed ratings and the ISBN -> integer mapping (we'll need the
# mapping later to translate Mahout's output back into ISBNs).
with open('Book-Ratings.csv', 'rb') as f:
    with open('New-Book-Ratings.csv', 'w') as g:
        with open('Book-Mappings.csv', 'w') as h:
            for line in f:
                try:
                    user, book, rating = line.strip().split(',')
                except ValueError:
                    # Skip any row that doesn't split cleanly into three fields
                    continue
                try:
                    index = isbn_dictionary[book]
                except KeyError:
                    isbn_dictionary[book] = counter
                    index = counter
                    h.write("%s,%d\n" % (book, counter))
                    counter += 1
                g.write("%d,%d,%d\n" % (int(user), index, int(rating)))

Now we’ve got data in the format our Mahout recommender is expecting, let’s put it into HDFS and run our algorithm:

hadoop fs -put New-Book-Ratings.csv /New-Book-Ratings.csv
mahout recommenditembased --input /New-Book-Ratings.csv --output recommendations --numRecommendations 10 --similarityClassname SIMILARITY_COSINE
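That job writes its results to HDFS under recommendations/part* as lines of the form userID, then a tab, then [itemID1:score1,itemID2:score2,...] – worth knowing, as that’s the format we’ll be parsing shortly.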

Now we play the waiting game…
Actually, while that’s running, open up another SSH session to the master node and run these commands:

sudo easy_install redis flask
wget http://download.redis.io/releases/redis-2.8.13.tar.gz
tar xzf redis-2.8.13.tar.gz
cd redis-2.8.13
make
./src/redis-server &
cd

Now Redis is installed and running and we’ve got the necessary Python libraries, we can load the recommendations into Redis with the following script…

import redis
import os

isbn_dictionary = {}

# Rebuild the integer -> ISBN mapping so we can translate Mahout's output back
with open('Book-Mappings.csv', 'rb') as f:
    for line in f:
        isbn, counter = line.strip().split(',')
        isbn_dictionary[int(counter)] = isbn

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# Each output line looks like: userID<TAB>[itemID1:score1,itemID2:score2,...]
p = os.popen('hadoop fs -cat recommendations/part*')

for recommendation in p:
    user, recommendations = recommendation.split('\t')
    # Strip the brackets, split the list up and drop the scores to leave the item IDs
    recommendations = [entry.split(':')[0] for entry in recommendations.replace('[', '').replace(']', '').split(',')]
    # Translate the integer IDs back into ISBNs and store the list against the user
    recommendations = [isbn_dictionary[int(entry)] for entry in recommendations]
    r.set(int(user), recommendations)

Now we’ve populated our database with all the recommendations we generated with our Mahout job, we’ll set something up (called webserver.py) to return these items to a client (as well as allow modifications to be made).

import redis
from flask import Flask, request

r = redis.StrictRedis(host='localhost', port=6379, db=0)

app = Flask(__name__)

@app.route('/user/<user_id>', methods = ['GET', 'POST', 'PUT', 'DELETE'])
def restful_api(user_id):
    if request.method == 'GET':
        recommendations = r.get(int(user_id))
        return recommendations if recommendations else "User %d does not exist" % int(user_id)
    elif request.method == 'POST':
        if request.headers['Content-Type'] == 'application/json':
            r.set(int(user_id), request.json)
            return "Successfully set recommendations for user %d" % int(user_id)
        else:
            return "Require JSON POST"
    elif request.method == 'DELETE':
        if r.exists(int(user_id)):
            r.delete(int(user_id))
            return "Successfully deleted recommendations for user %d" % int(user_id)
        else:
            return "User %d does not exist" % int(user_id)
    elif request.method == 'PUT':
        if request.headers['Content-Type'] == 'application/json':
            r.set(int(user_id), request.json)
            return "Successfully set recommendations for user %d" % int(user_id)
        else:
            return "Require JSON PUT"
    else:
        return "No idea what you've done here, but don't do it again"

if __name__ == '__main__':
    app.run()

Now we’ve built our client-facing recommendations service, let’s ‘deploy’ it and give it a test:

python webserver.py &
curl -X GET localhost:5000/user/277965
curl -X DELETE localhost:5000/user/277965
curl -X GET localhost:5000/user/277965
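If you want to overwrite a user’s recommendations by hand, the PUT route can be exercised from Python as below (the ISBNs here are placeholders I’ve made up). One thing to note: app.run() as written only listens on localhost, so you’d want app.run(host='0.0.0.0') before anything off the box could reach it.

import json
import urllib2

# Placeholder recommendations - swap in real ISBNs from Book-Mappings.csv
payload = json.dumps(["0971880107", "0316666343"])

request = urllib2.Request('http://localhost:5000/user/277965', data=payload,
                          headers={'Content-Type': 'application/json'})
request.get_method = lambda: 'PUT'  # urllib2 defaults to POST when data is set
print urllib2.urlopen(request).read()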

Summary

We’ve not done anything that tricky but in a fairly short time we’ve managed to build something that’d be at least part way useful to a small business somewhere looking for recommendations for their users. It’s not a great leap from this to a genuine production-ready system – you don’t need 5 Hadoop nodes running all the time for a 20-minute Mahout job, and I’m uncomfortable using Flask and local files for a serious system. As such, you’d likely put together a couple of nodes, each with their own copy of the recommendations set in memory. One of them would create a cluster, get any new data and generate the data set as often as you saw fit – communicating it to the other node. Finally, you’d add a few more actions to the server and make sure it could be called by whichever clients need it (and block everybody else).

As I say, nothing earth-shattering but a solid base on which to build a recommendations system.

I am become death.

 

Kaggle Digit Recognizer Problem with Adaboost

Hi everybody,

Hi Dr Nick. But enough of that – today I’m going to be working through a Kaggle problem. For those of you who don’t know Kaggle, I can’t advise in favour of it strongly enough. It’s a great place to have a go at applying various machine learning techniques to real data sets. There’s a leaderboard, discussions on methods and some none-too-shabby prizes. I’ll come clean at this point – I’m not a natural salesperson.

I believe there’s something of a taboo against posting solutions/methods for Kaggle – however, I think I’m good to write about a method of solving this particular problem. The digit recognizer problem seems to be a rolling competition with a bunch of already published results and a few training classes on how to solve it. Let me know if you think this is overstepping the mark.

So, the problem:
Given a big set (42,000 examples) of labelled training data – 28 x 28 black and white images of handwritten digits (0-9) – are we able to correctly identify other (identically dimensioned) handwritten digits?
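Before doing anything clever, it’s worth a quick look at the layout of the data (this assumes Kaggle’s train.csv is in the working directory): the first column is the label and the remaining 784 columns are the pixel intensities of the 28 x 28 image, row by row.

import csv
import numpy as np

with open('train.csv', 'rb') as f:
    reader = csv.reader(f)
    header = reader.next()    # 'label', 'pixel0', ..., 'pixel783'
    first_row = reader.next()

label = first_row[0]
image = np.array(map(int, first_row[1:])).reshape(28, 28)
print label, image.shape      # the digit's label and a (28, 28) array of pixels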

There are a whole bunch of ways of doing this and the method I’ve had best success with is Support Vector Machines (using LibSVM). I may post an example of how to run that for this particular example but today I’d like to look at Adaboost (using the MultiBoost package)…

Until fairly recently I was entirely ignorant of Adaboost – I came across it on a different Kaggle problem (the Higgs one). There, a number of ‘out of the box’ methods were showcased – the most successful of which was Adaboost. A bit of reading on Adaboost suggests that it’s a fairly well-regarded, and successful method of performing a range of machine learning tasks. It’s also sometimes cited as being the best ‘out of the box’ (not specifically designed for the task at hand) algorithm in machine learning.

My current intuition on Adaboost is that it’s basically a ‘rule of thumb’ algorithm. It takes a lot of very simple decision boundaries and uses them to create a more complicated decision space. I say rule of thumb because I imagine a car mechanic or a doctor trying to diagnose a fault. The patient presents with symptom x, that makes a whole bunch of things less likely. However, if the patient falls into this age bucket and this ethnic group, some of the previously discounted things become more likely. I don’t know if that sort of explanation helps you but I quite like it. Basically, you create a simple rule that’s more often right than wrong. However, you can then update it with as many exceptions as you’ve got other bits of data. I think that’s a lot how the human decision-making process goes.
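If you’d like that intuition as actual mechanics, here’s a minimal sketch of binary AdaBoost with threshold stumps on a single feature. This is the textbook weight-update loop, not what MultiBoost does internally:

import numpy as np

def adaboost_stumps(X, y, n_rounds=10):
    """Minimal binary AdaBoost with threshold stumps on a 1-D feature.
    Expects X as a numpy array and labels y in {-1, +1}. Illustrative only."""
    n = len(X)
    w = np.ones(n) / n                  # start with uniform example weights
    ensemble = []                       # (alpha, threshold, polarity) triples
    for _ in range(n_rounds):
        best = None
        for threshold in np.unique(X):  # try every observed value as a split
            for polarity in (1, -1):
                pred = polarity * np.where(X < threshold, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = max(err, 1e-10)                    # guard against a perfect stump
        alpha = 0.5 * np.log((1.0 - err) / err)  # this stump's say in the vote
        pred = polarity * np.where(X < threshold, 1, -1)
        w *= np.exp(-alpha * y * pred)           # upweight what it got wrong
        w /= w.sum()
        ensemble.append((alpha, threshold, polarity))
    return ensemble

def predict(ensemble, X):
    scores = sum(a * p * np.where(X < t, 1, -1) for a, t, p in ensemble)
    return np.sign(scores)

Each round picks the stump that does best against the current weights, then bumps up the weight of every example that stump got wrong – forcing the next stump to concentrate on the exceptions, which is exactly the ‘rule plus exceptions’ picture above.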

Anyway, all this talking isn’t getting us closer to a juicy set of predictions. Mad props to whoever first generated this particular procedure for the Higgs problem – I’ve shamelessly ripped it off, only making changes where necessary for this problem.

#!/usr/bin/python

import random
import csv
import subprocess
import numpy as np

def DataToArff(dataset, labels, header, title, filename):
    """
    Turn an arbitrary data set into a .arff file.
    These files let us import the data into MultiBoost or Weka (amongst other
    machine learning libraries).
    """
    with open(filename + ".arff", 'w') as f:
        f.write('@RELATION ' + title + '\n\n')
        for feature in header:
            f.write('@ATTRIBUTE ' + feature + ' NUMERIC\n')
        f.write('@ATTRIBUTE class {0,1,2,3,4,5,6,7,8,9}\n')
        f.write('\n@DATA\n')
        ## We could do this using all_data - however, we need the labels for further work
        ## Additionally, if the labels were numeric variables we'd be able to leave the rest of our work unchanged and handle them here
        for datarow, label in zip(dataset, labels):
            for value in datarow:
                f.write(str(value) + ',')
            f.write(str(label) + '\n')

all_data = list(csv.reader(open('train.csv', 'rb'), delimiter=','))
header = np.array(all_data[0][1:])
dataset = np.array([map(float, row[1:]) for row in all_data[1:]])
(numpoints, numfeatures) = dataset.shape

# Labels are in the first column of each row
labels = np.array([row[0] for row in all_data[1:]])

randomPermutation = random.sample(range(len(dataset)), len(dataset))
## If this breaks halfway through, we'll be glad to be able to load our random permutation
np.savetxt('randomPermutation.csv', randomPermutation, fmt='%d', delimiter=',')

## I'll change the proportion of the train set and see how we get on.
numpointsTrain = int(numpoints*0.75)
numpointsValidation = numpoints - numpointsTrain

## Because we've got a random permutation there's no problem taking slices of the total set to sort into train and validation
datasetTrain = dataset[randomPermutation[:numpointsTrain]]
datasetValidation = dataset[randomPermutation[numpointsTrain:]]

labelsTrain = labels[randomPermutation[:numpointsTrain]]
labelsValidation = labels[randomPermutation[numpointsTrain:]]

DataToArff(datasetTrain, labelsTrain, header, 'DigitsML_challenge_train', 'training')
DataToArff(datasetValidation, labelsValidation, header, 'DigitsML_challenge_validation', 'validation')

## Our Adaboost parameters are wholly contained in the relevant config files
p1 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'config.txt'])
p1.wait()
p2 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresValidation.txt'])
p2.wait()

testText = list(csv.reader(open('test.csv', 'rb'), delimiter=','))
datasetTest = np.array([map(float, row[:]) for row in testText[1:]])
labelsTest = np.repeat('0', len(testText) - 1)

DataToArff(datasetTest, labelsTest, header, 'Digit_challenge_test', 'test')

p3 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresTest.txt'])
p3.wait()

testScoresText = list(csv.reader(open('scoresTest.txt', 'rb'), delimiter=','))
with open('submission.csv', 'w') as f:
    f.write('ImageId,Label\n')
    for index, entry in enumerate(testScoresText):
        ## Cast the scores to floats and take the index of the row's maximum
        ## value - this is the most likely digit
        f.write(str(index + 1) + "," + str(np.argmax(map(float, entry))) + '\n')

I’d like to think that that bit of code is reasonably transparent and clear on what it’s doing. If I’m wrong, a basic explanation:
1.) Randomly split the data into a training and validation set
2.) Create .arff files for both of these sets
3.) Run Multiboost (our Adaboost implementation) on the training set and validation set
4.) Using the files created from our train/validation Adaboost, get the test set and generate predictions (again using Multiboost)
5.) Generate a submission file in the required format

Easy does it. Now for the configuration files that I’m using:

config.txt

fileformat arff
verbose 2
learnertype TreeLearner
constant
seed 50
weightpolicy balanced
baselearnertype SingleStumpLearner 8
outputinfo results.dta e01w01auc
traintest training.arff validation.arff 5000
shypname shyp.xml
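A quick gloss, going from the MultiBoost documentation as best I recall it (so treat this as my reading rather than gospel): traintest takes the training file, the validation file and the number of boosting iterations (5,000 here); shypname is where the learned model – the ‘strong hypothesis’ – gets written; and the posteriors line in the two files below replays that saved model over a data set, writing the per-class scores out to a text file.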

configScoresValidation.txt

posteriors validation.arff shyp.xml scoresValidation.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

configScoresTest.txt

posteriors test.arff shyp.xml scoresTest.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

I’ll be honest and say I’ve not really found the ideal set-up for this problem here. I’m able to get a score of around 0.965 using the configuration above but if you look at the leaderboard you’ll see that’s not all that good. Certainly the LibSVM method performed much better (something like 0.99). Not to worry – it’s doing the right thing, generating decent results, and it’s another tool in our arsenal.

The World Cup may stymie my blog posts for a bit – then again, supporting England, it might only be 3 games.

Football’s coming home.
