Data Science, amongst other things.

Average House Price Visualization using Python and Google Charts

Hi all,

Only yesterday I came across a rich store of data that I had hitherto been unaware of; namely, data.gov.uk. Giddy with joy, I perused the mountains of interesting data and thought it’d be fun to pull together a visualization based on some of it. One particular set caught my eye: all of the house sales in the UK in the last 19 years (link to the data at the bottom).

So,  the first question that sprung to mind was “How does the average house price vary by region?”

First things first, let’s calculate the average house price per locale using (only) Python:


def average_price(year):
    with open('pp-' + str(year) + '.csv', 'rb') as f:
        lines = f.readlines()[1:]
        for line in lines:
            try:
                identifier, price, date, postcode, type_of_house, new, freehold_or_leasehold, address_1, address_2, address_3, address_4, address_5, address_6, address_7, letter = line.split(',')
            except:
                continue
            try:
                data_dictionary[address_7.strip('"').lower()].append(int(price.strip('"')))
            except:
                data_dictionary[address_7.strip('"').lower()] = [int(price.strip('"'))]
        final_results = dict((key, sum(value)/float(len(value))) for key,value in data_dictionary.iteritems())
        return final_results

So, we’ve built ourselves a nice little function that’ll take the year as an input, open up the relevant data file, calculate the average house price by region and return the result in a dictionary. Easy does it.

Please note at this point that there are a million ways to do what we’ve just done ranging from sticking the raw CSV into Excel and using pivot tables, loading the data into R and performing an aggregate on the region, using Panda’s excellent data frames or in fact using a bash one-liner (a favourite of mine):
cut -d ',' -f2,14 pp-2014.csv | tr -d '"' | awk '{region[$2] += $1; region_count[$2]++;} END { for (area in region) print area"t"region[area]/region_count[area]}' .
However, I’m sticking to Python for reasons hopefully soon to become clear.

So now we can see that Stoke-on-Trent is very cheap and London is very expensive. Can we see this data changing over time?

I decided the nicest way of piecing this together was using a jQuery slider to select the year, a Google Geocharts frontend to visualize the data and then a lightweight Python web framework to hold the whole thing together. I chose web.py because I’ve used it before and think it’s great for work with AJAX and is also useful when you’ve already written your Python functions and just need something that won’t get in your way too much.

I’m not going to show all the code here but you can find my finished versions on my Github:
Back end
Front end

There are a couple of details that are contained in that code that I’ve not dealt with.

Firstly, Google Charts API doesn’t work with the place names listed in the Government data. As such, you’ll see I’ve written a little lookup function to map Government place names to ISO-3166 Codes as required by Google. There’s a bit of fuzzy matching going on here but if you navigate around in that repo, you’ll find I tested a few things and settled on a decent solution. When I can be bothered I’ll go and tidy that up by filling in the missing ISO codes manually.

Secondly, you’ll notice (if you get this running on your own computer) that the visualization is fairly slow. It runs calculations over the entire data set each time a query is run. What’s more, it then tries to render around 100 points on the Google Chart. Given that there are only a limited number of ways you’d ever want the user to be able to query the data and that the data doesn’t change day on day, you’d want to pre-aggregate the results and store them in a database somewhere.

Thirdly, you’ll note that this blog doesn’t contain the visualization. Pretty shoddy on my part, just haven’t got around to doing that yet.

That’s one of the problems with this data science malarkey, I could spend my time building my own blog platform that allows me to serve simple web apps. I could spend my time sticking in a simple database caching solution to speed up the apps on localhost. I could tidy up the fuzzy matching on the ISO codes to create an 100% correct mapping. However, it seems much more interesting to head off and see what else this data contains.

Next stop – is there a better time to buy/sell a house? Do house prices go up in certain months? I’ll try to answer that question fairly thoroughly with due consideration to statistical significance along the way.

1 Comment

  1. Ole Dam Møller

    To fetch all the files in one easy go;

    for y in `seq 1995 2014`
    do
    wget http://publicdata.landregistry.gov.uk/market-trend-data/price-paid-data/b/pp-${y}.csv
    done

Leave a Reply

Your email address will not be published. Required fields are marked *

© 2024 DogDogFish

Theme by Anders NorenUp ↑