DogDogFish

Data Science, amongst other things.

Tag: visualization

Best Housing Investments of the last 20 years

Hey all,

So from that long list I posted I’ve decided I’m most interested in the fastest growing and falling towns in the UK, as measured by the average house selling price. Once we’ve got a bunch of hot/not towns we might even be able to have a look at what these towns looked like before their boom. The hope is then that we’ll be able to find towns in that situation right now and boom, we’re housing moguls.

Given my poor excuse for hardware, we’ll have to start in bash:

 awk -F, '{print $3"-"$4"-"($(NF-3))}' pp-all.csv | tr -d '"' | cut -d '-' -f1,2,3,5 | tr '-' 't' | awk '{summary_array[$2"-"$3"-"$4] += $1; count_array[$2"-"$3"-"$4]++} END {for (region in summary_array) print region"t"summary_array[region]/count_array[region]}' | tr '-' 't' | sort -nk1 -nk2 > average_sale_price_by_town_and_month.txt 

where pp-all.csv is all the UK housing data as downloaded from data.gov.uk. The format of the data and what the above code does is inferrable from the following R code:

library(reshape2)
library(plyr)
library(ggplot2)
myData <- read.csv('average_sale_price_by_town_and_month.txt', header=F, sep='t')
colnames(myData) <- c("Year", "Month", "Town", "Average_Price")
## Let's not make the same mistakes as we did last time - which data should we remove pre-analysis?
summary(count(myData, c('Town'))$freq)
## From that, I'm going to say let's remove any town without 229 points.
myData <- myData[!(myData$Town %in% levels(myData$Town)[(count(myData, c('Town'))$freq < 229)]),]

I’m going to break out of code mode to explain that last line because it is horrible. I’m first grouping my data by town and counting the number of entries – in SQL this’d be SELECT count(date) FROM myData GROUP BY Town . I’m then comparing every entry in the list to 229 (the max number of points each town can have) and producing a logical list of the same size as the number of towns indicating whether the town at that index has a full complement of points. levels(myData$Town) gives me a list of all the towns which is placed against the previously calculated logical list – only towns matching a TRUE are kept. At that point, we’ve got a list containing all the towns we want to keep – myData$Town %in% compares the Town column of myData against this list and acts like a SQL where clause. Finally, in confusing developments, I’ve inverted all of the above logic with an ! – this’ll now only keep columns where the number of entries per town is greater than 228. The comma before the square bracket says to include all columns (we could add filters there if we liked). We assign all of this to itself, in effect filtering the original data frame by removing any lines belonging to towns which don’t have a full complement of points. All in that one line.

In practical advice for the novice R coder (which I’d class myself as) – start with the smallest bit of code you can and then add bits on when you’re comfortable with what they’re doing. Actually, that’s not just true of R – the bash one-liner above would likely be best constructed in exactly the same way.

Anyway, where were we?

myData$Datey <- as.Date(paste(myData$Year, myData$Month, 1, sep='-'), '%Y-%m-%d')
myData <- myData[,!(names(myData) %in% c("Year", "Month"))]
ts_frame <- dcast(myData, Datey ~ Town, value.var="Average_Price")
row.names(ts_frame) <- ts_frame$Datey
ts_frame <- ts_frame[,!(names(ts_frame) %in% c("Datey"))]
growth_frame = data.frame(matrix(0, ncol=1, nrow=length(ncol(ts_frame))))
row.names(growth_frame) <- names(ts_frame)
for (i in 1:ncol(ts_frame)) {
  total_growth <- 100*((ts_frame[length(ts_frame[,i]),i] - ts_frame[1,i])/ts_frame[1,i])
  growth_frame[[names(ts_frame)[i]]] = total_growth
}
long_growth_frame <- melt(growth_frame)
colnames(long_growth_frame) <- c("Town", "Percentage_growth")
long_growth_frame <- long_growth_frame[long_growth_frame$Percentage_growth > 0.1,]
ggplot(long_growth_frame, aes(x=Percentage_growth)) + geom_density() + ggtitle("Percent Uplift in UK housing prices between 1995 and 2014")

giving us:

Growth in UK House Prices

Density of the Percentage Growth of Average House Price (by town) in the UK between 1995 and 2014

As we can see, the bulk of towns experienced between 100% and 400% growth in that time – if anybody can point me towards UK salary figures over that time period I think that’d be a nice set to join this with.

Anyway, let’s take what we’ve done in the previous post on house prices and plot the data on a UK map. There are too many points for me to reasonably plot all of them – let’s go with the top 20 (Red diamonds) and bottom 20 (black crosses):

ordered_growth_frame <- long_growth_frame[order(long_growth_frame$Percentage_growth),]
library(maps)
library(mapdata)
library(RCurl)
library(RJSONIO)
## A couple of functions allowing us to dynamically get the longitude and latitude of regions
construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

gGeoCode <- function(address,verbose=FALSE) {
  if(verbose) cat(address,"n")
  u <- construct.geocode.url(address)
  doc <- getURL(u)
  x <- fromJSON(doc,simplify = FALSE)
  if(x$status=="OK") {
    lat <- x$results[[1]]$geometry$location$lat
    lng <- x$results[[1]]$geometry$location$lng
    return(c(lat, lng))
  } else {
    return(c(NA,NA))
  }
}

map('worldHires',  c('UK', 'Ireland', 'Isle of Man','Isle of Wight'), xlim=c(-7,2), ylim=c(50.1,58.7))  
long_and_lat <- data.frame(sapply(paste(head(ordered_growth_frame, n=20)$Town, ", UK", sep=''), function(x) gGeoCode(x)))
row.names(long_and_lat) <- c("Latitude", "Longitude")
long_and_lat <- data.frame(t(long_and_lat))
points(long_and_lat$Longitude, long_and_lat$Latitude, col=1, pch=4)
long_and_lat <- data.frame(sapply(paste(tail(ordered_growth_frame, n=20)$Town, ", UK", sep=''), function(x) gGeoCode(x)))
row.names(long_and_lat) <- c("Latitude", "Longitude")
long_and_lat <- data.frame(t(long_and_lat))
points(long_and_lat$Longitude, long_and_lat$Latitude, col=2, pch=5)
title('Fastest/Slowest Growing House Prices - UK (1995-2014)')
legend("topright", legend=c("Fastest", "Slowest"), title="Legend", bty="n", pch=c(5,4), col=c("red", "black"), inset=c(-0.05,0))

giving us:

UK Growth Map

The towns with the fastest and slowest growth in average house price since 1995.

So it looks like the South Coast has been the place to buy houses in the last 20 or so years. And the North East/North West were the places to be avoided.

OK – that’s all well and good but it doesn’t really tell us anything about the area we should be buying houses in now. Hold your horses. I’m getting to that. Obviously we’re not really going to be able to learn anything looking at the price difference between the end of our ‘test’ period and the start of our ‘test’ period. We need to build our model over a subset of this data, and test it against the remaining data.

For my next trick (blog post) I’ll look at predicting the fastest growing regions. As a sneak peek, to do we’ll use growth % as the metric we’re trying to predict (a continuous variable) and we’ll create features out of the input data set. I don’t know which method we’ll use yet but it’ll be one of linear regression, SVM regression or neural networks. Likely whichever is best supported by the language I choose to use. I’ve used Libsvm before and found it very good so maybe that.

What we do in life echoes in eternity.

Average House Price Visualization using Python and Google Charts

Hi all,

Only yesterday I came across a rich store of data that I had hitherto been unaware of; namely, data.gov.uk. Giddy with joy, I perused the mountains of interesting data and thought it’d be fun to pull together a visualization based on some of it. One particular set caught my eye: all of the house sales in the UK in the last 19 years (link to the data at the bottom).

So, ┬áthe first question that sprung to mind was “How does the average house price vary by region?”

First things first, let’s calculate the average house price per locale using (only) Python:


def average_price(year):
    with open('pp-' + str(year) + '.csv', 'rb') as f:
        lines = f.readlines()[1:]
        for line in lines:
            try:
                identifier, price, date, postcode, type_of_house, new, freehold_or_leasehold, address_1, address_2, address_3, address_4, address_5, address_6, address_7, letter = line.split(',')
            except:
                continue
            try:
                data_dictionary[address_7.strip('"').lower()].append(int(price.strip('"')))
            except:
                data_dictionary[address_7.strip('"').lower()] = [int(price.strip('"'))]
        final_results = dict((key, sum(value)/float(len(value))) for key,value in data_dictionary.iteritems())
        return final_results

So, we’ve built ourselves a nice little function that’ll take the year as an input, open up the relevant data file, calculate the average house price by region and return the result in a dictionary. Easy does it.

Please note at this point that there are a million ways to do what we’ve just done ranging from sticking the raw CSV into Excel and using pivot tables, loading the data into R and performing an aggregate on the region, using Panda’s excellent data frames or in fact using a bash one-liner (a favourite of mine):
cut -d ',' -f2,14 pp-2014.csv | tr -d '"' | awk '{region[$2] += $1; region_count[$2]++;} END { for (area in region) print area"t"region[area]/region_count[area]}' .
However, I’m sticking to Python for reasons hopefully soon to become clear.

So now we can see that Stoke-on-Trent is very cheap and London is very expensive. Can we see this data changing over time?

I decided the nicest way of piecing this together was using a jQuery slider to select the year, a Google Geocharts frontend to visualize the data and then a lightweight Python web framework to hold the whole thing together. I chose web.py because I’ve used it before and think it’s great for work with AJAX and is also useful when you’ve already written your Python functions and just need something that won’t get in your way too much.

I’m not going to show all the code here but you can find my finished versions on my Github:
Back end
Front end

There are a couple of details that are contained in that code that I’ve not dealt with.

Firstly, Google Charts API doesn’t work with the place names listed in the Government data. As such, you’ll see I’ve written a little lookup function to map Government place names to ISO-3166 Codes as required by Google. There’s a bit of fuzzy matching going on here but if you navigate around in that repo, you’ll find I tested a few things and settled on a decent solution. When I can be bothered I’ll go and tidy that up by filling in the missing ISO codes manually.

Secondly, you’ll notice (if you get this running on your own computer) that the visualization is fairly slow. It runs calculations over the entire data set each time a query is run. What’s more, it then tries to render around 100 points on the Google Chart. Given that there are only a limited number of ways you’d ever want the user to be able to query the data and that the data doesn’t change day on day, you’d want to pre-aggregate the results and store them in a database somewhere.

Thirdly, you’ll note that this blog doesn’t contain the visualization. Pretty shoddy on my part, just haven’t got around to doing that yet.

That’s one of the problems with this data science malarkey, I could spend my time building my own blog platform that allows me to serve simple web apps. I could spend my time sticking in a simple database caching solution to speed up the apps on localhost. I could tidy up the fuzzy matching on the ISO codes to create an 100% correct mapping. However, it seems much more interesting to head off and see what else this data contains.

Next stop – is there a better time to buy/sell a house? Do house prices go up in certain months? I’ll try to answer that question fairly thoroughly with due consideration to statistical significance along the way.

© 2018 DogDogFish

Theme by Anders NorenUp ↑