Data Science, amongst other things.

# Tag: housing market

Hey all,

I’ve been tentatively threatening to write this post for a while now and I’ve been itching to do a bit of machine learning – I’m going to be walking through the steps required to run linear regression and an SVM on our housing sale data to try to predict future house sale prices. I’m not overly confident that it’s going to give us a huge deal of predictive power – any example I see online uses useful things like area of house or number of bedrooms. All I’ve got is when you sold it, where you sold it and the type of house (new or old, detached or terraced, freehold or leasehold) e.t.c. Not to worry – it’ll be a blast all the same.

First things first, which machine learning technique should we use? We’ll be using a supervised algorithm as we have a labelled training set (we can tell our classifier what the right answer is). We’re looking at a regression problem, not a classification problem (we’ve trying to predict a continuous variable, not a discrete one). All in all, that’s screaming linear regression to me. As an additional bonus though, I also happen to know that SVMs work very well for this kind of problem and have used Libsvm for similar things in the past.

In terms of the bits of kit I’ll be using – let’s start off, as I always seem to, in bash.

`cut -d ',' -f3- pp-all.csv | cut -d ',' -f1-13 | tr -d '"' > pp_all_for_weka.txt `

Just a bit of formatting but, if you’ve followed the posts through then you’ll have pp-all.csv on your computer. If you’ve not see here for how to get it.

Now we’re here, you may notice the strange filename – Weka is a machine learning library for Java. We’ve been using it a bit at work recently and I fancied getting a bit more experience of it. I’ll be using Weka for the linear regression; I’ve not got the heart to do a Weka installation post just yet (it’s not difficult, I’m just tired) but will do one if there’s any demand. I’m going to be using LibSVM for my support vector machine calculations and again, not going to talk you through the install unless you fancy it. The reason I’ve told you about that is because I’m now going to convert my CSV file into an ARFF file and a libsvm formatted file. While I’m at it, I’m going to convert all of my values (postcodes, dates e.t.c.) into numbers. Doing this allows me to very easily feed this data set into the above programs and get an answer out relatively easily.

Could we write the algorithms ourselves? Sure – but not nearly as well as they’ve already been written. Sidebar: if you’re interested in understanding how all these algorithms work I’d encourage you to check out Andrew Ng’s lecture series on Coursera. It’s excellent.

Anyway, to run the conversion and to output the two different files I wrote the following Python script:

```#!/usr/bin/python

class mapping_dictionary(object):

def __init__(self, output_file, svm_file):
self.mapping = {}
self.mapping_count = {}
self.writer_file = open(output_file, 'w')
self.svm_file = open(svm_file, 'w')

def shut_file(self):
self.writer_file.close()
self.svm_file.close()

if column not in self.mapping:
self.mapping[column] = {}
self.mapping_count[column] = 0
if key not in self.mapping[column]:
self.mapping[column][key] = self.mapping_count[column]
self.mapping_count[column] += 1

def interpret_list(self, listy):
if len(listy) != len(self.column_names):
print "Error - unexpected number of columns in the line: %d" % len(listy)
return None
for index, value in enumerate(listy):
if index == 12:
try:
value = int(value.strip().strip('n'))
except:
break
self.write_to_file(listy)
self.write_libsvm_file(listy)

def write_to_file(self, listy):
string_to_write = ','.join([str(self.mapping[self.column_names[index]][entry]) for index, entry in enumerate(listy) if index != 12])
string_to_write += ",%sn" % str(listy[-1])
self.writer_file.write(string_to_write)

def write_libsvm_file(self, listy):
string_to_write = ' '.join([str(index + 1) + ":" + str(self.mapping[self.column_names[index]][entry]) for index, entry in enumerate(listy) if index != 12])
string_to_write = str(listy[-1]) + " " + string_to_write + "n"
self.svm_file.write(string_to_write)

mapping = mapping_dictionary('nice_weka_output.txt', 'nice_libsvm_output.txt')
with open('pp_for_weka.txt', 'rb') as f:
while True:
line = f.next()
mapping.interpret_list(line.strip().strip('n').split(','))
mapping.shut_file()

```

A reasonable amount of Python happening there – if you were worrying about it, I really wouldn’t. All we’re doing is replacing every field (apart from price) with an integer and outputting that in two different formats as we work our way through the file. If I was being thorough I’d remove the hardcoded list at the top, require a header row, take the filenames as command line arguments and then it’d work as a general tool for formatting CSVs as libsvm and arff files. Actually, that doesn’t sound like a bad idea at all.

Now you’ve got your input files, it’s child’s play to run our algorithms to create models that we can pass new data to. I’ll create a separate post detailing the output of the above two classifiers but it looks like I’ll have to leave them running through the night!

Until then.

Hi All,

After work over one year (approximately 800,000 data points) it’s time to start ramping things up a bit and we’ll be heading into the realms of big data. There are a bunch of definitions of big data floating around but I’m going to go with the one I like the most: “Big data is the point at which the size of your data becomes part of the problem.” That of course depends entirely on the hardware you’ve got and given my current hardware situation, as soon as we start looking at multiple years of house sale data, we’re firmly in ‘big data’ territory. In fact, the size of the data isn’t really that large – only ~20 million rows with a total size of 3.2GB. I could upgrade my RAM for a small amount and we’d be fine to deal with this in the usual ways. However, I’m cheap, the shops are closed and I’d rather think of a less hardware bound way of doing it.

Unfortunately for me, while working at home on my own projects, I don’t have access to any cluster of computers so won’t be throwing my data into HDFS and getting my mappers and reducers polished. At work we’re soon to be adding 40TB of storage and 10 more nodes to our existing cluster and that set up would chew through this sort of thing like nobody’s business. Given an i5 and 6GB of RAM, we’re going to have to be a tad more creative ðŸ˜‰ It’s also worth noting that, for 3.2GB of data, distributing the problem isn’t likely the most efficient solution. Just because I’m talking about big data, there’s no reason we can’t use R, Python, SQL or anything else that will solve the problem.

I liked what we were able to do in R and don’t feel I’ve had a sufficiently deep dive into the statistical functions it offers and so will be trying to use it again. Fitting all the data in memory isn’t going to work but R offers a number of packages for working with data that exceeds the RAM you’d be willing to give it; I’ll be using bigmemory.

First things first, all the data is split across text files. If we want to have them read into a single table I think we’ll be best creating that text file in bash and adding a new column with the year – bash to the rescue:

``` for file in pp-*.csv do echo "Processing \$file..."; year=\${file:3:4}; awk -F, -v year=\$year '{print year","\$0}' \$file >> pp-all.csv; done ```

At the risk of stating the obvious, that will loop over all the relevant files, extract the year from the title of the file, prepend it to the start of the line and then append the line to pp-all.

O.K – originally I was going to use bigmemory but have found it unsuitable for the task (namely, I couldn’t even get it to load the file from disk). I’m sure that’s more a damning indictment of my own ability rather than bigmemory but I’ll proceed in a different way for now.

``` cut -d ',' -f3,4 pp-all.csv | tr -d '"' | cut -d '-' -f1,2 | awk -F, '{ total_array[\$2] += \$1; count_array[\$2] += 1} END {for (region in total_array) print region"t"total_array[region]/count_array[region]}' | tr '-' ' ' | sort -nk1 -nk2 | sed -e 's/s+/-/' > price_summary_by_month.txt ```

OK – it’s a bit of a nasty one-liner but it basically:
1.) Gets the date and house sale price
2.) Formats the output
3.) Create an array of the count and sum of sale prices per month
4.) Sort by date and format
5.) Output to price_summary_by_month.txt

Now I’ve got myself a lovely text file – I decided to have a dig into Python and especially, Matplotlib. As it’s not really relevant to the overall direction of this analysis I won’t include the code here but it’s available on my Github. As a summary, this script looks at the percentage of total value of the year’s sales made in each month. A vector is built for each month and then the Gaussian density is calculated for that month. This is plotted versus the surrounding months. I’m aware that’s likely a poor explanation but if you’re interested, have a look at the code and feel free to drop me a comment. The output is something like this:

% of value of house sales since 1995, by month.

Anyway, back to the main thread of the investigation. I’ve got a relatively small time-series and I’m investigating the periodicity of the data – I’m right back in native R territory.

```priceSummary <- read.table('price_summary_by_month.txt', header=F, col.names=c("date", "price"))
pricetimeseries <- ts(priceSummary\$price, frequency=12, start=c(1995,1))
plot(pricetimeseries, ylab="Average sale price (Â£)", main="UK House Prices")
```

Giving us:

Average UK House Price (by month)

That’s pretty much as you’d expect – generally rising prices with a bit of a wobble around the housing crash. Disappointingly, there doesn’t seem to be that much evidence of seasonality being a factor. However, R provides us the tools to see whether or not that’s the case:

```library(ggplot2)
library(scales)
pricetimeseriescomponents <- decompose(pricetimeseries)
plot(pricetimeseriescomponents)
seasonality <- data.frame(pricetimeseriescomponents\$seasonal[c(1:12)])
seasonality\$Month <- factor(month.name, levels=month.name)
colnames(seasonality) <- c("Price", "Month")
ggplot(seasonality, aes(x=Month, y=Price)) + geom_bar(stat="identity") + ggtitle("Seasonal variations in house sale price") + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_continuous(name="Price difference (Â£)", labels=comma)
orderedSeasonality <- transform(seasonality, Month = reorder(Month, Price))
ggplot(orderedSeasonality, aes(x=Month, y=Price)) + geom_bar(stat="identity") + ggtitle("Seasonal variations in house sale price") + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_continuous(name="Price difference (Â£)", labels=comma)
```

This gives us our first hint at some real cyclicity to the housing market:

Decomposition of UK House Prices

Seasonality of House Prices – the amount above/below average you’ll pay for a house based on the month

and a perhaps more helpful, sorted version:

Ordered seasonality of House Prices – the amount above/below average you’ll pay for a house based on the month

Well well well. Time to throw the cowboy hat into the air, unholster our pistol and start firing wildly into the sky? Or at least quietly inform those who I know who are looking to buy/sell a house that they can likely save/make themselves an extra Â£10,000 or so if only they’re willing to wait until summer before selling?

Unfortunately, I’m not confident enough to say that just yet. I’ve already fairly conclusively shown that there are a bunch of factors that affect the sale price of a house. Now while I’m happy to say that the month of the year is significant in determining the sale price, I don’t really know why. It could be that people are more inclined to buy houses in the south during summer. As these houses are more expensive, we see the average house price rise in the summer months. It could be that people are less inclined to buy terraced houses during the summer (heating concerns?) and, as terraced houses are generally cheaper, the average sale price is inflated.

If either of those statements, or indeed, any other like it that I’ve not thought of are true it could lead to me giving bad advice in specific cases. To be able to report anything especially useful (and ultimately, actionable) we’ll need to look a bit closer at the causes of seasonality. My next post will hopefully address these issues and will try to determine whether or not the average UK house buyer is better off waiting until winter before buying.