DogDogFish

Data Science, amongst other things.

Tag: Data Analysis

Generating B2B sales data in Python

‘Sup

Look, I’m sorry. Yet again, I’ve not written any blog posts for ages. Let’s all get over it and move on to something more important. Sales. Let’s imagine you’re an organisation selling B2B. You use Salesforce (or any other platform). You’ve got plenty of opportunities and a history of those opportunities. You’ve gone and built a sales pipeline.

Good work. That’s not an easy thing to do.

Now you want to use that pipeline to get better at sales. You want to use the data you’ve got to help forecast what you’ll do in the future. You want to know the value of what you’ve already got in the pipeline. You want to know what the most valuable activities you perform are. I’m not going to be able to fit all of that into one post so I’ll break things up into parts and (I’ve said this before only to underdeliver) FINISH THE SERIES.

However, for part 1 I’m actually only going to focus on generating some dummy data to play with. “What!? That’s none of the things you said you’d do!” No. It’s not. However, if you’re able to find me a B2B company with a small number of sales who are willing to publicly share all their data then fair play to you. Lacking that I’m going to have to create a dummy set of data and make it halfway believable. In doing this I’ve made a few assumptions (that I’m later going to try to show). It’s a bit circular but don’t be that guy. What I’m doing is broadly legit and if you look at the data and don’t think it’s reasonable then I’m providing the code so you can change whichever bit you find egregious. Even better, just use your actual company’s sales data (assuming you’re lucky enough to have it).

I’ll be building a dataframe that resembles a Salesforce pipeline – it’s going to have the following rows:

Stage – this is the ‘Salesforce/Hubspot/<don’t care>’ stage in the pipeline. Measures how far along an opportunity is.
Name – got to keep track of the opportunities using something
Value – how much money are we going to make from this opportunity. Daily, Monthly, Annually. Doesn’t matter.
Days – this is the date the opportunity entered the stage given. Going to be important later for time-dependence stuff.

So, let’s begin (all code also available here)

import numpy as np
import random
import matplotlib
from matplotlib import pyplot as plt
import datetime
from datetime import datetime as dt
from scipy import stats
import pandas as pd

def weighted_pick(weights, n_picks):
    t = np.cumsum(weights)
    s = np.sum(weights)
    return np.searchsorted(t, np.random.rand(n_picks)*s)

pre_stages = [('Contact initiated', 0.8, 10), ('Meeting booked', 0.6, 20), ('Trial booked', 0.4, 15), ('Proposal sent', 0.3, 25), ('Contract sent', 0.2, 10)]

closed_stages = ['Closed Won', 'Closed Lost']

success_stages = ['Closed Won']

Here I’m declaring a few things that are going to be useful to me later. I want all of the stages in the pipeline that I care about, the closed stages and the success stages. The code is probably a bit brittle regarding the random addition of closed and success stages but is fine for new ‘pre_stages’. The parameters are the probability that the opportunity will fall out of this stage (rather than move on successfully) and something else that we’ll talk about later.

WORDS = open('/usr/share/dict/words', 'rb').read().splitlines()

NUM_POINTS = 400
AVERAGE_SALE_PRICE = 3500
SD_SALE_PRICE = 1000

sales_opportunities = [(entry.title(), np.random.normal(AVERAGE_SALE_PRICE, SD_SALE_PRICE)) for entry in np.random.choice(WORDS, NUM_POINTS, replace=False)]

Here I’m generating a list of ‘company names’, picking words randomly from a dictionary. In all honesty, just looking through the list of company names is pretty fun in itself. I’m also assuming that the revenue I make from my product is a normal distribution with mean and standard deviation given as ‘AVERAGE_SALE_PRICE’ and ‘SD_SALE_PRICE’. Not rocket science. But it is an assumption I’m making – let’s chalk it down. First assumption: revenue/client is normally distributed. Then we build a list of sales opportunities and their value.

start_date = datetime.datetime.now() - datetime.timedelta(days = 365*2)
days_range = range(365*2)
y = [float(entry)/365. for entry in days_range]

indices = weighted_pick(np.exp(y), NUM_POINTS)

Second assumption I’m going to make in generating this data – you’re working for the right kind of start-up/business. Basically, the number of opportunities created are going to broadly follow an exponential distribution. That is, you specify how many opportunities enter the pipeline with ‘NUM_POINTS’ and we’re going to distribute those according to an exponential distribution. I’m saying that the company starts 2 years ago – again, change if you don’t like it.

sales_data = [[pre_stages[0][0], name_value_pair[0], name_value_pair[1], start_date + datetime.timedelta(days = index)] for name_value_pair, index in zip(sales_opportunities, indices)]

remaining_opportunities_frame = pd.DataFrame(sales_data)
remaining_opportunities_frame.columns = ['Stage', 'Name', 'Value', 'Days']

sales_data_frame = pd.DataFrame(sales_data)
sales_data_frame.columns = ['Stage', 'Name', 'Value', 'Days']

finished_list = set([])

OK. Now I’ve got the first set of entries that’ll make up my final dataframe – it’s all of the opportunities with the value (generated from a normal distribution) and the time the opportunity entered the pipeline (generated via an exponential distribution). I’m going to create a few things for later, namely a dataframe containing all of the live opportunities and our final dataframe containing all the rows we’re going to care about.

for stage_index, stage in enumerate(pre_stages[1:]):

    next_stage = pd.DataFrame([(sales_opp[1], index, np.argmax(entry)) for sales_opp in sales_data for index, entry in enumerate(np.random.multinomial(1, [0.99, (1. - stage[1])/100., stage[1]/100.0], (datetime.datetime.now() - sales_opp[3]).days)) if entry[0] != 1 and sales_opp[1] not in finished_list])

The above line is where it’s all at. Let me explain slowly and then again, even slower. My intuition is this – I think that the probability that an opportunity converts (moves from its current stage to the next stage) is proportional to the negative exponential of the time spent in that stage. Let’s be clearer. I’m going to make the third assumption – that the probability of moving to the next stage broadly follows a negative exponential. What’s more, I think that each stage will have its own characteristic drop off rate (or half-life, for those of you thinking this looks mightily like radioactive decay). You know how before I said I’d added a parameter to ‘pre_stages’ and I’d explain it. That’s what ‘pre_stages[x][2]’ is. So, for a given stage in the sales pipeline, for each opportunity left in the previous stage, for every day between when the opportunity entered the stage and now I run the multinomial line. The multinomial line is going to return a binary array of three elements where exactly one of the elements is filled. The first element will be filled in 99% of cases – I’ve chosen to set this and if you don’t like it then change it to something else. It means that, for every day between the opportunity entering the state and today there’s a 99% the opportunity will still be in that state at the end of the day. If the second element is filled then that means that the opportunity succeeded on that particular day (with probability given by the stage parameter). Finally, if the third element is filled then the opportunity died on that particular day. ‘Index’ gives us the number of days that’ve happened since the opportunity entered the stage and the argmax gives us whether we succeeded or failed (you’ll see we’re ignoring days when we neither succeeded or failed).

    next_stage.columns = ['Name', 'Days', 'Status']

    meh = next_stage.ix[next_stage.groupby('Name').Days.idxmin()]
    tempy_frame = meh.merge(remaining_opportunities_frame[['Name', 'Value', 'Days']], how='inner', on='Name')

    tempy_frame['new_date'] = tempy_frame.apply(lambda x: x.Days_y + datetime.timedelta(days = x.Days_x), axis=1)

    tempy_frame = tempy_frame[['Name', 'Value', 'new_date', 'Status']]
    tempy_frame.columns = ['Name', 'Value', 'Days', 'Status']

    success_frame = tempy_frame[tempy_frame.Status == 1]
    success_frame = success_frame.drop('Status', 1)
    success_frame.insert(0, 'Stage', pre_stages[stage_index + 1][0] if stage_index + 1 < len(pre_stages) - 1 else success_stages[0])

    failure_frame = tempy_frame[tempy_frame.Status == 2]
    failure_frame = failure_frame.drop('Status', 1)
    failure_frame.insert(0, 'Stage', closed_stages[1])

    sales_data_frame = sales_data_frame.append(success_frame).append(failure_frame)

That was a crazy line – but it contained most of the interesting stuff we do. From here on in we grab the first of the days that the opportunity moved (we actually kept all of the days in the above line but we’re only allowing each opportunity to move out of each stage once!), add the number of days to the original date we entered the stage to find the day we move into the next stage and then create the rows that we need.

    finished_frame = sales_data_frame.groupby('Name').apply(lambda x: x.Stage.isin(closed_stages).any())
    finished_list = set(finished_list).union(set(finished_frame[finished_frame == True].index.values))
    remaining_opportunities = remaining_opportunities_frame[~remaining_opportunities_frame.Name.isin(finished_list)]

Finally, there’s a bit of tidying up to make sure that we don’t calculate anything for any of the opportunities that have died

dates = matplotlib.dates.date2num(sales_data_frame[sales_data_frame.Stage == success_stages[0]].sort('Days').Days.astype(dt))
revenue = sales_data_frame[sales_data_frame.Stage == success_stages[0]].sort('Days').Value.cumsum().values

plt.plot_date(dates, revenue, 'b-')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.title('Company revenue over time')
plt.show()

sales_data_frame.to_csv('generated_data.csv', index=False)

Quite a lot of work, really, just to generate some ‘likely looking’ sales data. Again, if you’ve got your own then use it! However, up till now I’ve just asserted that it’s likely looking. If you play around with it you can actually see some pretty interesting stuff. Firstly, with lots and lots of data point (N = 8000) you see that the company revenue growth looks very exponential:

8000points

However, it’s unlikely that you’ve got 8000 B2B transactions in your sales pipeline (if you do, kudos!). Let’s examine the situation where you’ve got 150:

150points

And a once more with 150:

150points3

I think it’s interesting that, even though we’ve literally built this whole pipeline using exponential growth – we still look flat in a lot of places. Hopefully that might provide some solace if you’re struggling with sales and think you’re not hitting your exponential growth. Play around with the parameters and you can see what sort of effect increasing your conversion at various stages has on your overall revenue etc. Or just read the company names – they’re also pretty good.

Right, I’m counting that as broadly done. We’ve got sales data that nobody will mind me analysing in a public forum. Stay tuned/subscribe/email me to keep in touch for part 2. We’ll imagine that we’ve started with this data and we’ll try to assign a total value to our pipeline, and maybe even get onto predicting how many opportunities will progress in the next N days.

UK House Sales – When should estate agents go on holiday?

Hi all,

If you’ve been following all of these blog posts then you’re in a minority of one. However, you’ll also know that we’ve taken all of the UK house sales in the last 18 years or so and have found a bunch of things out. We’ve seen the spread of average house price by region, we’ve seen seasonality in the average house price, we’ve seen the impact the housing crash had on average house price (not that much) and on number of houses sold (an awful lot). Finally, we investigated seasonality of number of house sales by region and in doing so, found that London suffered the housing crash worse than other areas but almost immediately picked itself up and is in fact (relative to the rest of the country) better off than it was before the crash.

I had a couple of ideas for investigations in my last post – one of which was finding the most sold house in the UK and seeing if there was a correlation between the times a house has sold and its price. I’ll briefly tackle this because it’s one line of bash – working with our file pp-all.csv (all of our data in one big text file) the following command will give us the top 100 most sold properties in the UK in the last 18 years:


cut -d ',' -f5,9-15 pp-all.csv | tr -d '"' | tr ',' ' ' | sort | uniq -c | sort -k1 -n -r | head -n 100

The top 10 are as follows (with the format: # of sales | postcode | address ) :

24 L17 3BP 48 FLAT 5-19 ULLET ROAD LIVERPOOL LIVERPOOL MERSEYSIDE
19 W8 6JE 126 FLAT 1-10 LEXHAM GARDENS LONDON KENSINGTON AND CHELSEA GREATER LONDON
19 LS2 7LY 31 EASTGATE LEEDS LEEDS LEEDS WEST YORKSHIRE
16 SE1 3FF 41 FLAT 67 MALTBY STREET LONDON SOUTHWARK GREATER LONDON
16 PL2 1RR 48 HADDINGTON ROAD PLYMOUTH CITY OF PLYMOUTH CITY OF PLYMOUTH
16 BN43 5AR SHOREHAM COURT 3-10 THE CLOSE SHOREHAM-BY-SEA ADUR WEST SUSSEX
15 IP1 3PW 54 ANGLESEA ROAD IPSWICH IPSWICH IPSWICH SUFFOLK
14 WA16 6JD TATTON LODGE 1-6 MOORSIDE KNUTSFORD CHESHIRE EAST CHESHIRE EAST
14 M3 6DE FRESH 138 APARTMENT 1008 CHAPEL STREET SALFORD SALFORD SALFORD
14 M19 2HF 35 CENTRAL AVENUE MANCHESTER MANCHESTER GREATER MANCHESTER

I decided against pursuing this investigation as a quick hunt on Zoopla tells me that flats 5-19 Ullet Road were sold individually and so we’re just seeing the results of grouping and nothing overly interesting. I guess there are certain houses in there of interest – why has 48 Haddington Road been sold 16 times since 1995?

Sale prices of 48 Haddington Road since 1995

Sale prices of 48 Haddington Road since 1995

I don’t really know and I’m not going to investigate – I think I’d rather look at the age old question:

“When should estate agents take holidays?”

Of course, this is every bit as much a question of where national estate agents should have their staff distributed throughout the year, which region removal companies should target throughout the year, where travelling housing surveyors are most likely to pick up business e.t.c.

In the last post we were able to create time series of the percentage of UK house sales a region was responsible for. Now we’re going to create a whole bunch of time series (one for each region) and perform clustering on them to see when each region peaks and troughs. Note we’re looking at a percentage of total sales here and not absolute numbers – I’m also not looking at the saturation of the market or anything like that.

I’m going to use Markov Clustering in this example – don’t worry too much about this (if you don’t want to), I’ll do a post on Markov Clustering at a later point. For now, all you need to know is that it’ll cluster our data in a sensible way.

So, down to business:

## Get the data I need in a small(er) table
library(reshape2)
library(rEMM)
mini_frame <- data.frame(newFrame$Datey, newFrame$Region, newFrame$Percent)
colnames(mini_frame) <- c("Datey", "Region", "Percent")
ts_frame <- dcast(mini_frame, Datey ~ Region, sum)
## Have found I need to initialize this before I kick off
seasonal_ts_frame <- data.frame(matrix(0, nrow=12, ncol=length(colnames(ts_frame)))
colnames(seasonal_ts_frame) <- colnames(ts_frame)
row.names(seasonal_ts_frame) <- factor(month.name, levels=month.name)
for (i in 1:ncol(ts_frame)) {
    decomposed_ts <- decompose(ts(ts_frame[,i], frequency=12, start=c(1995,1)))$seasonal[c(1:12)]
    seasonal_ts_frame[[colnames(ts_frame)[i]]] <- decomposed_ts
}
cor(seasonal_ts_frame)
emm <- EMM(threshold=0.2, measure="eJaccard")
build(emm, cor(seasonal_ts_frame))
cluster_centres <- data.frame(cluster_centers(emm))
cluster_frame <- data.frame(lapply(cluster_centres, which.max))
row.names(cluster_frame) <- c("Cluster")
cluster_frame <- data.frame(t(cluster_frame))
colnames(cluster_frame) <- c("Region", "Cluster")
cluster_one <- subset(cluster_frame, Cluster==1)
cluster_two <- subset(cluster_frame, Cluster==2)
## A list of all the cluster one regions
seasonal_ts_frame[,(names(seasonal_ts_frame) %in% row.names(cluster_one))]
## Now the biggie - let's see the points on a map
library(maps)
library(mapdata)
library(RCurl)
library(RJSONIO)

## A couple of functions allowing us to dynamically get the longitude and latitude of regions
construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

gGeoCode <- function(address,verbose=FALSE) {
  if(verbose) cat(address,"n")
  u <- construct.geocode.url(address)
  doc <- getURL(u)
  x <- fromJSON(doc,simplify = FALSE)
  if(x$status=="OK") {
    lat <- x$results[[1]]$geometry$location$lat
    lng <- x$results[[1]]$geometry$location$lng
    return(c(lat, lng))
  } else {
    return(c(NA,NA))
  }
}

## Plot a UK map
map('worldHires', c('UK', 'Ireland', 'Isle of Man','Isle of Wight'), xlim=c(-8,2), ylim=c(51.8,54.2))

longitude_and_latitude <- data.frame(sapply(paste(row.names(cluster_one), ", UK", sep=''), function(x) gGeoCode(x)))
row.names(longitude_and_latitude) <- c("Latitude", "Longitude")
longitude_and_latitude <- data.frame(t(longitude_and_latitude))
points(longitude_and_latitude$Longitude, longitude_and_latitude$Latitude, col=1, pch=4)

O.K, so there’s an awful lot of R code in there and all to produce a slightly underwhelming graph. Where does the great divide come in seasonality of house sales? Why, in South Wales and North East England of course:

Seasonality Cluster

Regions that break the national housing seasonality pattern

Semi-ignoring our ability to get accurate latitude and longitude using what was at best, a wildly optimistic attempt at doing so, we have some fairly believable (if confusing clusters). The bulk of the country follows the trend set by London:

Seasonal Variations in London's Percent of the UK Housing Market (by number sold)

Seasonal Variations in London’s Percent of the UK Housing Market (by number sold)

with lots of houses being bought in the summer and many fewer being bought in the winter. However, 6 regions in South Wales (including all of Glamorgan), 2 regions in the North East of England and Avon all follow this trend:

Seasonal Variations in Humberside's Percent of the UK Housing Market (by number sold)

Seasonal Variations in Humberside’s Percent of the UK Housing Market (by number sold)

This struck me as very strange indeed and so I looked at the original data and what should I discover? That almost all of the above analysis is wrong. If only we’d have performed the following query the folly of all that I’ve done would have become clear:

colSums(ts_frame==0) > 200

Every single one of the regions that didn’t follow our pattern had substantial missing data. One with more knowledge of British geography may have been able to spot that those counties had ceased to exist in 1995/6. The reason why I’ve left all that analysis in, aside from the fact that there are a few useful bits of code in there (plotting the regions on a UK map may well be helpful to somebody), is to show that it’s really really important to check your data when you’ve got an unexpected result. It’s also important to check it whatever the result, but in data analysis, if something seems dodgy there’s a good chance it is.

When I strip out all regions with missing data, we in fact see that all of the regions follows the same pattern as Greater London. Bugger.

Going to draw this one to a close – what have we discovered? Well, we now know that every single region in the UK follows the same seasonality pattern when it comes to house sales: lots more in summer than winter. We also know that the average house price follows the same trend. I’ve not shown that the regionality isn’t a factor in the increasing average house price (you could imagine the scenario where the more expensive areas see a greater surge in house sales in the summer than the less expensive areas). I’m not sure what I’m going to work on next – I’m getting a bit sick of house prices.

It’ll likely either be:
1.) Regional variations in average house price.
2.) Which regions see the greatest increase in number of house sales in summer – clustering as before.
3.) Seasonality of any other variable (type of house, new versus old, freehold versus leasehold)
4.) Build a predictive model to calculate something specific (number of old detached houses sold in Derbyshire every month for the next year).
5.) Identify towns with the fastest growing (and falling) average house price over the last x years. Try to use this to predict which areas will see similar areas of growth/decline in the future.
6.) Finding correlated stock opening/closing prices over historical data and using this to make £££££. Obviously that one is a bit different but does involve ££££.

UK House Sales – More Seasonality in Time Series with R

So the average sale price of houses in the UK is seasonal. Does that mean it’s sensible to advise house buyers to only purchase in the winter months? Let’s try to see.

I’m going to have a look and see if the data we have implies that the change in average sale price of a house with the month is actually just a function of some other monthly variation. I don’t really know how to go about doing this but it’s probably best to not let things like that stop me – I’m thinking the first port of call is likely calculate the correlation between the month each of the other factors (excluding price). If there’s a decent correlation (positive or negative) then we might be in trouble and will have to investigate that variable with a bit more seriousness.

Again, that’d be a delightfully easy task if I could hold the entire dataset in memory. Unfortunately I’m not that lucky and so I’ll have to do a bit of aggregation before importing the data to R/Python.

So my independent variables:

1.) Region
2.) Type of house
3.) New or old house
4.) Freehold or leasehold

I’m thinking of the work we did in the last blog post in Python and that that might be the best way to proceed; to generate vectors containing the average proportion of sales due to each ‘test group’ (the factors of the independent variable in question) in each of the relevant years. Once I’ve got that, I’m initially thinking of a twelve variant paired t-test. We’ve got 12 different months – in each month we’ve got a year for which each of the other test groups have a corresponding year, hence the choice of paired t-test. However, previously when I grievously abused the normality assumption required to run a t-test I had a whole bunch of data (800,000 points) and so I was sort of O.K with it. Now, I’ve got 18. We may have to investigate other options – Kruskal-Wallis being at the forefront of those. Anyway – let’s worry about that when we have to.

First things first, let’s get this data into a format we can load into memory:

awk -F, '{print $4"-"$(NF-1)}' pp-all.csv | cut -d '-' -f1,2,4 | tr -d '"' | tr '-' ' ' | sed -e 's/s+/-/' | sed -e 's/s+/,/' | sort | uniq -c | sort -s -nk2 | sed 's/^ *//' | sed -e 's/s+/,/' | awk -F, '{if ($3 != "2014-01") print $0}' > number_of_sales_by_region.txt

Again, a horrible one-liner that I’ll have to apologise for. All it does is give me an output file with the format: Count | Month | Region – off of the back of that I can now use R:

library(plyr)
library(ggplot2)
library(scales)
myData <- read.csv('number_of_sales_by_region.txt', header=F, sep=',', col.names=c("Sales", "Datey", "Region"))
## To store as a date object we need a day - let's assume the first of the month
myData$Datey <- as.Date(paste(myData$Datey, 1, sep="-"), format="%Y-%m-%d")
## I'm not too worried about January 2014 - it makes the lengths of the 'month vectors' uneven and ruins the below graphs
myData <- myData[format(myData$Datey, "%Y") < 2014,]
byYear <- data.frame(aggregate(Sales ~ format(Datey, "%Y"), data = myData, FUN=sum))
colnames(byYear) <- c("Year", "houseSales")
ggplot(byYear, aes(x=Year, y=houseSales)) + geom_bar(stat="identity") + ggtitle("Number of UK House Sales") + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_continuous(name="Houses Sold", labels=comma)
byMonth <- data.frame(aggregate(Sales ~ format(Datey, "%m"), data = myData, FUN=sum))
colnames(byMonth) <- c("Month", "houseSales")
byMonth$Month <- factor(month.name, levels=month.name)
ggplot(byMonth, aes(x=Month, y=houseSales)) + geom_bar(stat="identity") + ggtitle("Number of UK House Sales") + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_continuous(name="Houses Sold", labels=comma)

Giving us what’d I’d class as a couple of very interesting graphs:

Number of UK House Sales by Year

Number of UK House Sales by Year

and:

UK House Sales by Month

Number of UK House Sales by Month

In terms of the housing crash, we saw it a bit in the average house sale price but we can see the main impact was a complete slow-down on the number of houses sold. There are potentially hints of a re-awakening in 2013 but I guess we’ll have to see how this year ends up panning out. The monthly variation is interesting and at first glance, counter-intuitive when viewed alongside the average house price data. Naively, you’d expect the average house price to be highest when fewer houses were being sold (what with number of houses being the denominator and all). I’m not too bothered in digging into the relationship between number of houses sold and average house sale price (I’ve got the feeling that it’s the sort of thing economists would concern themselves with) so won’t really be looking at that. I am however now at least a bit interested in the most sold houses in the UK – I don’t know what I’ll uncover but I’m marking it down as something to look at in the future.

Anyway, now we’ve had a first look at our data let’s see if we can track the proportion of UK house sales made by each region. There are likely a few ways to do this in R; I’ll be picking the SQL-esque way because I use SQL a lot more than I use R and so am more familiar with the ideas behind it. I’d be glad to be shown a more paradigmically R way to do it (in the comments):

myData$Year <- format(myData$Datey, "%Y")
myData <- merge(x=myData, y=byYear, by = "Year")
myData$Percent <- 100*(myData$Sales/myData$houseSales)
## I'm not very London-centric but given that they're the biggest house sellers in the UK...
londontimeseries <- ts(myData[myData$Region == 'GREATER LONDON',]$Percent, frequency=12, start=c(1995, 1))
london_decomposed <- decompose(londontimeseries)
plot(london_decomposed)
seasonality <- data.frame(london_decomposed$seasonal[c(1:12)])
colnames(seasonality) <- c("Sales", "Month")
ggplot(seasonality, aes(x=Month, y=Sales)) + geom_bar(stat="identity") + ggtitle("Seasonal variations in London's proportion of UK House Sales") + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_continuous(name="$ of London's % of total UK house sales", labels=percent)

giving:

London's proportion of UK House Sales

The percent of UK house sales that were made in Greater London between 1995 and 2014 and inferences around the overall trend and seasonality

and the bit we were after:

Seasonal Variations in London's Percent of the UK Housing Market (by number sold)

Seasonal Variations in London’s Percent of the UK Housing Market (by number sold)

Well, I don’t really know if that’s good news or bad. It’s good in the fact that we thought to check the factors behind seasonal variations in house price. It’s bad because I can no longer advise people to buy houses in the winter (I’ve checked and there’s a seasonal variation for every region I tried). In all honesty, I think the two graphs above are really interesting. I’m saying that the housing crash effected London more strongly than the rest of the country, but that the market in London bounced back within a year and is now above pre-crash levels. The size of the seasonal variations is pretty marked as well, with 20% swings either way from London’s mean value of percent of total house sales (sorry if the language seems verbose – I’m being careful to be precise).

What does this mean for our investigation into the seasonality of the average house price? Well, I’m confident that the average house price is seasonal but I’m also confident that we can’t use that to advise people when they should be selling their house (just yet).

There are a couple of pieces of analysis I’d now like to do on this data. I think it’d be really interesting to get an idea of the ‘most-sold’ house in the UK since 1995. I also think there may be surprises around the correlation between the number of times a house is sold and its selling price. However, this seasonality by region is also really interesting and I think I’d like to try to cluster regions based on the seasonality of their housing market. It’d be interesting to graph the clusters and see if the divide is North/South, City/Country or something else entirely. Additionally, the (G.C.S.E) economist in me is screaming out for the same investigation as above but with total sale price instead of number sold.

UK House Prices – Seasonality in Time Series with R

Hi All,

After work over one year (approximately 800,000 data points) it’s time to start ramping things up a bit and we’ll be heading into the realms of big data. There are a bunch of definitions of big data floating around but I’m going to go with the one I like the most: “Big data is the point at which the size of your data becomes part of the problem.” That of course depends entirely on the hardware you’ve got and given my current hardware situation, as soon as we start looking at multiple years of house sale data, we’re firmly in ‘big data’ territory. In fact, the size of the data isn’t really that large – only ~20 million rows with a total size of 3.2GB. I could upgrade my RAM for a small amount and we’d be fine to deal with this in the usual ways. However, I’m cheap, the shops are closed and I’d rather think of a less hardware bound way of doing it.

Unfortunately for me, while working at home on my own projects, I don’t have access to any cluster of computers so won’t be throwing my data into HDFS and getting my mappers and reducers polished. At work we’re soon to be adding 40TB of storage and 10 more nodes to our existing cluster and that set up would chew through this sort of thing like nobody’s business. Given an i5 and 6GB of RAM, we’re going to have to be a tad more creative 😉 It’s also worth noting that, for 3.2GB of data, distributing the problem isn’t likely the most efficient solution. Just because I’m talking about big data, there’s no reason we can’t use R, Python, SQL or anything else that will solve the problem.

I liked what we were able to do in R and don’t feel I’ve had a sufficiently deep dive into the statistical functions it offers and so will be trying to use it again. Fitting all the data in memory isn’t going to work but R offers a number of packages for working with data that exceeds the RAM you’d be willing to give it; I’ll be using bigmemory.

First things first, all the data is split across text files. If we want to have them read into a single table I think we’ll be best creating that text file in bash and adding a new column with the year – bash to the rescue:


for file in pp-*.csv
do
echo "Processing $file...";
year=${file:3:4};
awk -F, -v year=$year '{print year","$0}' $file >> pp-all.csv;
done

At the risk of stating the obvious, that will loop over all the relevant files, extract the year from the title of the file, prepend it to the start of the line and then append the line to pp-all.

O.K – originally I was going to use bigmemory but have found it unsuitable for the task (namely, I couldn’t even get it to load the file from disk). I’m sure that’s more a damning indictment of my own ability rather than bigmemory but I’ll proceed in a different way for now.


cut -d ',' -f3,4 pp-all.csv | tr -d '"' | cut -d '-' -f1,2 | awk -F, '{ total_array[$2] += $1; count_array[$2] += 1} END {for (region in total_array) print region"t"total_array[region]/count_array[region]}' | tr '-' ' ' | sort -nk1 -nk2 | sed -e 's/s+/-/' > price_summary_by_month.txt

OK – it’s a bit of a nasty one-liner but it basically:
1.) Gets the date and house sale price
2.) Formats the output
3.) Create an array of the count and sum of sale prices per month
4.) Sort by date and format
5.) Output to price_summary_by_month.txt

Now I’ve got myself a lovely text file – I decided to have a dig into Python and especially, Matplotlib. As it’s not really relevant to the overall direction of this analysis I won’t include the code here but it’s available on my Github. As a summary, this script looks at the percentage of total value of the year’s sales made in each month. A vector is built for each month and then the Gaussian density is calculated for that month. This is plotted versus the surrounding months. I’m aware that’s likely a poor explanation but if you’re interested, have a look at the code and feel free to drop me a comment. The output is something like this:

Python sample figure

% of value of house sales since 1995, by month.

Anyway, back to the main thread of the investigation. I’ve got a relatively small time-series and I’m investigating the periodicity of the data – I’m right back in native R territory.

priceSummary <- read.table('price_summary_by_month.txt', header=F, col.names=c("date", "price"))
pricetimeseries <- ts(priceSummary$price, frequency=12, start=c(1995,1))
plot(pricetimeseries, ylab="Average sale price (£)", main="UK House Prices")

Giving us:

UK House Prices

Average UK House Price (by month)

That’s pretty much as you’d expect – generally rising prices with a bit of a wobble around the housing crash. Disappointingly, there doesn’t seem to be that much evidence of seasonality being a factor. However, R provides us the tools to see whether or not that’s the case:

library(ggplot2)
library(scales)
pricetimeseriescomponents <- decompose(pricetimeseries)
plot(pricetimeseriescomponents)
seasonality <- data.frame(pricetimeseriescomponents$seasonal[c(1:12)])
seasonality$Month <- factor(month.name, levels=month.name)
colnames(seasonality) <- c("Price", "Month")
ggplot(seasonality, aes(x=Month, y=Price)) + geom_bar(stat="identity") + ggtitle("Seasonal variations in house sale price") + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_continuous(name="Price difference (£)", labels=comma)
orderedSeasonality <- transform(seasonality, Month = reorder(Month, Price))
ggplot(orderedSeasonality, aes(x=Month, y=Price)) + geom_bar(stat="identity") + ggtitle("Seasonal variations in house sale price") + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_continuous(name="Price difference (£)", labels=comma)

This gives us our first hint at some real cyclicity to the housing market:

Decomposition of UK House Prices

Decomposition of UK House Prices

 

House price seasonality

Seasonality of House Prices – the amount above/below average you’ll pay for a house based on the month

and a perhaps more helpful, sorted version:

Ordered house price seasonality

Ordered seasonality of House Prices – the amount above/below average you’ll pay for a house based on the month

Well well well. Time to throw the cowboy hat into the air, unholster our pistol and start firing wildly into the sky? Or at least quietly inform those who I know who are looking to buy/sell a house that they can likely save/make themselves an extra £10,000 or so if only they’re willing to wait until summer before selling?

Unfortunately, I’m not confident enough to say that just yet. I’ve already fairly conclusively shown that there are a bunch of factors that affect the sale price of a house. Now while I’m happy to say that the month of the year is significant in determining the sale price, I don’t really know why. It could be that people are more inclined to buy houses in the south during summer. As these houses are more expensive, we see the average house price rise in the summer months. It could be that people are less inclined to buy terraced houses during the summer (heating concerns?) and, as terraced houses are generally cheaper, the average sale price is inflated.

If either of those statements, or indeed, any other like it that I’ve not thought of are true it could lead to me giving bad advice in specific cases. To be able to report anything especially useful (and ultimately, actionable) we’ll need to look a bit closer at the causes of seasonality. My next post will hopefully address these issues and will try to determine whether or not the average UK house buyer is better off waiting until winter before buying.

Monthly House Price Variation – an adventure in R

So we’ve got our data set, we’ve had a cursory investigation and now we’re ready to see if we can find anything interesting. I’m going to proceed in a fairly methodical way and be precise in the way I do things – let’s do this scientific like.

So let’s start with a null hypothesis: “The month alone has no impact on the average selling price of houses“.

At this point, I don’t really know whether or not that’s true but it seems likely that it’s not. I can imagine that the housing market is, to some extent, cyclical. The first thing I’ll do is plot the data – there are a number of reasons why this is a good idea and I’d advise plotting your data wherever possible as a first step.

I’ll run this example investigation in R – it’s great for exploratory analysis and it allows me to produce graphics that I can share really easily.

housingData <- read.csv('pp-2013.csv', header=TRUE)
housingData$date <- as.Date(housingData$date, "%Y-%m-%d %H:%M")
# We're only interested in the month at this point
housingData$month <- strftime(housingData$date, "%m")
housingDataSummary <- data.frame(aggregate(housingData$price, by=list(housingData$month), FUN=mean)
colnames(housingDataSummary) <- c("Month", "Price")
# It's nice to have a look at the data before we perform our test on it, just to get an idea of how it looks and to check we think what we've done up to this point is reasonable.
# Let's take advantage of a very commonly used R library - ggplot2
library(ggplot2)
ggplot(housingDataSummary, aes(x=housingDataSummary$Month, y=housingDataSummary$Price, fill=housingDataSummary$Price)) + geom_bar(stat="identity", width=0.4, position=position_dodge(width=0.5)) + guides(fill=FALSE) + xlab("Month") + ylab("Average price") + ggtitle("Average GB house sale price in 2013")+ scale_y_continuous(labels=comma) + coord_fixed(ratio=0.000035)

Giving us:

Average house price by month in the UK in 2013

Average house price by month in the UK in 2013

O.K – I don’t know about you but looking at the graph I’d say it looks like there might be something to the theory that the month is important in house sales. In this year, we can see a dip in the early months and then a peak when we get to summer.

Let’s shore up the mathematics behind this – I’m going to imagine the situation where I’ve got twelve test groups (one for each month) where my values are the sold house prices. There’s a whole lot of statistical tests designed for this situation. To compare the means of these groups with each other I’d perform a one-way ANOVA (analysis of variance) – a multivariate extension of the t-test. Technically, the assumptions made in performing a one-way are independence of measurements (I’m happy that the sale price of one house is independent of the sale of another), continuity of the dependent variable (house price is continuous) and the dependent variables come from a normal distribution. A simple density plot shows us that the house prices aren’t normally distributed:

2013 house price density plot

2013 house price density plot

However, fear not. The t-test is still a good choice of test under violations of normality, especially so when there are lots of data points (we’ve got over 700,000) . As a little check, let’s also have a go at comparing the medians of the test groups. To do this we can use the Mann-Whitney U test and its multivariate brother, the Kruskal-Willis test. These are non-parametric tests (don’t require normally distributed data) and so if these say the medians of the groups are significantly different and the one-way ANOVA has shown the means of the groups are significantly different, we can be fairly confident they are!

month_aov <- aov(price ~ month, data=housingData)
summary(month_aov)
print(model.tables(month_aov, "means"), digits=2)
kruskal.test(price ~ as.factor(month), data=housingData)

Giving us:


Df Sum Sq Mean Sq F value Pr(>F)
month 11 4.463e+13 4.057e+12 46.38 <2e-16 ***
Residuals 781167 6.834e+16 8.748e+10
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Tables of means
Grand mean

246834.8

month
01 02 03 04 05 06 07 08 09 10 11 12
245200 234879 235136 242679 237509 243942 257421 256289 256988 249722 245077 246392
rep 43404 45238 54692 51025 66357 66175 73526 79119 69314 76042 80986 75301


Kruskal-Wallis rank sum test

data: price by as.factor(month)
Kruskal-Wallis chi-squared = 1642.505, df = 11, p-value < 2.2e-16

There’s an awful lot of that that we’re not interested in – the key bits for us to pick out are the p-values on the ANOVA and the Kruskal-Wallis tests – both of which are < 2e-16. We can fairly conclusively say that there are statistically significant differences between the mean and median of the average house prices by month.

However, let’s have a think what assumptions we’re making and whether we’re comfortable making them. Firstly, a previous investigation has hinted that the region the house is in makes a difference to the sale price. This could impact our data in any number of ways – it could be that each region sells their houses at different times and as average house value per region is different then the average house value per month ends up being different. Obviously, region isn’t the only concern – I singled that out as I’ve previously looked at it and know it’s a contributing factor to house price.

Additionally, we’re only looking in one year. That means that any overall change in house prices (ala the property market crash) will completely throw our results off and give us the (possibly false) impression of a cyclical housing market.

There’s an awful lot more we’ll have to consider if we’re to answer our question satisfactorily: does the month impact the selling price of a house? In the next post in this series I’ll be looking at ways of unpicking dependencies (MANOVA and MANCOVA) and will likely have to do bits of the analysis in a distributed way as there’s no way the entire dataset (since 1995) is going to fit in my RAM!

© 2017 DogDogFish

Theme by Anders NorenUp ↑