Hi All,
As always, apologies for the length of time between posts – think this is a record. I was working on stuff I’m not allowed to share (work stuff, then Kaggle stuff) and got minorly derailed by Game of Thrones. Finished all the TV series, and one of the books. And Life of Pi. And I’m working my way through the Book Thief. So yeah, a tad derailed.
However, I’ve got a lazy Sunday and I saw an advert on TV that really annoyed me. It was basically along the lines of ‘did your doctor mess up? Why not sue them?’ and I think that it takes us towards where the Americans are. Which, when it comes to healthcare, is not where we want to go. Given that in the UK, if you’re being treated by a doctor it’s almost certainly an NHS doctor it seems pretty sucky that people are being encouraged to sue them. Puts the price up for everybody and makes people less likely to become doctors and whatnot. Anyway, this isn’t a political rant blog – I thought I’d have a look and see what data is available on the NHS to see if I could show the effects of an increase in litigation on the standard on medical care provided/costs (and ideally, contrast with America). In short, I couldn’t. The data that I wanted just wasn’t there. However, there was data on the drugs that the GPs for the NHS prescribes (at least between Jan and June in 2012) by practice with cost data. That seemed pretty interesting and so here we are.
I’ve been playing around with Google’s coLaboratory (check it out here) and would have loved to use this to do this particular bit of analysis. However, after playing round with it for a bit and struggling with external documents, finding my Google docs and various libraries I wanted I’ve decided to leave it for a while until coLaboratory becomes a tad more mature. Lots of promise there and with good Google Analytics API integration we could transform analytics practices at my company. Certainly one to watch.
Anyway, without that I’ll think of the questions I want to answer first and then pick my tool. Firstly, the data…
The data
As ever, data.gov.uk to the rescue – head here and download yourself a nice copy of the data:
http://data.gov.uk/dataset/gp-practice-prescribing-data
Once you’ve got a copy of all the data and the list of practices in England (I used the most recent one) we’re ready to start asking some questions of the data…
The Analysis
First off, let’s pick something simple – which drug costs the NHS most in each of the months of our test set, and overall.
Total Drug Cost
This actually seems to lend itself to the mapreduce paradigm pretty nicely – the mapper seems pretty unnecessary and as I’ve not got a cluster to hand (and this doesn’t warrant me spinning one up with AWS) I’ll just write a quick reducer in Python and use the Unix sort. For what it’s worth, I think I might write something about spinning up a quick cluster on AWS in the near future. It’s a fairly useful skill to have and given the increasing reluctance of my computer to perform the most basic of tasks, I think a fair bit of my future data analysis might have to happen in the cloud. Anyway, this is what my command will look like once I’m done:
awk 'FNR>1{print}' T201202PDP IEXT.CSV | sort -t , -k5 | python spending_reducer.py
Nothing too complicated there. I’m ignoring the first line (the awk command), then sorting the whole file based on the 4th column (the drug id) and piping the whole thing into the following reducer:
#!/usr/bin/python
import sys
current_drug = None
current_cost = 0.0
for line in sys.stdin:
authority, trust, practice, drug_code, drug_name, number_bought, ni_cost, act_cost, period = line.strip().split(',')
act_cost = float(act_cost.lstrip('0'))
if drug_name == current_drug:
current_cost += act_cost
else:
if current_drug:
print current_drug + "t" + str(current_cost)
current_drug = drug_name
current_cost = act_cost
print current_drug + "t" + str(current_cost)
Nothing too difficult there – we’re just keeping track of the drug we’re on and adding up as we go. If I had no fear for the amount of RAM I had we could’ve accomplished the same thing without the laborious sorting step using associative arrays in awk. But for the next stage – the total across all 6 months, I am very afraid (RAM wise) and so we can run the same query with a bit of wildcarding:
awk 'FNR>1{print}' T20120[1-6]P*.CSV | sort -t , -k5 | python spending_reducer.py > drugs_by_spend.txt
This runs the same calculation over every file matching that wildcard pattern (all the data between 2012/01 and 2012/06. Note that this’ll take a little time – that sort is reasonably expensive over the 4 or so gigabytes of data we’ve got. Now we’ll pull together a few graphs and for this I think we’ll use R…
my_frame <- data.frame(read.csv('drugs_by_spend.txt', header=F, sep="t", colnames=c("Drug", "Spend")))
my_frame <- my_frame[order(-my_frame$Spend),]
head(my_frame)
The top 5 drugs, by spend, in the first half of 2012 were:
- Fluticasone Propionate (Inh)
- Atorvastatin
- Enteral Nutrition
- Pregabalin
- Budesonide
Between them these cost: £738,620,789
Wow – that’s a hell of a lot. In 6 months, the actual costs of these drugs alone was more than £700million!
The total cost of drugs prescribed in that time period: £2,785,540,256
So I think we can surmise that lots of money is spent by the NHS – OK, I suppose that’s no surprise. For the non-doctors amongst us (that includes me) that list features two anti-asthmatic treatments i.e. those inhalers that I’m sure a lot of you have (also includes me). A quick Wikipedia shows that Pfizer holds the patent to at least a couple of those drugs (or at least did, Atorvastatin has expired) – it might be interesting to stick the patent holder next to these drugs. Maybe later…
Right, there’s lots that we could do here but I’m going to call it a day for now. In the future I think I’ll try and get more months of data and then start to look at evolving trends. To do that, I’ll use an AWS cluster and so will write something and using that.
Until then.