## DogDogFish

### Data Science, amongst other things.

Hi all,

Can’t promise that this post won’t be bitty – I’m trying to simultaneously run an SVM and a random forest on a bunch of particle physics data for the Kaggle competition. Check it out, it’s pretty cool. Anyway, my computer is straining under the weight of those calculations and so while that was happening I decided to have a look at stock prices using Python/Pandas again.

After chatting with co-blogger Sean, and based on my (limited, and hilariously bad) experiences of stock trading we decided it’d be interesting to identify volatile stocks that don’t seem to have greatly varying fundamental value. We’re basically looking for the position of a harmonic oscillator in a stock. I’m not graphing that – look it up yourself. The logic being, there’ll be a point at which it it’s profitable to buy a stock like this on a down and sell again when it’s back up. Of course, this requires the assumption that the stock itself isn’t having a fundamental value shift – it’s just suffering from cyclicity. I don’t really know if/how this’ll work but that’s half the fun…

Right, back to it (I’ve caught up with Game of Thrones – get it watched). I’ve thought a reasonable amount about this and have decided our first job is to look at maximizing the following quantity:

$frac{Volatility}{Change_{daily}^n Change_{weekly}}$

I might also throw in an additional concern – I’d like to be able to enter and exit the market whenever I want – I don’t see this being a big problem for me (I’m not going to be using a lot of money) but it’ll certainly be a concern for bigger players. Let’s cross that bridge if we need to.

So, to start off with, my daily volatility I’m going to define as
$frac{sum_{i={day_1}}^{Today} frac{HighPrice_i - LowPrice_i}{ClosePrice_i}}{NumberOfDays}$

Hopefully nothing earth-shattering there, just want to see how much it varies over a day. Now while I want the stock price to vary a lot, I want it to head back to where it started. A rapidly increasing/decreasing stock is going to have wildly varying days. However, it’s also going to have a large overall trend. That’s no good for the purposes of finding stocks to buy/sell on a short time period.

$Change_{daily} = sqrt{frac{sum_{i={day_1}}^{Today} (frac{ClosePrice_i - OpenPrice_i}{OpenPrice_i})^2}{NumberOfDays}}$

$Change_{weekly} = sqrt{frac{sum_{i={week_1}}^{Today} (frac{ClosePrice_i - OpenPrice_i}{OpenPrice_i})^2}{NumberOfWeeks}}$

Easy does it – the reason I’ve squared the result is basically that I don’t care whether the stock is rising or falling. I’m trying to minimize the overall long-term variations from the mean.

So, how easy is this in Python? Remarkably so. Let’s start off by plotting a scatter graph of some of the more promising stocks.

```import numpy as np
import pandas as pd
from datetime import datetime
from pylab import savefig

## A list of American Stock Symbols

volatility_measure = []
daily_change_measure = []
weekly_change_measure = []
labels = []

## Let's start out with the biggest 10 companies in my list

try:
company_frame = DataReader(company.strip(), 'yahoo', datetime(2013,1,1), datetime.now().date())
company_frame['Volatility'] = (company_frame['High'] - company_frame['Low'])/company_frame['Close']
volatility_measure.append(company_frame['Volatility'].mean())
company_frame['Daily_Change'] = company_frame['Close'].diff()
daily_change_measure.append(np.sqrt(np.mean(company_frame['Daily_Change']**2)))
## Take every 5th row
weekly_company_frame = company_frame[::5]
weekly_company_frame['Change'] = weekly_company_frame['Close'].diff()
weekly_change_measure.append(np.sqrt(np.mean(weekly_company_frame['Change']**2)))
labels.append(company.strip())
except:
print "Problem parsing %s" % company.strip()

for i in range(1,7):
change_metric = [daily * (weekly ** (1./i)) for daily, weekly in zip(daily_change_measure, weekly_change_measure)]
ax = plt.subplot(3,2,i)
plt.xlabel('Log of overall change metric')
plt.ylabel('Volatility metric')
plt.title('Weekly power %.2f' % float(1./i))
plt.scatter(change_metric, volatility_measure, c = volatility_measure, cmap=plt.get_cmap('Spectral'), label='Weekly power %.2f' % float(1./i))
for label, x, y in zip(labels, change_metric, volatility_measure):
plt.annotate(label, (x,y), xytext=(0,8), textcoords='offset points')

plt.gca().set_xscale('log')
plt.gca().legend_ = None

plt.suptitle('Daily Volatility of Stocks versus Overall Trend')
plt.tight_layout()
plt.show()
savefig('StockVolatility.png')
```

OK – it’s not especially pretty but it gives us the following picture:

The 10 biggest US stocks – their daily & weekly change versus their daily volatility

You could also make a fair point that I’ve formatted it poorly. Open it up as big as your browser will let you and you’ll be able to see it nicely. Or, just run the code and create your own picture. It’s dead easy. I promise.

So what can we infer from that picture? I’m going to go ahead and say not a huge deal. Apple & Google have made some crazy ups and downs over the last year or two (mostly ups) and hence I’ve been forced to use a log plot. Other than that, we can see a cluster of the remaining companies with GE seeming the most stable all round. One point I’d like to make now: by defining my metrics in such a way that they don’t really match to anything in reality, I’ve lost the ability to understand exactly what I’ve plotted. What I’m trying to say, is that the log of an overall change metric isn’t an intuitive quantity. Usually, it’s a good idea to pick metrics that have a fairly firm grounding in the real world unless you’ve got a really good reason not to. In my case, my reason is that all I’m trying to do here is identify stocks in the upper left most corner – I don’t care what their values are yet.

I’d also like to make the point here that for this data set, the change of power associated with the weekly metric seems to make no difference. I put it there to express the idea that we’re likely to want a different weighting on the daily and weekly variability depending on how often we want to trade the stock. As I’m hoping to trade multiple times daily, the daily variability is more important to me than the weekly variability (hence my choice of fractional powers of the weekly variable). If you’re looking at trading less regularly, change your parameters accordingly.

Now I’m going to go out on a limb and say that, when looking for daily volatility, the biggest companies in America aren’t the place to go looking. I’m sure that the algorithmic trading people are all over this kind of trade with fancy-pants C++ code designed to execute multiple trades/second. To do this at a reasonably large scale (and to overcome transaction/infrastructure costs) I’m going to say those guys will play with these big companies where a purchase of £1 million+ of shares isn’t going to be such a big deal. Playing in those markets must be the equivalent of going barracuda fishing with a thumb tack and a tie. I think we should start our search towards the lower market caps and work our way up until we’ve got a few hopefuls.

```volatility_measure = []
daily_change_measure = []
weekly_change_measure = []
labels = []

for company in company_information[company_information['MarketCap'] > 10000000].sort(['MarketCap']).head(25)['Symbol']:
try:
company_frame = DataReader(company.strip(), 'yahoo', datetime(2013,1,1), datetime.now().date())
company_frame['Volatility'] = (company_frame['High'] - company_frame['Low'])/company_frame['Close']
volatility_measure.append(company_frame['Volatility'].mean())
company_frame['Daily_Change'] = company_frame['Close'].diff()
daily_change_measure.append(np.sqrt(np.mean(company_frame['Daily_Change']**2)))
## Take every 5th row
weekly_company_frame = company_frame[::5]
weekly_company_frame['Change'] = weekly_company_frame['Close'].diff()
weekly_change_measure.append(np.sqrt(np.mean(weekly_company_frame['Change']**2)))
labels.append(company.strip())
except:
print "Problem parsing %s" % company.strip()

for i in range(1,7):
change_metric = [daily * (weekly ** (1./i)) for daily, weekly in zip(daily_change_measure, weekly_change_measure)]
ax = plt.subplot(3,2,i)
plt.xlabel('Log of overall change metric')
plt.ylabel('Volatility metric')
plt.title('Weekly power %.2f' % float(1./i))
plt.scatter(change_metric, volatility_measure, c = volatility_measure, cmap=plt.get_cmap('Spectral'), label='Weekly power %.2f' % float(1./i))
for label, x, y in zip(labels, change_metric, volatility_measure):
plt.annotate(label, (x,y), xytext=(0,8), textcoords='offset points')

plt.gca().set_xscale('log')
plt.gca().legend_ = None
plt.autoscale(tight=True)

plt.suptitle('Daily Volatility of Stocks versus Overall Trend')
plt.tight_layout()
plt.show()
savefig('SmallerCompanies.png')
```

Volatility versus overall change for American companies with Market Caps > \$10,000,000

Well bugger me. I don’t know about you but that looks pretty cool to me. Ignore all the gumph in the middle and look at the outliers – AMCO, GRVY, PRLS, DGLY and EEME. These are great examples of companies that are going to be either maximums or minimums for our given metric.

OK – I’m going to call it a night for now but just think of the possibilities open to us now! We can change our date ranges, play around with our metrics and loop through as many stocks as we can find symbols for (harder than you’d think!) until we’ve got a reasonable amount of stocks that we think are great candidates for regularly buying and selling.

Next time, I’ll finalize my list of stocks and hopefully start to gain an idea of when one of these stocks becomes a buy, and when it becomes a sell. That sounds fairly difficult actually. Ah well, that’s the fun.

Winter is coming.

Hi all,

Today I fancy a bit of a play around with stock prices – I recently took the plunge into the world of stocks & shares and have been getting more and more interested in the financial world as I’ve become more and more exposed to it through savings. I’m a bit sceptical as to being able to find anything ‘new’ or any real arbitrage opportunities – mostly because there’s a billion (trillion?) dollar industry built off of the back of stock trading. It attracts some really smart people with some really powerful gear and a whole lot of money to invest. However, there’s no harm in having a look around and seeing what interesting things we can do with the data.

R is well good but I want a bit more freedom with this little project and I’m missing Python. I find that with R, I spend a lot of my time getting data into the right format to be able to use the tools that already exist. With Python, if I’m silly enough to decide on a strange data structure then I can. I shouldn’t, but I can.

Ordinarily I like the Python & SQL combination and tend not to rely too heavily on the ‘Python analysis stack’ of Pandas/iPython/Scipy/Matplotlib, only pulling things in when necessary. I was going to follow the same pure Python & SQL route for this project until I found an awesome little feature of Pandas – in-built Google and Yahoo stock data integration. It’s not that much work to build this sort of thing yourself but why reinvent the wheel? 🙂

So – I guess we should start with some sort of question, shall we see if we can plot some of the big tech players (AMZ, GOOGL, FB e.t.c.) against the whole tech sector. Given the recent headlines in that area it should be interesting and at least give us some ideas about future work.

As a lazy person, I’m not necessarily inclined to manually go through a list of stock symbols and decide if they’re tech or even go through a list of tech stock and type them into a text file. A quick google shows me that there’s nothing (that I could find) in the way of a regularly updated text file of what I’m after but it shouldn’t be too difficult to coax Python into doing this for me – let’s start off with the NASDAQ site.

If you have a look, you’ll see it’s fairly regular in its URL structure and the URLs are easily craftable – there’s an annoying amount of pagination but you can’t have everything. Actually, hold the phone. You can download the list as a CSV – winner winner chicken dinner.

 Symbol Name LastSale MarketCap ADR TSO IPOyear Sector Industry Summary Quote

All I’m really after for now is the sector and symbol – market cap will prove useful but basically I think I can agree we’ve hit the jackpot!

Time for the Python:

```from pandas.io.data import DataReader
import pandas as pd
from datetime import datetime
import numpy as np
mega_frame = [DataReader(company.strip(),  "yahoo", datetime(2014,1,1), datetime.now().date()) for company in company_information[company_information.Sector == 'Technology']['Symbol']]
symbol_list = [symbol for symbol in company_information[company_information.Sector == 'Technology']['Symbol']]
```

At this point we’ve got all the data since the start of the year on every tech stock listed on the NASDAQ, NYSE and AMEX and it’s taken us 6 lines. Note that the population of mega_frame takes a fairly long time. In retrospect, we should have filtered further

20 minutes later and I’m regretting my decision to get all of them.

Cancelled it and switched to the first 50 – will just prove concept first.

Right, now I’ve got a list containing data frames – one for each of the first 50 tech stocks. Let’s throw in a percentage change column and make sure all our data frames are of the same length to avoid problems at a later date:

```mega_frame = [stock for stock in mega_frame if len(stock) == 79]
symbol_list = [symbol_list[index] for index in len(symbol_list) if len(mega_frame[i]) == 79]
for stock_index in range(len(mega_frame)):
mega_frame[stock_index]['perc_change'] = 100*((mega_frame[stock_index]['Close'] - mega_frame[stock_index]['Open'])/mega_frame[stock_index]['Open'])
## The modal value is 79 hence 79
percentage_change_list = [stock['perc_change'] for stock in mega_frame]
```

Now we’re going to create a correlation matrix out of those lists to see the most strongly correlated tech stocks over that time period (and in our subset). I’m also going to look at the negatively correlated stocks – you wouldn’t expect to see a strong negative correlation for two stocks in the same sector and region but it won’t hurt to look:

```correlation_matrix = np.corrcoef(percentage_change_list)
## Correlation with yourself is no big deal
for i in range(np.shape(correlation_matrix)[0]):
for j in range(np.shape(correlation_matrix)[1]):
if i == j:
correlation_matrix[i][j] = 0
maximum_indices = np.argmax(correlation_matrix, axis=1)
minimum_indices = np.argmin(correlation_matrix, axis=1)
for index in range(np.shape(correlation_matrix)[0]):
print "Stock %s is best correlated with stock %s: %.3g" % (my_list[index], my_list[maximum_indices[index]], correlation_matrix[index][maximum_indices[index]])
print "Stock %s is worst correlated with stock %s: %.3g" % (my_list[index], my_list[minimum_indices[index]], correlation_matrix[index][minimum_indices[index]])
```

So there we have it (I’ll leave it to run over all the tech stocks overnight) – a fairly quick and simple way to find the most correlated tech stocks in America over a given time period.

Now this isn’t a particularly great way of doing this, as I said earlier, there are people who dedicate their lives to this. If the fancy takes me, I’ll have a look at a few of these (maximal spanning trees, stability of eigenvectors of correlation matrices e.t.c) and see what improvements we can make to our very simple model.