Data Science, amongst other things.

Stock Prices and Python – Pandas to the rescue

Hi all,

Today I fancy a bit of a play around with stock prices – I recently took the plunge into the world of stocks & shares and have been getting more and more interested in the financial world as I’ve become more and more exposed to it through savings. I’m a bit sceptical as to being able to find anything ‘new’ or any real arbitrage opportunities – mostly because there’s a billion (trillion?) dollar industry built off of the back of stock trading. It attracts some really smart people with some really powerful gear and a whole lot of money to invest. However, there’s no harm in having a look around and seeing what interesting things we can do with the data.

R is well good but I want a bit more freedom with this little project and I’m missing Python. I find that with R, I spend a lot of my time getting data into the right format to be able to use the tools that already exist. With Python, if I’m silly enough to decide on a strange data structure then I can. I shouldn’t, but I can.

Ordinarily I like the Python & SQL combination and tend not to rely too heavily on the ‘Python analysis stack’ of Pandas/iPython/Scipy/Matplotlib, only pulling things in when necessary. I was going to follow the same pure Python & SQL route for this project until I found an awesome little feature of Pandas – in-built Google and Yahoo stock data integration. It’s not that much work to build this sort of thing yourself but why reinvent the wheel? 🙂

So – I guess we should start with some sort of question, shall we see if we can plot some of the big tech players (AMZ, GOOGL, FB e.t.c.) against the whole tech sector. Given the recent headlines in that area it should be interesting and at least give us some ideas about future work.

As a lazy person, I’m not necessarily inclined to manually go through a list of stock symbols and decide if they’re tech or even go through a list of tech stock and type them into a text file. A quick google shows me that there’s nothing (that I could find) in the way of a regularly updated text file of what I’m after but it shouldn’t be too difficult to coax Python into doing this for me – let’s start off with the NASDAQ site.

If you have a look, you’ll see it’s fairly regular in its URL structure and the URLs are easily craftable – there’s an annoying amount of pagination but you can’t have everything. Actually, hold the phone. You can download the list as a CSV – winner winner chicken dinner.

Downloading all the company information we get a CSV with the following headers:

Symbol Name LastSale MarketCap ADR TSO IPOyear Sector Industry Summary Quote

All I’m really after for now is the sector and symbol – market cap will prove useful but basically I think I can agree we’ve hit the jackpot!

Time for the Python:

from pandas.io.data import DataReader
import pandas as pd
from datetime import datetime
import numpy as np
company_information = pd.read_csv('allcompany.csv')
mega_frame = [DataReader(company.strip(),  "yahoo", datetime(2014,1,1), datetime.now().date()) for company in company_information[company_information.Sector == 'Technology']['Symbol']]
symbol_list = [symbol for symbol in company_information[company_information.Sector == 'Technology']['Symbol']]

At this point we’ve got all the data since the start of the year on every tech stock listed on the NASDAQ, NYSE and AMEX and it’s taken us 6 lines. Note that the population of mega_frame takes a fairly long time. In retrospect, we should have filtered further

20 minutes later and I’m regretting my decision to get all of them.

Cancelled it and switched to the first 50 – will just prove concept first.

Right, now I’ve got a list containing data frames – one for each of the first 50 tech stocks. Let’s throw in a percentage change column and make sure all our data frames are of the same length to avoid problems at a later date:

mega_frame = [stock for stock in mega_frame if len(stock) == 79]
symbol_list = [symbol_list[index] for index in len(symbol_list) if len(mega_frame[i]) == 79]
for stock_index in range(len(mega_frame)):
    mega_frame[stock_index]['perc_change'] = 100*((mega_frame[stock_index]['Close'] - mega_frame[stock_index]['Open'])/mega_frame[stock_index]['Open'])
## The modal value is 79 hence 79
percentage_change_list = [stock['perc_change'] for stock in mega_frame]

Now we’re going to create a correlation matrix out of those lists to see the most strongly correlated tech stocks over that time period (and in our subset). I’m also going to look at the negatively correlated stocks – you wouldn’t expect to see a strong negative correlation for two stocks in the same sector and region but it won’t hurt to look:

correlation_matrix = np.corrcoef(percentage_change_list)
## Correlation with yourself is no big deal
for i in range(np.shape(correlation_matrix)[0]):
    for j in range(np.shape(correlation_matrix)[1]):
        if i == j:
             correlation_matrix[i][j] = 0
maximum_indices = np.argmax(correlation_matrix, axis=1)
minimum_indices = np.argmin(correlation_matrix, axis=1)
for index in range(np.shape(correlation_matrix)[0]):
    print "Stock %s is best correlated with stock %s: %.3g" % (my_list[index], my_list[maximum_indices[index]], correlation_matrix[index][maximum_indices[index]])
    print "Stock %s is worst correlated with stock %s: %.3g" % (my_list[index], my_list[minimum_indices[index]], correlation_matrix[index][minimum_indices[index]])

So there we have it (I’ll leave it to run over all the tech stocks overnight) – a fairly quick and simple way to find the most correlated tech stocks in America over a given time period.

Now this isn’t a particularly great way of doing this, as I said earlier, there are people who dedicate their lives to this. If the fancy takes me, I’ll have a look at a few of these (maximal spanning trees, stability of eigenvectors of correlation matrices e.t.c) and see what improvements we can make to our very simple model.

1 Comment

  1. Jayesh

    Thanks a lot for the nice article. I am trying to run the code in Pyhton 3.4 and getting following error. Do you have any idea about this? Please advise
    First Error:
    *************************************
    >>>
    Traceback (most recent call last):
    File “C:UsersjayeshDesktoppthBig_data_Stock_analysisdownload.py”, line 12, in
    symbol_list = [symbol_list[index] for index in len(symbol_list) ]#if len(mega_frame[i]) == 15]
    TypeError: ‘int’ object is not iterable
    >>> ================================ RESTART ================================
    >>>
    *************************************
    Second Error:
    *************************************
    Traceback (most recent call last):
    File “C:UsersjayeshDesktoppthBig_data_Stock_analysisdownload.py”, line 18, in
    correlation_matrix = np.corrcoef(percentage_change_list)
    File “C:Python34libsite-packagesnumpylibfunction_base.py”, line 1951, in corrcoef
    c = cov(x, y, rowvar, bias, ddof)
    File “C:Python34libsite-packagesnumpylibfunction_base.py”, line 1901, in cov
    return (dot(X, X.T.conj()) / fact).squeeze()
    File “C:Python34libsite-packagespandascoregeneric.py”, line 1947, in __getattr__
    (type(self).__name__, name))
    AttributeError: ‘Series’ object has no attribute ‘conjugate’

Leave a Reply to Jayesh Cancel reply

Your email address will not be published. Required fields are marked *

© 2024 DogDogFish

Theme by Anders NorenUp ↑