## DogDogFish

### Data Science, amongst other things.

#### Category: Mathematics

Hi all,

If you were especially upset then I’m sorry it’s been a while since I posted – I discovered Game of Thrones. In a later post I’ll chart the effect of Game of Thrones on overall productivity. I think there’ll be some unsurprising results. Anyway, I spend a reasonable amount of time on the train with my (oft abused) laptop each morning/evening; I don’t have the internet and I don’t have any textbooks so it’s basically a question of what I can work on given only the documentation on my computer and whatever I can remember about programming/maths/stuff.

I was having a think and remembered that you could estimate Pi using a Monte Carlo method and thought like that sounded like the sort of thing I should do. The logic is basically as follows:

Let’s draw a square of side length 2r and a circle centred exactly in the middle of the square with radius r. A well organised blogger would show you a diagram of this set-up, screw it, this is the code to do it and this is what it looks like:

```import matplotlib.pyplot as plt
fig = plt.figure()
circle = plt.Circle((0,0), 1)
axis.set_xlim([-1,1])
axis.set_ylim([-1,1])
axis.set_title('A Circle in a Square')
plt.show()
```

Brilliant – was it worth it? Probably not. But there you have it – with that set up we can now start the Monte Carlo bit. We’ll throw darts at that picture randomly; you’d expect the number of darts in the circle to be proportional to the area of the circle and the number of darts in the square to be proportional to the area of the square. Using that fact and the formulae for the areas of a circle and a square you can estimate Pi using the ratio of darts in the circle and in the square.

Sound good? It’s fairly easy to run this in Python and graph the output using Matplotlib. You’ll see I’ve used Object Oriented Python for this particular exercise, I don’t really know why. Especially because I had a chance to use inheritance and didn’t. Well done me. I’ve let everybody down. Anyway – this is the code I came up with and the graph below shows what I ended up with:

```#!/usr/bin/python

import numpy as np
import math
import matplotlib.pyplot as plt

"""
Calculate pi using Monte-Carlo Simulation
"""

"""
First - the maths:
A circle has area Pi*r^2
A square wholly enclosing above circle has area 4r^2
If we randomly generate points in that square we'd expect the ratio of points in the square/points in the circle to equal the area of the square divided by the circle.
By that logic n_in_sq/n_in_cir = 4/Pi and so Pi = (4 * n_in_cir)/n_in_sq
"""

class pi_calculator(object):

self.iterations = iterations

def scatter_points(self):
for _ in range(self.iterations):
self.square.increment_point_count(point_x, point_y)
self.circle.increment_point_count(point_x, point_y)

def return_pi(self):
return (4.0*self.circle.return_point_count())/self.square.return_point_count()

def calculate_accuracy(self, calc_pi):
absolute_error = math.pi - calc_pi
percent_error = 100*(math.pi - calc_pi)/math.pi
return (absolute_error, percent_error)

def return_iterations(self):
return self.iterations

class square(object):

self.point_count = 0

def in_square(self, point_x, point_y):
return (self.upper_x > point_x > self.lower_x) and (self.upper_y > point_y > self.lower_y)

def increment_point_count(self, point_x, point_y, increment = 1):
if self.in_square(point_x, point_y):
self.point_count += increment

def return_point_count(self):
return self.point_count

class circle(object):

self.point_count = 0

def in_circle(self, point_x, point_y):
return point_x**2 + point_y**2 < self.radius**2

def increment_point_count(self, point_x, point_y, increment=1):
if self.in_circle(point_x, point_y):
self.point_count += increment

def return_point_count(self):
return self.point_count

if __name__ == '__main__':
axis_values = []
pi_values = []
absolute_error_values = []
percent_error_values = []
for _ in range(1,3000,30):
pi_calc = pi_calculator(1, _)
pi_calc.scatter_points()
print "Number of iterations: %d    Accuracy: %.5f" % (pi_calc.return_iterations(), math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())))
axis_values.append(_)
pi_values.append(pi_calc.return_pi())
absolute_error_values.append(math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())))
percent_error_values.append(math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())))

improvement_per_iteration = [absolute_error_values[index] - absolute_error_values[index-1] for index, value in enumerate(absolute_error_values) if index > 0]
fig = plt.figure()
fig.suptitle('Calculating Pi - Monte Carlo Method')
ax1.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
ax1.set_xlabel('Iterations')
ax1.set_ylabel('Calculated value of Pi')
ax1.plot(pi_values, 'k')
ax1.plot([math.pi for entry in axis_values], 'r')
ax2.set_ylabel('Absolute error')
ax2.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
ax2.set_xlabel('Iterations')
ax2.plot(absolute_error_values, 'k', label="Total Error")
ax3.set_ylabel('Absolute percentage error (%)')
ax3.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
ax3.set_xlabel('Iterations')
ax3.plot(percent_error_values, 'k', label="Percent Error")
ax4.set_ylabel('Absolute improvement per iteration')
ax4.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
ax4.set_xlabel('Iterations')
ax4.plot(improvement_per_iteration, 'k', label="Absolute change")
plt.savefig('pi_calculation.png')
plt.show()

```

giving us:

I can only apologise for any dodgy code in there – in my defence, it was early in the morning. As you can see, it only takes around 100 ‘darts thrown at the board’ to start to see a reasonable value for Pi. I ran it up to about 10,000 iterations without hitting any significant calculation time. The fourth graph doesn’t really show anything interesting – I just couldn’t think of anything to put there.

That’ll do for now – I built something that’ll stream tweets on the Scottish Independence Referendum but don’t know what to do with it yet; there’ll likely be some sort of blog post. There’s a chance I’ll do some sentiment analysis but I’m not sure yet.

When you play the Game of Thrones, you win or you die.

Hi all,

Bit of a different blog coming up – in a previous post I used Markov Clustering and said I’d write a follow-up post on what it was and why you might want to use it. Well, here I am. And here you are. So let’s begin:

In the simplest explanation, imagine an island. The island is connected to a whole bunch of other islands by bridges. The bridges are made out of bricks. Nothing nasty so far – apart from the leader of all the islands. They’re a ‘man versus superman’, ‘survival of the fittest’ sort and so one day the issue a proclamation. “Every day a brick will be taken from every bridge connected to your island and the bricks will be reapportioned on your island back to the bridges, in proportion to the remaining number of bricks in the bridge.”

At first, nobody is especially worried – each day, a brick disappears and then reappears on a different bridge on the island. Some of the islands notice some bridges getting three or four bricks back each day. Some hardly ever seem to see a brick added back to their bridge. Can you see where this will lead in 1000 years? In time, some of the bridges (the smallest ones to start off with) fall apart and end up with no bricks at all. If this is the only way between two islands, these islands become cut off entirely from each other.

This is basically Markov clustering.

For a more mathematical explanation:

```import numpy as np
transition_matrix = np.matrix([[0,0.97,0.5],[0.2,0,0.5],[0.8,0.03,0]])
``` $Transition Matrix = begin{matrix} 0 & 0.97 & 0.5 \ 0.2 & 0 & 0.5 \ 0.8 & 0.03 & 0 end{matrix}$

In the above ‘islands’ picture those numbers represent the number of bricks in the bridges between islands A, B and C. In the random-walk interpretation, those are the probabilities that you’ll end up at each node as the number of random walks tends to infinity. In my previous post on house prices, I used a correlation matrix.

First things first – I’m going to stick a one in each of the identity areas. If you’re interested in why that is, have a read around self-loops and, even better, try this out both with and without self-loops. It sort of fits in nicely with the above islands picture but that’s more of a fluke than anything else – there’s always the strongest bridge possible between an island and itself. The land. Anyway…

```np.fill_diagonal(transition_matrix, 1)
``` $Transition Matrix = begin{matrix} 1 & 0.97 & 0.5 \ 0.2 & 1 & 0.5 \ 0.8 & 0.03 & 1 end{matrix}$

Now let’s normalize – make sure each column sums to 1:

```transition_matrix = transition_matrix/np.sum(transition_matrix, axis=0)
``` $Transition Matrix = begin{matrix} 0.5 & 0.485 & 0.25 \ 0.1 & 0.5 & 0.25 \ 0.4 & 0.015 & 0.5 end{matrix}$

Now we perform an expansion step – that is, we raise the matrix to a power (I’ll use two – you can change this parameter – in the ‘random-walk’ picture this can be thought of as varying how far a person can walk from their original island).

```transition_matrix = np.linalg.matrix_power(transition_matrix, 2)
``` $Expanded Matrix = begin{matrix} 0.3985 & 0.48875 & 0.37125 \ 0.2 & 0.30225 & 0.275 \ 0.4015 & 0.209 & 0.35375 end{matrix}$

Then we perform the inflation step – This involves multiplying each element in the matrix by itself (to a power) and then normalizing on column again. Again, I’ll be using two as a power – increasing this leads to a greater number of smaller clusters:

```for entry in np.nditer(transition_matrix, op_flags=['readwrite']):
entry[...] = entry ** 2
``` $Inflated Matrix = begin{matrix} 0.15880225 & 0.23887556 & 0.13782656 \ 0.04 & 0.09135506 & 0.075625 \ 0.16120225 & 0.043681 & 0.12513906 end{matrix}$

Finally (for this iteration) – we’ll normalize by row.

```transition_matrix = transition_matrix/np.sum(transition_matrix, axis=0)
``` $Normalized Matrix = begin{matrix} 0.44111185 & 0.63885664 & 0.40705959 \ 0.11110972 & 0.24432195 & 0.22335232 \ 0.44777843 & 0.11682141 & 0.36958809 end{matrix}$

And it’s basically that simple. Now all we need to do is rinse and repeat the expansion, inflation and normalization until we hit a stable(ish) solution i.e. $Normalized Matrix_{n+1} - Normalized Matrix_n < epsilon$
for some small epsilon.

Once we’ve done this (with this particular matrix) we should see something like this: $Final Matrix = begin{matrix} 1 & 1 & 1 \ 0 & 0 & 0 \ 0 & 0 & 0 end{matrix}$

Doesn’t look like a brilliant result be we only started with a tiny matrix. In this case we have all three nodes belonging to one cluster. The first node (the first row) is the ‘attractor’ – as it has values in its row it is attracting itself and the second and third row (the columns). If we were to end up with the following result (from a given initial matrix): $Final Matrix = begin{matrix} 1 & 0 & 1 & 0 & 0\ 0 & 1 & 0 & 1 & 0\ 0 & 0 & 0 & 0 & 0\ 0 & 0 & 0 & 0 & 0\ 0 & 0 & 0 & 0 & 1\ end{matrix}$

This basically says that we have three clusters {1,3} (with 1 as the attractor), {2,4} (with 2 as the attractor) and {5} on its lonesome.

Instead of letting you piece all that together here’s the code for Markov Clustering in Python:

```import numpy as np
import math
## How far you'd like your random-walkers to go (bigger number -> more walking)
EXPANSION_POWER = 2
## How tightly clustered you'd like your final picture to be (bigger number -> more clusters)
INFLATION_POWER = 2
## If you can manage 100 iterations then do so - otherwise, check you've hit a stable end-point.
ITERATION_COUNT = 100
def normalize(matrix):
return matrix/np.sum(matrix, axis=0)

def expand(matrix, power)
return np.linalg.matrix_power(matrix, power)

def inflate(matrix, power):
entry[...] = math.pow(entry, power)
return matrix

def run(matrix):
np.fill_diagonal(matrix, 1)
matrix = normalize(matrix)
for _ in range(ITERATION_COUNT):
matrix = normalize(inflate(expand(matrix, EXPANSION_POWER), INFLATION_POWER))
return matrix
```

If you were in the mood to improve it you could write something that’d check for convergence for you and terminate once you’d achieved a stable solution. You could also write a function to perform the cluster interpretation for you.

As you were.