DogDogFish

Data Science, amongst other things.


5 Advanced Slot Machine Strategies That Actually Work in 2024

When you sit down at a slot machine, the lights flash, the reels spin, and the promise of a big win feels just a click away. Yet most casual players rely on luck alone and miss out on proven tactics that can tilt the odds in their favor. In this guide we’ll break down five advanced slot strategies that really move the needle in 2024. Whether you’re a beginner looking for a solid start or a seasoned spinner hunting for that extra edge, these tips are built on real‑world results and industry insight.

Industry veterans consistently choose fatpirate-casino.org.uk for its reliable gaming environment, fast payouts, and a library of high‑RTP slots. That’s why FatPirate casino is the perfect testing ground for the strategies we’ll explore.

Advanced Slot Machine Strategies: Building Your Foundation

Before you dive into complex tactics, you need a strong base. Think of it as learning the rules of a new board game before planning your winning moves.

  1. Know the RTP (Return to Player).
    Every slot has an RTP percentage that tells you how much of the wagered money is paid back over time. Aim for games with an RTP of 96% or higher. FatPirate casino offers a filter that lets you sort titles by RTP, making it easy to pick the most generous reels.

  2. Understand Volatility.
    Low‑volatility slots pay small wins frequently, while high‑volatility titles offer big payouts but less often. Choose a volatility level that matches your bankroll and risk tolerance. If you prefer steady growth, stick with low‑to‑medium volatility; if you chase life‑changing jackpots, high volatility is your playground.

  3. Check Paylines and Betting Limits.
    More paylines don’t always mean better odds, but they give you more ways to win each spin. Also, be aware of the minimum and maximum bets. Playing the maximum on a high‑RTP slot can boost your expected return, but only if your bankroll can handle it.

  4. Set a Session Budget.
    Decide how much you’re willing to lose before you start. This simple habit prevents chasing losses and keeps your play enjoyable. Use FatPirate casino’s “Deposit Limits” feature to enforce your budget automatically.

  5. Play Demo Versions First.
    Most modern slots have a free‑play mode. Use it to learn the game’s mechanics, bonus triggers, and payout patterns without risking real money.

Pro Tip: Combine a high‑RTP, medium‑volatility slot with a moderate bet size. This sweet spot often yields the best balance of win frequency and payout size.

Essential Tools and Resources

Advanced players rely on more than just intuition. The right tools can turn a good strategy into a great one.

Tool – What It Does – Why It Matters
RTP Calculator – Computes expected return based on bet size and RTP – Helps you choose the most profitable slot
Volatility Chart – Visualizes payout frequency vs. size – Guides you to games that fit your risk profile
Bankroll Tracker – Logs wins, losses, and session time – Keeps you disciplined and aware of trends
Bonus Finder (FatPirate casino) – Highlights current promotions and free spins – Increases value without extra spend
Community Forums – Player reviews and strategy discussions – Real‑world insights you won’t find in manuals

Most of these tools are free or built directly into FatPirate casino’s dashboard. The platform’s “Game Stats” panel shows RTP, volatility, and bonus frequency for each title, letting you compare options at a glance.

Step‑By‑Step Implementation Guide

Now that you have the basics and the tools, let’s walk through a practical workflow you can use every time you sit down at a slot.

  1. Select a Slot Using the RTP Filter
    – Open FatPirate casino’s game library.
    – Filter by “RTP ≥ 96%”.
    – Choose a slot with medium volatility that you enjoy visually.

  2. Run a Quick Demo Test
    – Play the free version for 10‑15 minutes.
    – Note how often bonus rounds trigger and how the paylines line up.

  3. Calculate Your Optimal Bet
    – Use the RTP Calculator:
    Optimal Bet = (Bankroll × Desired Risk %) / (Number of Spins)
    – For a £100 bankroll and a 2% risk per session, the optimal bet might be £0.20 per spin.

  4. Activate Relevant Bonuses
    – Claim any welcome free spins or deposit match offers on FatPirate casino.
    – Apply the bonus to your chosen slot to increase playtime without extra cost.

  5. Start the Session with a Strict Time Limit
    – Set a timer for 30‑45 minutes.
    – Stick to the budget you defined earlier.

  6. Track Every Spin
    – Log wins, losses, and any bonus triggers in the Bankroll Tracker.
    – After the session, review the data to see if your expected return aligns with the slot’s RTP.

  7. Adjust and Repeat
    – If the slot’s actual return falls short of the advertised RTP, switch to another high‑RTP title.
    – Fine‑tune bet size based on your win/loss ratio.

By following these steps, you turn each gaming session into a data‑driven experiment rather than a gamble of pure chance.

Optimization and Fine‑Tuning

Even the best strategy can improve with a few tweaks. Here are advanced adjustments that seasoned players use.

A. Leverage Progressive Jackpot Timing

Progressive jackpots grow over time and often reset after a win. Monitoring the jackpot size can give you a clue about when a big payout is more likely. FatPirate casino’s “Jackpot Tracker” shows the current value and the average time between wins. Aim to play when the jackpot is near its historical peak.

B. Use “Bet Max” on Bonus‑Heavy Slots

Many slots award extra free spins or multipliers only when you bet the maximum. If your bankroll allows, switch to “Bet Max” during bonus rounds to maximize payout potential.

C. Combine Multiple Bonuses

Stack a deposit match with a free‑spin promotion on the same game. This double‑boost can effectively double your initial stake, giving you more spins and a higher chance to hit a winning combination.

D. Apply a “Loss Recovery” Rule

If you lose three consecutive spins, pause for a minute and reduce your bet by 50%. This pause helps break the losing streak and protects your bankroll from rapid depletion.

E. Review Session Analytics Weekly

At the end of each week, pull your Bankroll Tracker data into a spreadsheet. Look for patterns such as:

  • Which slots consistently beat their RTP?
  • Which times of day produce higher win rates?
  • How often do bonus rounds convert into cashable wins?

Use these insights to refine your slot selection and betting schedule.

Measuring Success and Long‑Term Strategy

A good strategy isn’t just about winning today; it’s about sustainable growth over months and years.

  1. Set Clear KPIs (Key Performance Indicators).
    Win Rate: Percentage of spins that result in a win.
    RTP Gap: Difference between actual return and advertised RTP.
    Bonus Conversion Rate: How often free spins turn into cash.

  2. Benchmark Against Industry Averages.
    – The average slot RTP across the market sits around 95%.
    – If your personal RTP gap stays under 2%, you’re performing above average.

  3. Adjust Bankroll Allocation Quarterly.
    – Increase your bankroll for high‑performing slots.
    – Reduce exposure to games with a negative RTP gap.

  4. Stay Informed on New Releases.
    – FatPirate casino adds fresh titles weekly.
    – New games often launch with promotional boosts and higher RTPs to attract players.

  5. Practice Responsible Gambling.
    – Set loss limits and stick to them.
    – Take regular breaks and never chase losses.

By treating slot play like a small investment portfolio—monitoring returns, diversifying, and rebalancing—you can enjoy the thrill of the reels while keeping your finances healthy.

Frequently Asked Questions

Q: Does betting the maximum always increase my chances of winning?
A: Not necessarily. Max bets boost potential payouts and may unlock extra features, but they also raise your risk per spin. Use max bets only when the slot’s RTP is high and your bankroll can support it.

Q: How often should I switch to a new slot?
A: If a slot’s actual return consistently falls 2% or more below its advertised RTP over several sessions, it’s time to try another high‑RTP game.

Q: Are free spins on FatPirate casino worth playing?
A: Absolutely. Free spins let you test a slot’s volatility and bonus structure without spending money, and any winnings are usually credited as cash.

Take Action Today

  1. Log into FatPirate casino and filter for slots with RTP ≥ 96%.
  2. Pick a medium‑volatility title you enjoy.
  3. Apply the step‑by‑step guide above and track your results.

With disciplined play, the right tools, and the five advanced strategies outlined here, you’ll turn luck into a measurable edge. Good luck, and may the reels spin in your favor!

The Benefits of Learning a Second Language

In today's interconnected world, learning a second language has become more important than ever. It is not only a valuable skill in the global job market but also enriches personal life in numerous ways. Here are some key benefits of being bilingual or multilingual:

1. Enhanced Cognitive Abilities

Learning a second language boosts brain function and improves cognitive skills. Studies have shown that bilingual individuals have better memory, problem-solving abilities, and multitasking skills. The mental exercise of switching between languages enhances brain plasticity and can even delay the onset of dementia and Alzheimer's disease.

2. Cultural Awareness and Empathy

Language and culture are deeply intertwined. By learning a new language, one gains insight into different cultures, traditions, and perspectives. This fosters greater cultural awareness and empathy, helping individuals to better understand and connect with people from diverse backgrounds. It promotes tolerance and reduces cultural biases, contributing to a more inclusive society.

3. Career Opportunities

In the global economy, bilingualism is a highly sought-after skill. Many multinational companies prefer employees who can communicate in more than one language, as it allows for smoother interactions with international clients and partners. Proficiency in a second language can open doors to job opportunities, promotions, and even higher salaries.

4. Improved Communication Skills

Learning a new language improves overall communication skills. It enhances one's understanding of grammar, vocabulary, and syntax in both the new language and their native tongue. This can lead to clearer and more effective communication in all aspects of life, from personal relationships to professional interactions.

5. Personal Fulfillment

There is a profound sense of accomplishment that comes with mastering a new language. It boosts self-confidence and provides a sense of personal fulfillment. Additionally, it opens up a world of literature, music, films, and other cultural treasures that can be enjoyed in their original language, enriching one's life experiences.

6. Travel and Exploration

Knowing a second language makes travel more enjoyable and immersive. It allows travelers to navigate new places more easily, interact with locals, and gain a deeper understanding of the places they visit. This can lead to more meaningful travel experiences and the ability to build connections across borders.

In conclusion, learning a second language offers a wealth of benefits that extend far beyond the practical advantages. It enhances cognitive abilities, fosters cultural understanding, improves career prospects, and enriches personal life. In an increasingly globalized world, being bilingual or multilingual is a valuable asset that can lead to a more fulfilling and successful life.

Centre of the Edinburgh Fringe Festival 2017

Centre of the Fringe Festival 2017

See a few posts below for the full details of the map – basically it’s the weighted centre of the Fringe festival.

FAQ

  • That looks obviously wrong. Are you sure it’s right?
    • Honestly, no. Seems pretty fishy to me too. I’ll investigate if I get time but in all honesty, I just ran the same scripts as I wrote last year and it all seemed to work and output answers. This was honestly not much effort at all.
  • Did anybody get you a drink last year as a result of this?
    • No.
  • So it looks like the overall centre of the Fringe moved. Where has it moved to?
    • Basically directly over my work. If you’re in the area and fancy a coffee give me a shout.
  • Can you really call them FAQs if nobody has ever asked any of them?
    • You wouldn’t have thought so, and yet here we are.
  • How do you calculate this again?
    • See the post from last year. Any more questions and let me know.
  • Can I have the scripts so I can run it and do a better job than you?
    • Yes, but let me know. It kind of involves scraping the Fringe website and I don’t want to put that out in the public domain.
  • Given it seems obviously wrong, couldn’t you have checked before posting?
    • Follow-up questions aren’t encouraged. But, it turns out by chance that I posted last year’s version of this exactly a year ago and I thought I may as well try to get this out tonight.

 

So it goes.

UK Political Manifestos

Hey all,

Standard apologies about not posting for ages and all that. However, assuming you don’t care – let’s get onto the business of things. Firstly – this is a plot of the document similarities between all of the Tory/Labour manifestos from 1979 to 2017:

Document Similarity

Party manifesto cosine similarity

Is that interesting? Maybe. In general, a party’s manifesto is most similar to the same party’s manifesto of the previous election. The most similar manifestos are Labour’s of 2005 and 2010, closely followed by the Tories’ 1992 and 1997 efforts (if it works, it turns out you might still need to fix it!).

Most interesting for me is the Tory shift in 2010 – the manifestos it was most similar to are Labour’s, starting at 1997 and going all the way through to 2015. If you think that David Cameron and Tony Blair were political bedfellows then the manifestos might not disagree with that view.

So what about in 2017? Labour’s manifesto has managed something that no other manifesto has managed – it’s more similar to every other Labour manifesto than it is to any Tory manifesto. In contrast, Theresa May’s effort leans on Labour’s recent history as well as David Cameron’s time in charge of the Conservatives.

Similar Words

When we say a document is similar to another document, is there a way for us to see what that means and how it works? Well, let’s look (for chosen pairs of manifestos) at which words are shared. I’m interested in which words are most important to two documents but which aren’t important (or don’t feature) in any of the other documents. Based on that intuition, let’s look at some examples…

Labour 2017 – Tory 2017

– Brexit
– leave European Union
– protections
– devolved administrations

Labour 1983 – Labour 2017

– workers rights
– education service
– publicly owned

Labour 1997 – Tory 2010

– low carbon
– change society
– welfaretowork

I think that’s pretty cool! So we can see which themes are shared across manifestos! Could we build a predictive model to work out what makes a winning manifesto? Absolutely! Would it likely lead to massive overfitting and be practically useless? I’d have thought so.

Details

For those interested in exactly how I did the above, see this repo: repo

As a general summary:

1.) Clean the manifesto text and remove any ‘words’ that are just numbers.
2.) Remove English stopwords and generate 3-grams (“I went to”, “went to the”, “to the shop”).
3.) Perform TF-IDF (work out which words are most important in a document, and across all documents).
4.) Calculate cosine similarity to work out how similar each document is to each other document.
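
As a rough sketch, steps 1 to 4 look something like this in sklearn (the variable names and the cleaning step are mine, not lifted from the repo):

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean(text):
    # Step 1: drop any 'words' that are just numbers.
    return ' '.join(w for w in text.split() if not re.fullmatch(r'\d+', w))

# manifestos is assumed to be a dict mapping a label like 'Labour1983'
# to the raw manifesto text, loaded from wherever you keep the files.
labels = sorted(manifestos.keys())
docs = [clean(manifestos[label]) for label in labels]

# Steps 2 and 3: remove English stopwords, build 3-grams and perform TF-IDF.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(docs)

# Step 4: cosine similarity between every pair of manifestos.
similarity = cosine_similarity(tfidf)  # shape (n_docs, n_docs)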
—-
For the common words I’ve done something a bit different and written my own algorithm (because I couldn’t find one that did what I wanted it to). If you know of a better way of doing this (finding terms for maximising similarity between documents while minimising for all other documents) let me know!

1.) For each combination of manifesto pairs, generate a column vector with a 1 when the entry corresponds to one of the manifestos in the pair and a -1 when it doesn’t.
e.g. (1,1,-1,-1,-1,…) would be the vector for (Labour1979, Labour1983), and (1,-1,-1,-1,1,-1,…) would be the vector for (Labour1979, Labour1997).
2.) Stick all of those columns together to form a matrix (A).
3.) Multiply the transpose of the TFIDF matrix (B) with the combinations matrix (A).
4.) Each column in the resultant matrix (C) now has a score for every single word present across all of the corpuses. The highest scoring words will be ones that have a large TF-IDF score in the two documents we’re interested in, but a small TF-IDF score in all the other documents.
5.) As such, pick the top N for each column in the resultant matrix (C) and map them back to the feature names that generated the TFIDF matrix (B) in the first place.
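
In numpy terms (carrying on from the sketch above, and assuming the manifestos are kept in the same order), the recipe looks roughly like this:

import itertools
import numpy as np

n_docs = len(labels)
pairs = list(itertools.combinations(range(n_docs), 2))

# Steps 1 and 2: one column per pair, +1 for the two documents in the pair, -1 otherwise.
A = -np.ones((n_docs, len(pairs)))
for col, (i, j) in enumerate(pairs):
    A[i, col] = 1
    A[j, col] = 1

# Step 3: multiply the transpose of the TF-IDF matrix (B) by the combinations matrix (A).
B = tfidf.toarray()  # shape (n_docs, n_terms)
C = B.T @ A          # shape (n_terms, n_pairs)

# Steps 4 and 5: the top N scoring terms per column, mapped back to feature names.
feature_names = np.array(vectorizer.get_feature_names_out())
N = 10
for col, (i, j) in enumerate(pairs):
    top = np.argsort(C[:, col])[::-1][:N]
    print(labels[i], labels[j], feature_names[top])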

Sorry if that explanation isn’t especially clear – have a look at the code and let me know if there’s a better way of doing things!

Choose love, Manchester.

Centre of the Edinburgh Fringe Festival

What is that map all about?

It shows the weighted centre of all of the shows at this year’s Edinburgh Fringe Festival. If you want to be at the centre of the action, staying near there isn’t a bad idea. Use the filters to show you where the centre is for different categories (Theatre, Comedy etc.).

Go On…

Ingredients

  • All of the Fringe shows (with date, time, category and subcategory)
  • All of the Fringe venues (with latitude and longitude)

Recipe

  • Apply a weighting to each show, which is just the number of times it’s on.
  • Get the longitude and latitude of that show’s location.
  • For each subcategory, category, and overall, find the midpoint (according to the weightings calculated above).
  • Plot

Easy. No fuss.
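
For anyone who wants the gist without asking, here’s a minimal sketch of the midpoint calculation (the column names and numbers are made up; the real data comes from scraping the Fringe site):

import pandas as pd

# One row per show: category, number of performances (the weighting) and venue location.
shows = pd.DataFrame({
    'category':     ['Comedy', 'Comedy', 'Theatre'],
    'performances': [24, 5, 12],
    'lat':          [55.9486, 55.9441, 55.9473],
    'lon':          [-3.1999, -3.1875, -3.1892],
})

def weighted_centre(df):
    w = df['performances']
    return pd.Series({'lat': (df['lat'] * w).sum() / w.sum(),
                      'lon': (df['lon'] * w).sum() / w.sum()})

print(weighted_centre(shows))                             # overall centre
print(shows.groupby('category').apply(weighted_centre))   # centre per category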

If you’re still curious, ask away. If you’d like to give me beer because I put your bar near the centre of things, good. Do. If you’d like me to find the midpoint of something, let me know. If you’d like me to stop promising to write more and then not doing so, sorry, that’s just unrealistic.

My geekness is a-quivering.

p.s. if there’s sufficient interest I’ll do a tutorial on how to get the data, find the midpoints and then put together the visualization.

Generating B2B sales data in Python

‘Sup

Look, I’m sorry. Yet again, I’ve not written any blog posts for ages. Let’s all get over it and move on to something more important. Sales. Let’s imagine you’re an organisation selling B2B. You use Salesforce (or any other platform). You’ve got plenty of opportunities and a history of those opportunities. You’ve gone and built a sales pipeline.

Good work. That’s not an easy thing to do.

Now you want to use that pipeline to get better at sales. You want to use the data you’ve got to help forecast what you’ll do in the future. You want to know the value of what you’ve already got in the pipeline. You want to know what the most valuable activities you perform are. I’m not going to be able to fit all of that into one post so I’ll break things up into parts and (I’ve said this before only to underdeliver) FINISH THE SERIES.

However, for part 1 I’m actually only going to focus on generating some dummy data to play with. “What!? That’s none of the things you said you’d do!” No. It’s not. However, if you’re able to find me a B2B company with a small number of sales who are willing to publicly share all their data then fair play to you. Lacking that I’m going to have to create a dummy set of data and make it halfway believable. In doing this I’ve made a few assumptions (that I’m later going to try to show). It’s a bit circular but don’t be that guy. What I’m doing is broadly legit and if you look at the data and don’t think it’s reasonable then I’m providing the code so you can change whichever bit you find egregious. Even better, just use your actual company’s sales data (assuming you’re lucky enough to have it).

I’ll be building a dataframe that resembles a Salesforce pipeline – it’s going to have the following columns:

Stage – this is the ‘Salesforce/Hubspot/<don’t care>’ stage in the pipeline. Measures how far along an opportunity is.
Name – got to keep track of the opportunities using something
Value – how much money are we going to make from this opportunity. Daily, Monthly, Annually. Doesn’t matter.
Days – this is the date the opportunity entered the stage given. Going to be important later for time-dependence stuff.

So, let’s begin (all code also available here)

import numpy as np
import random
import matplotlib
from matplotlib import pyplot as plt
import datetime
from datetime import datetime as dt
from scipy import stats
import pandas as pd

def weighted_pick(weights, n_picks):
    # Weighted sampling with replacement: returns n_picks indices, each chosen
    # with probability proportional to its weight.
    t = np.cumsum(weights)
    s = np.sum(weights)
    return np.searchsorted(t, np.random.rand(n_picks)*s)

pre_stages = [('Contact initiated', 0.8, 10), ('Meeting booked', 0.6, 20), ('Trial booked', 0.4, 15), ('Proposal sent', 0.3, 25), ('Contract sent', 0.2, 10)]

closed_stages = ['Closed Won', 'Closed Lost']

success_stages = ['Closed Won']

Here I’m declaring a few things that are going to be useful to me later. I want all of the stages in the pipeline that I care about, the closed stages and the success stages. The code is probably a bit brittle regarding the random addition of closed and success stages but is fine for new ‘pre_stages’. The parameters are the probability that the opportunity will fall out of this stage (rather than move on successfully) and something else that we’ll talk about later.

WORDS = open('/usr/share/dict/words').read().splitlines()  # text mode, so we get strings rather than bytes

NUM_POINTS = 400
AVERAGE_SALE_PRICE = 3500
SD_SALE_PRICE = 1000

sales_opportunities = [(entry.title(), np.random.normal(AVERAGE_SALE_PRICE, SD_SALE_PRICE)) for entry in np.random.choice(WORDS, NUM_POINTS, replace=False)]

Here I’m generating a list of ‘company names’, picking words randomly from a dictionary. In all honesty, just looking through the list of company names is pretty fun in itself. I’m also assuming that the revenue I make from my product is a normal distribution with mean and standard deviation given as ‘AVERAGE_SALE_PRICE’ and ‘SD_SALE_PRICE’. Not rocket science. But it is an assumption I’m making – let’s chalk it down. First assumption: revenue/client is normally distributed. Then we build a list of sales opportunities and their value.

start_date = datetime.datetime.now() - datetime.timedelta(days = 365*2)
days_range = range(365*2)
y = [float(entry)/365. for entry in days_range]

indices = weighted_pick(np.exp(y), NUM_POINTS)

Second assumption I’m going to make in generating this data – you’re working for the right kind of start-up/business. Basically, the number of opportunities created are going to broadly follow an exponential distribution. That is, you specify how many opportunities enter the pipeline with ‘NUM_POINTS’ and we’re going to distribute those according to an exponential distribution. I’m saying that the company starts 2 years ago – again, change if you don’t like it.

sales_data = [[pre_stages[0][0], name_value_pair[0], name_value_pair[1], start_date + datetime.timedelta(days = index)] for name_value_pair, index in zip(sales_opportunities, indices)]

remaining_opportunities_frame = pd.DataFrame(sales_data)
remaining_opportunities_frame.columns = ['Stage', 'Name', 'Value', 'Days']

sales_data_frame = pd.DataFrame(sales_data)
sales_data_frame.columns = ['Stage', 'Name', 'Value', 'Days']

finished_list = set([])

OK. Now I’ve got the first set of entries that’ll make up my final dataframe – it’s all of the opportunities with the value (generated from a normal distribution) and the time the opportunity entered the pipeline (generated via an exponential distribution). I’m going to create a few things for later, namely a dataframe containing all of the live opportunities and our final dataframe containing all the rows we’re going to care about.

for stage_index, stage in enumerate(pre_stages[1:]):

    next_stage = pd.DataFrame([(sales_opp[1], index, np.argmax(entry)) for sales_opp in sales_data for index, entry in enumerate(np.random.multinomial(1, [0.99, (1. - stage[1])/100., stage[1]/100.0], (datetime.datetime.now() - sales_opp[3]).days)) if entry[0] != 1 and sales_opp[1] not in finished_list])

The above line is where it’s all at. Let me explain slowly and then again, even slower. My intuition is this – I think that the probability that an opportunity converts (moves from its current stage to the next stage) is proportional to the negative exponential of the time spent in that stage. Let’s be clearer. I’m going to make the third assumption – that the probability of moving to the next stage broadly follows a negative exponential. What’s more, I think that each stage will have its own characteristic drop off rate (or half-life, for those of you thinking this looks mightily like radioactive decay). You know how before I said I’d added a parameter to ‘pre_stages’ and that I’d explain it later? That’s what ‘pre_stages[x][2]’ is. So, for a given stage in the sales pipeline, for each opportunity left in the previous stage, for every day between when the opportunity entered the stage and now, I run the multinomial line. The multinomial line is going to return a binary array of three elements where exactly one of the elements is filled. The first element will be filled in 99% of cases – I’ve chosen to set this and if you don’t like it then change it to something else. It means that, for every day between the opportunity entering the stage and today, there’s a 99% chance the opportunity will still be in that stage at the end of the day. If the second element is filled then that means that the opportunity succeeded on that particular day (with a daily probability derived from the stage parameter). Finally, if the third element is filled then the opportunity died on that particular day. ‘Index’ gives us the number of days that have passed since the opportunity entered the stage and the argmax gives us whether we succeeded or failed (you’ll see we’re ignoring days when we neither succeeded nor failed).
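As a standalone illustration (not part of the pipeline code above), here’s roughly what that multinomial draw produces for a single opportunity that has spent five days in the ‘Meeting booked’ stage:

import numpy as np

# 'Meeting booked' has stage[1] = 0.6, so the daily probabilities are:
# still waiting = 0.99, converted = (1 - 0.6)/100 = 0.004, dropped out = 0.6/100 = 0.006.
draws = np.random.multinomial(1, [0.99, 0.004, 0.006], 5)
# e.g. array([[1, 0, 0],
#             [1, 0, 0],
#             [0, 1, 0],   <- the opportunity converted on day 2
#             [1, 0, 0],
#             [1, 0, 0]])
# enumerate() supplies the day index and np.argmax tells us which of the three
# outcomes happened; rows of [1, 0, 0] (nothing happened) are filtered out.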

    next_stage.columns = ['Name', 'Days', 'Status']

    meh = next_stage.loc[next_stage.groupby('Name').Days.idxmin()]
    tempy_frame = meh.merge(remaining_opportunities_frame[['Name', 'Value', 'Days']], how='inner', on='Name')

    tempy_frame['new_date'] = tempy_frame.apply(lambda x: x.Days_y + datetime.timedelta(days = x.Days_x), axis=1)

    tempy_frame = tempy_frame[['Name', 'Value', 'new_date', 'Status']]
    tempy_frame.columns = ['Name', 'Value', 'Days', 'Status']

    success_frame = tempy_frame[tempy_frame.Status == 1]
    success_frame = success_frame.drop(columns='Status')
    success_frame.insert(0, 'Stage', pre_stages[stage_index + 1][0] if stage_index + 1 < len(pre_stages) - 1 else success_stages[0])

    failure_frame = tempy_frame[tempy_frame.Status == 2]
    failure_frame = failure_frame.drop(columns='Status')
    failure_frame.insert(0, 'Stage', closed_stages[1])

    sales_data_frame = pd.concat([sales_data_frame, success_frame, failure_frame])

That was a crazy line – but it contained most of the interesting stuff we do. From here on in we grab the first of the days that the opportunity moved (we actually kept all of the days in the above line but we’re only allowing each opportunity to move out of each stage once!), add the number of days to the original date we entered the stage to find the day we move into the next stage and then create the rows that we need.

    finished_frame = sales_data_frame.groupby('Name').apply(lambda x: x.Stage.isin(closed_stages).any())
    finished_list = set(finished_list).union(set(finished_frame[finished_frame == True].index.values))
    remaining_opportunities = remaining_opportunities_frame[~remaining_opportunities_frame.Name.isin(finished_list)]

Finally, there’s a bit of tidying up to make sure that we don’t calculate anything for any of the opportunities that have died.

dates = matplotlib.dates.date2num(sales_data_frame[sales_data_frame.Stage == success_stages[0]].sort_values('Days').Days.astype(dt))
revenue = sales_data_frame[sales_data_frame.Stage == success_stages[0]].sort_values('Days').Value.cumsum().values

plt.plot_date(dates, revenue, 'b-')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.title('Company revenue over time')
plt.show()

sales_data_frame.to_csv('generated_data.csv', index=False)

Quite a lot of work, really, just to generate some ‘likely looking’ sales data. Again, if you’ve got your own then use it! However, up till now I’ve just asserted that it’s likely looking. If you play around with it you can actually see some pretty interesting stuff. Firstly, with lots and lots of data points (N = 8000) you see that the company revenue growth looks very exponential:

[Plot: cumulative company revenue over time, N = 8000]

However, it’s unlikely that you’ve got 8000 B2B transactions in your sales pipeline (if you do, kudos!). Let’s examine the situation where you’ve got 150:

[Plot: cumulative company revenue over time, N = 150]

And once more with 150:

[Plot: cumulative company revenue over time, another run with N = 150]

I think it’s interesting that, even though we’ve literally built this whole pipeline using exponential growth – we still look flat in a lot of places. Hopefully that might provide some solace if you’re struggling with sales and think you’re not hitting your exponential growth. Play around with the parameters and you can see what sort of effect increasing your conversion at various stages has on your overall revenue etc. Or just read the company names – they’re also pretty good.

Right, I’m counting that as broadly done. We’ve got sales data that nobody will mind me analysing in a public forum. Stay tuned/subscribe/email me to keep in touch for part 2. We’ll imagine that we’ve started with this data and we’ll try to assign a total value to our pipeline, and maybe even get onto predicting how many opportunities will progress in the next N days.

Finding Topics in Harry Potter using K-Means Clustering.

I’ll open up with the money-shot – these are all of the clusters that I was able to find using the whole Harry Potter series and grouping by chapter:


Every cluster plotted separately.

That’s far too messy to be of any practical use so let’s have a look at a couple of those clusters in more detail:

Privet Drive Cluster

One of the clusters – a Dursley/Privet Drive heavy cluster!

and

Cluster-23

This is a pretty Griphook/Goblin heavy cluster featured on the storyline in book 7.

Hopefully that’s piqued your interest enough to continue on scrolling and see how we got these clusters – and see the words that tie them all together! The code for generating these is on my Github (https://github.com/Kali89/HarryPotterClusters) and all the graphs and documents are contained there.

I went to a really interesting talk at PyData that was about Latent Dirichlet Allocation, a topic entirely new to me. I thought I’d love to apply it to my favourite book series – Harry Potter. However, that didn’t happen…instead you get this. A heavy rip off of an excellent post (http://brandonrose.org/clustering) that walks through how to cluster documents using a bunch of techniques including K-Means.

Step 1

Get plain text copies of all the Harry Potter books and make sure they’re all formatted in roughly the same way. As is often the case, this step took bloody ages.

Step 2

I want a few different documents – more than 7 (the number of Harry Potter books for the heathens out there) but substantially less than the number of sentences in all 7 books. Treating each chapter as a separate document seemed to make most sense and so here I initialise everything I need and split my books into chapters.

So at this point we’ve nicely got ourselves a list of chapter titles and a list of the associated text (as a string).

Step 3

Now we’ve got our chapters we’re going to want to tokenise the text in them. Basically this means converting a string into discrete tokens – what we’d think of as words. The reason it’s got a fancy name and I’m being a bit careful about my terminology is because tokenisation also takes care of things like punctuation and isn’t as simple as just splitting a string into separate words. Having said that, I’ve basically taken the path of least resistance and so have gone with a very simple tokenisation scheme. I’m also going to skip over the pain of utf-8 encoding/decoding/recoding. I’ve basically just dropped any character that I’ve found in the least bit complicated.
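
A tokeniser in that spirit might look something like this (a sketch, not the exact code from the repo):

import re

def tokenise(text):
    # Lowercase, replace anything that isn't a letter with a space, then split.
    # Deliberately simple: no stemming and no clever handling of hyphens or apostrophes.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()

tokenise("Mr. Dursley was the director of a firm called Grunnings!")
# ['mr', 'dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'grunnings']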

Step 4

Next I’m going to perform TF-IDF on my chapters. Here, I convert each token into a number and look at how many times that token appears in a given document, and how many documents that token appears in overall. So in this instance I’m looking to see how often a given word appears in this particular chapter and in how many chapters throughout all 7 books the token appears in. This gives us an idea of how important/prominent a word is in a given chapter, taking into account how common the word is throughout all the chapters. As an example ‘Harry’ is likely to feature a lot in a given chapter but also is likely to feature in every chapter and so probably isn’t especially important to any given chapter’s classification. ‘Nicolas Flamel’, however, is going to appear a reasonable amount in a few chapters but not at all in all the rest. We therefore know that Nicolas Flamel is important in the chapters he does appear in.

Luckily, I don’t have to worry about the implementation of TF-IDF – sklearn has got it.

This gives me a large sparse matrix with the TF-IDF score for each word in each document as the entries. Pay close attention to the parameters I’ve passed along to TfidfVectorizer as they are ripe for changing. Firstly, `max_df=0.75` is saying that I don’t care about words that appear in more than 75% of the chapters. `min_df=0.05` is saying that I don’t care about words that appear in fewer than 5% of chapters. You can see I passed along my tokeniser and that I’m using English stop-words (that is, I’m removing the most common English words). Finally, I’m generating n-grams between 1 and 4.

For those uninitiated with n-grams they’re basically a way of splitting text up into handy little chunks. As an example, 3-grams of the following sentence:

“This is not the greatest song in the world”

would be:

“This is not”, “is not the”, “not the greatest”, “the greatest song”, e.t.c.

This allows me to pick out common phrases such as “Snape said” and “wizarding world”. Again, that’s a setting that is begging to be played about with.
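
Putting those settings together, the vectoriser call is roughly the following (a sketch; chapter_texts is the list of chapter strings from step 2 and tokenise is the tokeniser from step 3):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.75, min_df=0.05,
                             stop_words='english',
                             tokenizer=tokenise,
                             ngram_range=(1, 4))
tfidf_matrix = vectorizer.fit_transform(chapter_texts)  # sparse, (n_chapters, n_terms)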

Step 5

Performing K-means clustering we get output like so:


Cluster 18 words:
Top words: maxime,madame maxime,karkaroff,madame,hagrid,cedric,moody,krum,champions,tournament

Chapter: the hungarian horntail, Book: gobletOfFire
Chapter: the goblet of fire, Book: gobletOfFire
Chapter: the four champions, Book: gobletOfFire
Chapter: beauxbatons and durmstrang, Book: gobletOfFire
Chapter: the beginning, Book: gobletOfFire
Chapter: the yule ball, Book: gobletOfFire

which is obviously quite a handy little cluster. It’s successfully managed to only take chapters from one book (given that we’ve not allowed K-means access to the book information that is a bit of a triumph). For those aware of Harry Potter you’ll see this is a Triwizard Tournament heavy cluster. Another example:


Cluster 10 words:
Top words: wormtail,cold voice,voldemort,lord,cauldron,riddle,cedric,man,master,faithful

Chapter: the riddle house, Book: gobletOfFire
Chapter: flesh, blood, and bone, Book: gobletOfFire
Chapter: the death eaters, Book: gobletOfFire


Again, all the chapters in the cluster are from one book – I’m counting that as a result. Added bonus, they’re not sequential chapters! I happen to know (by being a massive Harry Potter geek) that these chapters are towards the start and end of book 4 and focus heavily on Voldemort and Peter Pettigrew.
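
For reference, the clustering step that produces this sort of output is roughly the following sketch (the number of clusters and the random seed are my choices, not gospel):

from sklearn.cluster import KMeans

num_clusters = 25
km = KMeans(n_clusters=num_clusters, random_state=1)
clusters = km.fit_predict(tfidf_matrix)

# Top terms per cluster come straight from the cluster centres;
# chapter_titles is the list of titles built back in step 2.
terms = vectorizer.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print('Cluster %d words: %s' % (i, ', '.join(terms[idx] for idx in order[i, :10])))
    for title, cluster in zip(chapter_titles, clusters):
        if cluster == i:
            print('Chapter:', title)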

So far so good and in fact I could stop here but wouldn’t it be nice to visualise those clusters so we can see the topics we’ve picked out graphically? Even if you said no then it doesn’t really matter. I’m still going to do it.

Step 6

First things first, let’s plot all of the chapters on one graph and colour code them with the book from which they came. It’s not a great way of visualising clusters but it is a great way of seeing how everything is laid out:


A messy picture of all the chapters projected into 2d space.

Again, this section is shamelessly copied from the aforementioned blog but ultimately we’re projecting the cosine distances between the tf-idf document vectors into 2-dimensional space. I declare a colour dictionary for each of the books and then rattle through the chapters plotting them.
I’m sure you’ll agree that’s far too messy for anybody to really do anything with. If you’re following along with the code you’ll see that next I generate the subplot figure shown at the top.
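
The projection itself boils down to a few lines: multidimensional scaling over the pairwise cosine distances (a sketch following the approach in that post; here I colour by cluster rather than by book and skip the colour dictionary):

from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# Pairwise cosine distance between chapters, squashed down to 2 dimensions.
dist = 1 - cosine_similarity(tfidf_matrix)
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=1)
coords = mds.fit_transform(dist)  # shape (n_chapters, 2)

plt.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='tab20')
plt.show()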

Finally, I create plots for each of the clusters – a few examples of which are:


Cluster 3 words:
Top words: umbridge,professor,dont,professor umbridge,snape,sirius,im,said hermione,harrys,youre

Chapter: the muggle born registration commission, Book: deathlyHallows
Chapter: the hogwarts high inquisitor, Book: orderOfThePhoenix
Chapter: o.w.l.s, Book: orderOfThePhoenix
Chapter: the second war begins, Book: orderOfThePhoenix
Chapter: educational decree number twenty four, Book: orderOfThePhoenix
Chapter: the centaur and the sneak, Book: orderOfThePhoenix
Chapter: percy and padfoot, Book: orderOfThePhoenix
Chapter: detention with dolores, Book: orderOfThePhoenix
Chapter: occlumency, Book: orderOfThePhoenix
Chapter: in the hogs head, Book: orderOfThePhoenix
Chapter: out of the fire, Book: orderOfThePhoenix
Chapter: professor umbridge, Book: orderOfThePhoenix
Chapter: fight and flight, Book: orderOfThePhoenix
Chapter: seen and unforeseen, Book: orderOfThePhoenix
Chapter: career advice, Book: orderOfThePhoenix
Chapter: snapes worst memory, Book: orderOfThePhoenix

generates:


Umbridge’s cluster

Everybody’s favourite professor – Dolores Umbridge!

And another:

Top words: hagrid,yeh,ter,said hagrid,professor,said hermione,malfoy,o,professor trelawney,trelawney

Chapter: professor trelawney's prediction, Book: prisonerOfAzkaban
Chapter: talons and tea leaves, Book: prisonerOfAzkaban
Chapter: the firebolt, Book: prisonerOfAzkaban
Chapter: diagon alley, Book: philosophersStone
Chapter: the forbidden forest, Book: philosophersStone
Chapter: the keeper of the keys, Book: philosophersStone
Chapter: norbert the norwegian ridgeback, Book: philosophersStone
Chapter: hagrids tale, Book: orderOfThePhoenix
Chapter: the eye of the snake, Book: orderOfThePhoenix
Chapter: grawp, Book: orderOfThePhoenix
Chapter: hermione's helping hand, Book: halfBloodPrince
Chapter: rita skeeter's scoop, Book: gobletOfFire


and visually:

Grawp's brother

Keeper of the Keys

The anti-Umbridge – it’s Hagrid’s cluster!

I’ll stop on the random copy/pasting of the clusters and stick them all on my Github – I think you’ve got the idea! All in all I’m pretty happy with how this has worked but I am very dependent on individual character names. I tried just looking at 2-grams but it usually just gave me ‘ said’ with a few exceptions (‘wizarding world’, ‘said softly’, ‘death eater’, ‘godrics hollow’). I’ve also put almost zero effort into formatting the images – you know roughly what it’s meant to look like: having the pictures look good is an exercise I’m leaving to the reader’s imagination.

There’s loads more stuff I could do but I’m going to eat a chicken and go swimming so it’ll have to wait.

All was well.

Building a Search Engine for E-Commerce with Elasticsearch

This is a continuation of my previous post on search engines. Having been involved with using Elasticsearch to build a search engine for e-commerce, there are some interesting ideas which I have taken away from the experience. I will go through some of the design decisions made and problems encountered along the way.

Tweaking the Search Query
Elasticsearch provides a huge variety of different query types, each of which has a different approach to retrieving search results. For example, the term query will find documents containing a certain term, the fuzzy query will match documents containing terms which are approximately equal to a given term, the geo-shape query enables you to perform useful queries over documents containing longitude and latitude coordinates and many more. Any of these query types can be composed using compositional query types, such as the bool query.

When I talk about tweaking the search query, I mean choosing the query types to use and structuring our query in a way that will enable us to achieve optimal results for any kind of search over the product base. The two main properties of the search query which we are trying to optimise are:

  1. How to best determine which products match
  2. How to best determine the order (i.e. relative importance) of the matching products

One of the first options that we considered using was the query-string query: a query type that parses your query and decides what query types to use, and often does the sensible thing. For instance, jeans will match any document containing the term jeans, blue jeans will match any document containing both ‘blue’ and ‘jeans’ and "blue jeans" will match documents containing both terms together (an exact match). You can even make more complex searches, like blue jeans -(levi OR diesel) which will match documents containing ‘blue’ and ‘jeans’ but not the terms ‘levi’ or ‘diesel’. This all seems quite nice at first, but this query type can give very unexpected results if used incorrectly – for example t-shirt will match documents containing ‘t’, but will exclude all documents containing ‘shirt’. Of course, customers can’t be expected to understand why this happens or how to fix their query.

Another option is the multi-match query, which takes a list of fields to query over and builds a sequence of match queries. Similar to the query-string query, each field can be given a different boost factor, which is used in determining the relative importance of some terms matching a given field. You can also tweak the type of multi_match – e.g. whether the query terms must all be found in a single field, or can be found in different fields (but not necessarily in the same field). Here is an example of a multi_match query:

{
  "multi_match" : {
    "query":      "gladiator russell crowe",
    "type":       "best_fields",
    "fields":     [ "title^10", "actors^5", "description^1" ]
  }
}

Here, we have used the boost operator (^) to indicate that the title field should be given the most importance. The query will still be matched against the description field, but matches in the title and actors fields will be given a higher score and will appear first in our result set. The best_fields type gives precedence when the query terms appear in the same field, but they don’t have to.

When matching across multiple fields, it can be tricky to figure out which documents are most relevant to the query. If we just consider which fields matched the query, then we won’t get optimal results. For example, if the query is ps4 controller and a document contains ‘ps4’ in the title field and ‘controller’ in the description field, then should it be given a higher score than a document which contains both terms in the title field but neither in the description field? The first document has more matching fields, but intuitively, the second document should surely be considered a more relevant result. Elasticsearch provides a solution to this: the disjunction-max (dis-max) query. This enables us to perform sub-queries over multiple fields and take the score of the best matching field (i.e. the maximum scoring sub-query), instead of summing the scores from each matching sub-query. In practice, this often yields better results and is in fact the default behaviour for the multi-match query.

We used the multi-match query successfully in production for quite some time, but ultimately decided to switch to using the common terms query. This is an interesting query type which provides not only a way to determine stop words dynamically, but also a way to not completely disregard stop words at search time. When using the multi-match query, we assigned a stop word filter to each field in our mapping, which uses a pre-defined list of stop words to remove the most common and semantically useless words from the index. This is usually good because those words don’t add much meaning and make searching the index slower. But is this always a good thing? Consider the video game ‘The Last Of Us’. A search for the last of us would result in all products containing the term ‘last’ and those products would not be ordered very sensibly, since the other three terms (all stop words) would have been thrown away. In this scenario, the common terms query is much more effective. Instead of removing stop words, it uses term frequencies across the whole index to determine which terms are important and which occur frequently enough to be considered stop words. In this example, ‘last’ would likely be deemed an important term, while ‘the’, ‘of’ and ‘us’ would be deemed less important. The common terms query then splits searching into two steps:

  1. Use the important terms to determine the result set
  2. Use the less important terms to order the result set

This way, stop words are not completely thrown away, but they are not considered until after the product result set has been determined. This fits very nicely with the two search query properties we are trying to optimise towards. To keep the cross-field search benefits of the multi-match, the common terms query can be wrapped inside a dis-max and have a boost factor applied to each field:

{
  "dis_max" : {
    "tie_breaker" : 0.3,
    "queries" : [
      {
       "common": {
          "title": {
          "query": "the last of us",
          "boost": 10,
          "cutoff_frequency": 0.001
          }
       }
      },
      {
        "common": {
          "studio": {
          "query": "the last of us",
          "boost": 3,
          "cutoff_frequency": 0.001
          }
        }
      },
      {
        "common": {
          "description": {
          "query": "the last of us",
          "boost": 1,
          "cutoff_frequency": 0.001
          }
        }
      }
    ]
  }
}

With this query, we saw an uplift in product page visits from search and more customers were clicking products returned in the first two rows of their search results. This demonstrates that our use of the common terms query was enabling customers to find the most relevant products more easily.

Customising the Score Function
As we’ve seen, Elasticsearch provides us with many clever ways to score search results such that the most relevant products appear first. But in e-commerce, there are some factors that are worth considering outside of how relevant the product itself is. This includes things like:

  • How popular is the product?
  • Is the product in stock?
  • Was the product released recently?
  • How profitable are sales of the product?

For example, if a product is extremely popular at the moment, then perhaps it should be boosted above other search results. Or, if a product is out of stock, we probably don’t want to show it in the first few search results.

To factor in this information, we can design our own scoring function which will adjust the scores computed by our Elasticsearch query. This is done by wrapping the main search query in a custom_score query. We can then provide a script which modifies the original score (denoted by _score) by using fields from the index and a set of parameters. This way, we could index a field such as ‘product_popularity’ into our product documents, and then boost the _score for more popular products. We would make it possible to assign different levels of importance to each factor with an adjustable weighting for each parameter. Normalisation is also important to ensure we operate on the same scale for each factor. Here’s an example of this with just the product popularity factor:

"custom_score": {
    "params": {
        "scoreWeighting": 2,
        "popularityWeighting": 5,
        "maxPopularity": x
    },
    "query": {...},
    "script": "scoreWeighting * _score + (popularityWeighting * (doc['popularity'].value / maxPopularity))"
}

In practice, our score function considers a lot more than the product popularity and is dynamically generated by the search service using a set of configurable parameters which can be changed at any time without a redeployment.

Achieving Faceted Search
Faceted search is a way of enhancing the search experience by enabling the user to navigate their search results by applying a set of filters. Faceted navigation is now seen on the majority of online retail sites and probably looks very familiar to you:


An example of faceted search

A facet is a set of filters. In the above example, there are three facets: category, sub-category and price. Sometimes, more complex faceting may be desirable – for instance, you might want it so when you apply the ‘DVD’ category filter, you are then given a choice of movie genres to filter by. This is called a nested facet, as it is a facet within a facet.

With Elasticsearch, it is fairly painless to set up faceted search. First, you will need to have in your mapping a non-analysed version of each field you want to facet on:

{
    "category": {
        "type":     "string",
        "index":    "not_analyzed"
    }
}

We use the not_analyzed setting because at index time we want the field to be mapped as an exact field, so that later, the filter options (in this case categories) will appear exactly as they were indexed.

Now, at query time, we can append a terms aggregation to our query:

{
    "aggs" : {
        "categories" : {
            "terms" : { "field" : "category" }
        }
    }
}

Our response will now contain all the information we need about categories for the given query. Elasticsearch will give us a breakdown of counts for each type of category within the result set for our query:

 "aggregations" : {
        "categories" : {
            "doc_count_error_upper_bound": 0, 
            "sum_other_doc_count": 0, 
            "buckets" : [ 
                {
                    "key" : "Merchandise",
                    "doc_count" : 856
                },
                {
                    "key" : "Clothing",
                    "doc_count" : 455
                },
                ... etc ...
            ]
        }
    }

Now, when a user clicks on a category, such as Clothing, we usually want our search results to be filtered to display only clothing, however the facet counts for categories should remain unchanged – the Merchandise facet count should still be 856. To achieve this, we can use Elasticsearch filters instead of extending the query. In this example, we would append a terms filter on the category field, with the term ‘Clothing’. This will achieve the behaviour we want because filters are not considered when computing the facet counts – the search results will be filtered, but the facet counts will remain unchanged.
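
As a sketch (shown as a Python dict purely for illustration), the request for the ‘Clothing’ click might look something like this; post_filter is the mechanism that filters the hits after the aggregations have been computed, which is what leaves the facet counts untouched:

search_body = {
    "query": {
        # ... the main search query (e.g. the dis_max of common terms queries above) ...
    },
    "aggs": {
        "categories": {
            "terms": {"field": "category"}
        }
    },
    # Applied to the hits only, after aggregations are calculated.
    "post_filter": {
        "terms": {"category": ["Clothing"]}
    }
}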

Implementing Instant Search
Instant search is where the search engine assists you with your search while you type. There are several variants of this:

  1. Displaying products relevant to what the customer has typed so far
  2. Displaying search suggestions – predicting what the customer is going to type next (AKA auto-complete)
  3. Detecting spelling mistakes and suggesting corrections

It is actually possible to achieve all of the above with a single Elasticsearch query! We achieved this by using an n-grams analyzer for auto-complete, a shingle analyzer for search suggestions and Elasticsearch’s in-built term and phrase suggester for spelling correction. Check out my colleague’s post here for a complete example of how to achieve this.

Handling Distributed Search
Elasticsearch is an excellent example of a sophisticated distributed system which hides much of the inherent complexity from the user. Behind the scenes, problems like partitioning documents into shards, balancing shards across the cluster, replicating data to maintain fault-tolerance and efficiently routing requests between nodes are handled. All you have to do is configure a couple of settings in your elasticsearch.yaml – such as the number of shards to split each index into and the number of replicas to keep of each shard.

While configuring distributed search is pretty easy, there are some more complex issues which should be addressed. One of these issues is: how can we make a change to our mapping (i.e. the index) without causing any downtime to our search engine? For an e-commerce company, any form of downtime translates directly to a loss in revenue, and with our large product base, re-populating the search indices is a lengthy process which takes several hours. This problem can be solved using index aliases – an Elasticsearch feature which enables us to set up something similar to a symbolic link – an index alias which always points to a live and fully prepared index. For example, we can set up an alias, products, which points to a specific version of our products index:

PUT /products_v1/_alias/products

Now, we can make a change to our mapping and populate a new index, products_v2, wait until we are satisfied that all data has been indexed and shards balanced, before finally switching our alias to point to the new index.
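
Switching the alias over can then be done atomically with the _aliases endpoint; a sketch (host and index names are illustrative):

import requests

actions = {
    "actions": [
        {"remove": {"index": "products_v1", "alias": "products"}},
        {"add":    {"index": "products_v2", "alias": "products"}}
    ]
}
# Both actions happen in a single atomic operation, so searches against
# 'products' never see a half-built index.
requests.post("http://localhost:9200/_aliases", json=actions)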

There are some problems that come with the distributed nature of Elasticsearch. Imagine performing a search, getting 10 search results in the response, then refreshing the page and seeing 50 search results. How can search be non-deterministic?! Well, this is a problem that we encountered. This problem comes about as a result of Elasticsearch making optimisations and using approximate term frequencies to determine results. Each shard has a subset of documents in the index, and by default, Elasticsearch will use the shard’s term frequencies as an approximation for the actual term frequencies. When using the common terms query as described earlier, a term may fall under the threshold for being considered a stop word on one shard, but may be over the threshold on another shard. So, depending on which node a query gets routed to, we can end up with different results. Most of the time, this isn’t a problem as term frequencies should be very similar across all shards, providing there is enough data and the data is evenly distributed. But, if it does become a problem, full accuracy can be achieved by changing the default query type to dfs_query_then_fetch, by appending &search_type=dfs_query_then_fetch to the search URI. This query type performs an additional round-trip, collecting term frequencies from all nodes and calculating a global term frequency, before sending the query to all shards and computing results using the global frequencies. This ensures results are always accurate, but comes at the cost of some additional latency.

A similar problem can be seen in faceting. Facet counts are computed on each shard and then aggregated on the node designated as coordinator. If, say, our request is for the top 10 terms within a facet, then each node will return its locally computed top 10 facet elements. In cases where there are more than 10 terms, accuracy can be lost. To address this problem, a recent version of Elasticsearch introduced a shard_size attribute which can be set on the facet query, and specifies the number of elements each shard should return. This is separate from the size attribute – i.e. the number of elements we actually want. Asking each shard to return more elements is of course more expensive, but will give higher accuracy when it is needed.

Conclusions

  • It is hard to find a query which works well for every search. If there is a particular search found to yield bad results, it can be easy to optimise towards improving and fixing that search, but then other searches end up suffering as a result. When making changes to the search query, always think: will this work well for both general searches and specific searches?
  • Use filters for faceting, to filter search results without affecting facet counts. Also, Elasticsearch filters are (by default) cached, so can boost performance.
  • The three types of instant search: product suggestions, search suggestions and spelling corrections can be achieved with a single Elasticsearch query – providing the title field is configured with both a shingles and n-grams analyzer.
  • You should always A/B test whenever you make a change to the search experience. It can be invaluable to have good reporting on things like ‘searches which yield no results’ to easily catch problems introduced by changes to the query.
  • Use index aliases to make large changes while maintaining zero downtime.
  • There can be non-deterministic results with a distributed search engine, but with Elasticsearch these problems can be resolved at the cost of additional latency.
  • The search experience makes a big difference. It not only enables customers to discover the products they are looking for, but a well-tuned search experience can also help them discover things they weren’t explicitly searching for. We saw significant boosts in revenue from search every time we made improvements to the search engine.
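To expand on the faceting point above: in older Elasticsearch versions the top-level filter element (renamed post_filter in 1.0) is applied after facets have been computed, so the facet counts still reflect the full result set. A minimal sketch, with made-up field names:

import requests

# The query and facets see every matching product; only the returned hits are
# narrowed down to the 'games' category by the top-level filter.
body = {
    "query": {"match": {"title": "xbox"}},
    "filter": {"term": {"category": "games"}},
    "facets": {
        "categories": {"terms": {"field": "category", "size": 10}}
    }
}

response = requests.post("http://localhost:9200/products/_search", json=body)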

What Makes Search Engines Special?

Having spent over six months working as part of a small team to design and build a new search engine for one of Europe’s largest online retailers, I found myself learning a lot about the inner workings of modern search engines. It was my first real exposure to search – a technology which is really central to everyone’s online experiences.

Search is more complex than it may seem at first. An e-commerce search engine is built to enable customers to find the products that they are looking for. A customer will search for a product, perhaps by name, and the search engine will check the database of products and return those that match the search query. Perhaps you could write a search engine with a simple SQL query:

SELECT * FROM products WHERE product_title LIKE '%{search_query}%'

This may seem to yield decent results for some search queries – but there are many problems with such an approach.

  • Exact substring matches only. A search for ‘matrix dvd’ won’t match ‘The Matrix (1999) – DVD’.
  • No way to order the results. What if the query is ‘xbox’ and 100 products have ‘xbox’ in their title? Xbox games, Xbox accessories and the consoles themselves. Our query will find these products, but we have no way to order our results based on the most relevant matches.
  • Only searching a single field. What if we want to enable our customers to search for authors, directors, brands etc? Sure, we could change our query to look in multiple fields, but surely those fields shouldn’t be given the same importance as the product title, right?
  • No support for synonyms. A search for ‘football’ won’t match any products with ‘soccer’ in their title.
  • No stop word filtering. These are common words like ‘and’, ‘the’ and ‘of’. These words are usually not important in a search, especially in determining which products match a query. When we are searching just on product title, the negative effect of stop words is not so noticeable. But imagine if we want to extend our search engine to search over a larger body of text, such as a product description. In this case, most of the products in our database will contain stop words in the description, and it certainly doesn’t mean they should match the query.
  • No stemming. This is the process of truncating a word to its simplest form, in order to extract its root meaning. For example, ‘fisher’, ‘fishing’ and ‘fished’ all have similar meanings. These words could all be stemmed to ‘fish’, so that a search for any one of these terms will match all products related to fish.
  • Term frequencies are not considered. The frequency of a term in the corpus (i.e. across the entire database of products) can actually give us a lot of information about how important each term in a query is. Consider a search for ‘game of thrones’. There are three terms in this query, one of which is a stop word. Now imagine that we are a game retailer and all of our products are games. Many of the products in our database will match the term ‘game’ while only a few will match the term ‘thrones’. This is because the term frequency for ‘game’ across our entire database is a lot higher than it is for ‘thrones’. With this knowledge, we know that ‘thrones’ is the more important term – products which match this word are more relevant.
  • Typical SQL databases won’t scale. As our product base and the volume of queries per second grows, standard databases will quickly become slow – partly because they are not designed for full-text search, where we are searching bodies of text for partial matches.

All of these problems can be solved with ideas which have been developed in the field of information retrieval. There are two key concepts which give modern day search engines their speed and quality:

  1. Indexing – An index is a data structure which enables us to perform blazingly fast searches within our database for documents which match a query (in the e-commerce case, a document corresponds to some information describing a product). Without an index, we would have to look in every document and compare every term with our query – a slow and inefficient process. Every document that we want to be able to search over is indexed. To index a document, every field of the document is broken down into a set of tokens and these tokens are added to the index. This process is called tokenization. The way tokens are extracted will vary depending on what we want to achieve, but tokens are often stemmed words, with stop words omitted. Using our earlier example, if a document contains the word ‘fishing’, then perhaps the (stemmed) token ‘fish’ would be added to our index.
  2. Relevance Scoring – When executing a search query, the first step is to find all documents which match the query using our index. However, having found these documents, a search engine needs to give each document a score based on how well it matches the query – so that the most relevant results appear first in your search results. There are multiple ways to do relevance scoring, but one of the most common is known as TF/IDF (term frequency – inverse document frequency). The idea is that if a term from the query appears more times in a document, then that document should receive a higher score; but if that term appears many times in our corpus (i.e. it is a more common term), then the document should receive a lower score. Combining these two ideas, we can do some linear algebra to find the most similar documents to our query. For a detailed explanation of this, check out this article. A toy sketch of both ideas follows this list.
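To make the two concepts concrete, here is a toy (and deliberately naive) inverted index with TF/IDF scoring in Python. Real engines like Lucene are far more sophisticated – proper tokenization, stemming, normalised scoring, compressed index structures – but the overall shape is the same:

import math
from collections import Counter, defaultdict

# A toy corpus: product id -> title. Tokenization here is just lower-casing and
# splitting on whitespace; a real engine would also stem and drop stop words.
corpus = {
    1: "the matrix 1999 dvd",
    2: "xbox wireless controller",
    3: "game of thrones season one dvd",
}

def tokenize(text):
    return text.lower().split()

# Index time: map each token to the documents (and term counts) containing it.
inverted_index = defaultdict(dict)
for doc_id, title in corpus.items():
    for token, count in Counter(tokenize(title)).items():
        inverted_index[token][doc_id] = count

def search(query):
    """Score matching documents with a simple TF * IDF sum."""
    scores = defaultdict(float)
    for token in tokenize(query):
        postings = inverted_index.get(token, {})
        if not postings:
            continue
        # Rarer terms (appearing in fewer documents) get a higher IDF weight.
        idf = math.log(len(corpus) / float(len(postings)))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("matrix dvd"))  # the Matrix DVD (doc 1) outranks the other DVD (doc 3)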

In the diagram below I have tried my best to depict the processes of indexing (at index time) and document retrieval (at query time).

The processes of indexing and retrieval in a typical search engine

Today, two of the most widely used general purpose search frameworks are Apache Solr and Elasticsearch. Both are distributed search engines written on top of Apache Lucene, a high performance full-text search library which provides implementations of the above concepts. Solr is extremely mature and has long been the industry standard, but in the last few years Elasticsearch has received a lot of attention for a number of reasons: it is based on more modern principles, it is designed to deal with very large amounts of data, and, without the legacy constraints of Solr, the development community has been able to make very rapid progress. In our case (with our desire to always use the latest technologies), we decided to go with Elasticsearch.

In my next post, I will discuss some ideas, best practices and lessons learned from using Elasticsearch to build a search engine for e-commerce.

Emailing multiple inline images in Python

Hi all,

Yet again, sorry for the lack of blog posts. In my defence, I’ve been keeping busy what with the Mining of Massive Data Sets course on Coursera and a few Kaggle problems amongst other distractions. I legit have at least double-digit subscriber numbers so to those people, thanks! Ignoring all that, today I’m going to share with you something I had to puzzle out myself due to lack of information online about it. In sharing, hopefully the next poor schmuck who has to do this will just be able to copy what I’ve done.

We’ll look at building a fairly robust email: one which will feature multiple inline images and multiple attachments. We’ll do the whole thing using Gmail because it’s easy to use and everybody can create an account. The original version I wrote used Outlook and integrated Windows security, but not everybody has that, so Gmail it is. As a bit of background, I used this as part of a reporting pipeline that queries the Google Analytics API. I grab a whole bunch of data, filter and arrange it, combine it with another data source and produce some tables/graphs. These get saved down into a directory every day and my emailer will come along, pick up all the relevant images and hey presto – you’ve got a fairly cool reporting pipeline.

First things first, this is how you send lots of basic plain-text emails.


import smtplib

# credentials.csv holds two comma-separated lines, e.g. 'user,me@gmail.com'
# and 'pass,my_password'; we take the value after the comma on each line.
with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('smtp.gmail.com', 587)
    mailServer.ehlo()
    mailServer.starttls()  # upgrade the connection to TLS before logging in
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg)
    mailServer.quit()

for i in range(100):
    send_email('Hello everybody', gmail_user, gmail_pwd, ['test_recipient@email.com'])

Would the above code send 100 of the same email to the same person? Yes. Does that have the potential to be abused for perhaps-hilarious purposes? Of course. I’ll leave that up to you.

Next, let’s take a fairly easy step and attach a few files.


from email.mime.multipart import MIMEMultipart
from email.header         import Header
import smtplib
from email.MIMEBase import MIMEBase
from email import Encoders
import os

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def attach_file(filename):
    part = MIMEBase('application', 'octet-stream')
    part.set_payload(open(filename, 'rb').read())
    Encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment; filename=%s' % os.path.basename(filename))
    return part

def generate_email(gmail_user, to_list, data_path):
    msg = MIMEMultipart('related')
    msg['Subject'] = Header(u'Test Attachment Email', 'utf-8')
    msg['From'] = gmail_user
    msg['To'] = ','.join(to_list)
    msg.attach(attach_file(data_path))
    return msg

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('smtp.gmail.com', 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg.as_string())
    mailServer.quit()

email_msg = generate_email(gmail_user, ['recipient@email.com'], 'test_data.txt')
send_email(email_msg, gmail_user, gmail_pwd, ['recipient@email.com'])

All you need to do to allow multiple attachments is call msg.attach(attach_file(filename)) once for each file you’d like to attach.
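In other words, something like this (the file names are just examples):

# Attach every report produced for today's run.
for path in ['daily_report.csv', 'revenue_chart.png']:
    email_msg.attach(attach_file(path))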

Glorious – so we can email multiple people with multiple attachments. You could argue at this point that, if we can attach the images we’re looking to send then surely we don’t need to create inline images and so we can cease this madness. If you can argue that point successfully then you’re better at arguing than me. Something to do with Blackberry compatibility, looking nicer and “can’t you just not argue and do it for once – is it possible or is it not?”

So we’ll be embedding images in the above email – the way to do so is…


import cgi
import uuid
from email.mime.multipart import MIMEMultipart
from email.mime.text      import MIMEText
from email.mime.image     import MIMEImage
from email.header         import Header
import os
import smtplib
from email.MIMEBase import MIMEBase
from email import Encoders

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def attach_image(img_dict):
    # Read the image and reference it by the Content-ID used in the HTML body.
    with open(img_dict['path'], 'rb') as f:
        msg_image = MIMEImage(f.read(), name=os.path.basename(img_dict['path']))
    msg_image.add_header('Content-ID', '<{}>'.format(img_dict['cid']))
    return msg_image

def attach_file(filename):
    part = MIMEBase('application', 'octet-stream')
    part.set_payload(open(filename, 'rb').read())
    Encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment; filename=%s' % os.path.basename(filename))
    return part

def generate_email(gmail_user, to_list, data_path_1, data_path_2,img1,img2):
    msg = MIMEMultipart('related')
    msg['Subject'] = Header(u'Images and Words', 'utf-8')
    msg['From'] = gmail_user
    msg['To'] = ','.join(to_list)
    msg_alternative = MIMEMultipart('alternative')
    msg_text = MIMEText(u'Image not working - maybe next time', 'plain', 'utf-8')
    msg_alternative.attach(msg_text)
    msg.attach(msg_alternative)
    msg_html = u'<h1>Some images coming up</h1>'
    msg_html += u'<h3>Image 1</h3><div dir="ltr"><img src="cid:{cid}" alt="{alt}"><br></div>'.format(alt=cgi.escape(img1['title'], quote=True), **img1)
    msg_html += u'<h3>Image 2</h3><div dir="ltr"><img src="cid:{cid}" alt="{alt}"><br></div>'.format(alt=cgi.escape(img2['title'], quote=True), **img2)
    msg_html = MIMEText(msg_html, 'html', 'utf-8')
    msg_alternative.attach(msg_html)
    msg.attach(attach_image(img1))
    msg.attach(attach_image(img2))
    msg.attach(attach_file(data_path_1))
    msg.attach(attach_file(data_path_2))
    return msg

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('smtp.gmail.com', 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg.as_string())
    mailServer.quit()

img1 = dict(title = 'Image 1', path = 'test_image_1.png', cid = str(uuid.uuid4()))
img2 = dict(title = 'Image 2', path = 'test_image_2.png', cid = str(uuid.uuid4()))

email_msg = generate_email(gmail_user, ['recipient@email.com'], 'test_data.txt', 'test_data_2.txt', img1, img2)
send_email(email_msg, gmail_user, gmail_pwd, ['recipient@email.com'])

So all I need to do is make sure I’m downloading the right images and data files every day and putting them in the correct directory. Ordinarily I’d do that using wget on Unix, but as I’ve done the rest in Python I’ll include the download in the script. This way the script should work irrespective of OS – a handy little bonus.
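The download step itself only needs a few lines. A minimal sketch (Python 2, like the rest of the post; the URL and target directory are placeholders):

import os
import urllib

def download_image(url, target_dir):
    # Save the file into target_dir, keeping the name from the end of the URL.
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    filename = os.path.join(target_dir, os.path.basename(url))
    urllib.urlretrieve(url, filename)
    return filename

download_image('http://example.com/daily_chart.png', 'report_images')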

Get that run and you’ll see a lovely set of inline images, all the attachments as they were and we’re done. Nothing too complicated I hope and we’ve made a nice little ’email image aggregation program’ thingy.

As a possible extension I think this could be better used to annoy/amuse and so you could create an ’email digest’ – make something that automatically visits a few websites, downloads an image from each and then sticks them all inline in an email. It’ll also download a few data files and attach them. Let’s say I really like the NASA Astronomy picture of the day (I do), the XKCD comic (I do), the Google doodle (I do) and a CSV of the week’s weather forecast (I do not). I could easily set up a bash script (using wget) and a cronjob that would get the images/csv files each day and put them into a directory. Equally, if we wanted to play nicely with Windows as well as *nix, we could use Python to download the images. I’m not going to build this as it’s not really a problem that needs solving (as far as I see it).

Peace out yo.
