Archive for the ‘Data’ Category

When is Tower Bridge opening

January 20, 2018

One of the traits that I believe makes a good product manager, beyond knowing your customer(s), is being able to recognise a real problem. For a while now I have had a “problem” that has bugged me and that I have wanted to solve for myself (I see the customers of this product as those who suffer the same problem I do).

What is my problem?

I commute by bicycle across London. I live in South East London and I work north of the river, which means I have a choice of bridges to cross, from Waterloo down to Tower Bridge. In the morning I am more of a creature of habit and typically cross Southwark Bridge; in the evenings, however, I like to mix it up a bit, and as I leave my place of work I try to decide which bridge to cross. Tower Bridge is actually one of my favourite bridges (from the choice) to cross, but it carries a risk: will it be open when I arrive? This is definitely a first world problem, but if the bridge is open you don’t find out until the approach, at which point it is too late to turn around and take another bridge (I know, I have tried), and you have to sit and wait for the operation to complete. Whilst it is nice to witness, it adds at least 20 minutes to the commute (effectively doubling it) and compresses all of the traffic, making it more dangerous.

My problem, fundamentally, is that I need to know, as I leave the office, whether Tower Bridge is due to open that evening; if so, I will take another bridge. I always check my phone before I leave the office, so a notification would alert me to this fact and let me make an informed decision.

The research

So the first bit of research: can you get the times of the openings on the web? The simple answer is yes, Tower Bridge kindly provide this on their website; a quick Google search gave me this information.

Next up: do I need to scrape this data and then offer it as an API service, or am I lucky, and has someone already done this? Well, another Google search and I have an answer: there is already a RESTful service where I can get the time and date of the next lift (as well as the vessel name and direction of travel)! This work was completed by Andre Parmeggiani; you can see his other work on his GitHub page. I am very grateful that he did this work.
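
To give a flavour of what consuming such a service looks like, here is a minimal Python sketch. The response shape and field names (`lifts`, `timestamp`, `vessel`, `direction`) are my own assumptions for illustration, not the actual schema of the API:

```python
import json
from datetime import datetime

def parse_next_lift(payload):
    """Extract the next bridge lift from a JSON response.

    The field names here are hypothetical -- check the real API
    for its actual schema.
    """
    data = json.loads(payload)
    lift = data["lifts"][0]  # assume lifts are sorted soonest-first
    return {
        "when": datetime.strptime(lift["timestamp"], "%Y-%m-%dT%H:%M"),
        "vessel": lift["vessel"],
        "direction": lift["direction"],
    }

sample = ('{"lifts": [{"timestamp": "2018-01-22T17:45", '
          '"vessel": "Paddle Steamer Dixie Queen", "direction": "up river"}]}')
next_lift = parse_next_lift(sample)
print(next_lift["vessel"])
```

In a real app you would fetch the payload over HTTP rather than hardcoding it, but the parsing step is the same.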

So that is it: everything is perfect, I have the foundations in place to build the product (something I wanted to do for myself), and I had a week’s break over the Christmas period. Time to break open the Head First Android Development book and start learning how to code up an app.

The Product

Well, the simple solution is an MVP in the shape of a mobile app that tells me when the next opening is. My MVP is to consist of:

  • I can interact with the app to check when the next opening of the bridge is
  • I get an alert within 4 hours of the next opening of the bridge, so that when I check my phone before setting off I will see the alert and can make an informed decision about the route to take.
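
The alert logic behind that second point boils down to a simple time-window check. A minimal sketch (the function name and the four-hour window are mine, matching the MVP above):

```python
from datetime import datetime, timedelta

ALERT_WINDOW = timedelta(hours=4)

def should_alert(next_lift, now):
    """True when the next bridge lift is still in the future
    but no more than four hours away."""
    return now <= next_lift <= now + ALERT_WINDOW

# Leaving the office at 17:30 with a lift due at 19:00 -> take another bridge
print(should_alert(datetime(2018, 1, 22, 19, 0), datetime(2018, 1, 22, 17, 30)))
```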


Now I will admit, I think I might have gone a little too heavy on the “minimum” for the interface, as I went as basic as possible (I only had a limited amount of time). However, I think this MVP is viable: testing showed I can interact with the app, get the next opening information, and receive alerts / notifications. I have also become a bore, as I can tell you when the next opening is…. 🙂

I am pleased to announce the app is available from the Play Store; next up is to write the same app for the iPhone.

Please help yourself, and if you like it please rate it; I am happy to take suggestions.

And finally, thank yous

I would like to thank Andre Parmeggiani for his API.

Also, I would like to thank Nicolas de Camaret for his Flickr post, which I used as the background for the app – I really appreciate it.


Hadoop – examples of Java and Python Map and Reduce for the average word length

September 24, 2011

It’s been a while since my last post, which is a little disappointing. On the up side, the reason for my absence is not that I am slothful; it is because I have started on a new project and I am deep in the world of ‘Big Data’!

Over the past three or four months I have been getting to grips with DB2; initially I couldn’t stand it, but it is growing on me now, especially when I switch back to MS SQL Server… But that is another conversation.

Excitingly on the horizon, in fact fast approaching, is Hadoop. So I have started playing with this technology, “joined” the Hadoop user group (well, when I say joined, I have started attending the meetings) and am quite enjoying it all.

I thought I would share some of what I have learned so far, in the hope others will benefit. I am working through multiple exercises, but I liked this one because it is fairly straightforward and yet shows you how to write your own Mapper and Reducer in Java and then in Python!

In the first instance I found an amazing blog / tutorial on setting up a Hadoop server and writing a mapper and reducer in Python. Both amazingly useful posts; I really appreciate their publication.

Ok, let’s get on with it. A simple exercise – the problem:

Write a MapReduce job that reads any Text input and computes the average length of all words that start with each character.

The example we will use is:

Now is definitely the time

With the output looking like:

N 3
d 10
i 2
t 3.5
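
This expected output is easy to sanity-check with a few lines of plain Python first. This is my own check, separate from the MapReduce solutions below:

```python
from collections import defaultdict

def average_word_lengths(text):
    """Average length of the words starting with each character."""
    totals = defaultdict(lambda: [0, 0])  # first char -> [sum of lengths, count]
    for word in text.split():
        entry = totals[word[0]]
        entry[0] += len(word)
        entry[1] += 1
    return {char: s / n for char, (s, n) in totals.items()}

print(average_word_lengths("Now is definitely the time"))
# {'N': 3.0, 'i': 2.0, 'd': 10.0, 't': 3.5}
```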

So I will provide you with two solutions, one in Java (the native language) and the other in Python.

I used NetBeans to write my Java code. In the first instance let’s look at the main section; it is basically the same as the WordCount main you will see everywhere. In this case, yes, I did hardcode the input and output paths – just for the sake of ease.

public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(AvgLen.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(conf, new Path("avg/text.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("avgout"));

        conf.setMapperClass(AvgLenMapper.class);
        conf.setReducerClass(AvgLenReducer.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
}
One thing to point out here is the use of the DoubleWritable class, because we need to use doubles (or longs, I suppose) as part of our calculation, rather than ints.

Now we have the Mapper:

public class AvgLenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, DoubleWritable> {

    private DoubleWritable wordLength = new DoubleWritable();
    private Text word = new Text();

    public void map(LongWritable l, Text t,
            OutputCollector<Text, DoubleWritable> o, Reporter r)
            throws IOException {
        String line = t.toString();

        int counter = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ' ') {
                word.set(line.substring(counter, i).substring(0, 1));
                wordLength.set(i - counter);
                o.collect(word, wordLength);
                counter = i + 1;
            }
        }
        // don't forget the last word on the line
        word.set(line.substring(counter).substring(0, 1));
        wordLength.set(line.length() - counter);
        o.collect(word, wordLength);
    }
}

Here we collect the first letter of each word as the key and the length of the word as the value in our OutputCollector. The contents of this would be:

N 3
d 10
i 2
t 3
t 4

This is ready to be reduced. So let’s look at the reducer:

public class AvgLenReducer extends MapReduceBase
        implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterator<DoubleWritable> values,
            OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        double sum = 0;
        double counter = 0.0;

        while (values.hasNext()) {
            sum += values.next().get();
            counter++;
        }
        output.collect(key, new DoubleWritable(sum / counter));
    }
}

Again, notice the use of DoubleWritable over IntWritable. Here we reduce the resultset from the mapper. Pretty simple really, until we get to t, which involves (3 + 4) / 2 = 3.5.

With this done, and compiled, we just execute it with:

hadoop jar AvgLen a b

then we look at our output with:

hadoop fs -cat avgout/part-00000


So let’s do the same but with Python!

Our Mapper:

#!/usr/bin/env python

import sys

# read lines from stdin and emit "first letter <tab> word length" pairs
for line in sys.stdin:
	line = line.strip()
	words = line.split()
	for word in words:
		print '%s\t%s' % (word[0:1], len(word))

Our Reducer:

#!/usr/bin/env python

import sys

current_word = None
current_count = 0.0
counter = 1.0
word = None

for line in sys.stdin:
	line = line.strip()
	word, count = line.split('\t', 1)

	try:
		count = int(count)
	except ValueError:
		# count was not a number, so silently discard this line
		continue

	if current_word == word:
		current_count += count
		counter = counter + 1
	else:
		if current_word:
			print '%s\t%s' % (current_word, current_count / counter)
		current_count = count
		current_word = word
		counter = 1.0

# output the average for the last word
if current_word == word:
	print '%s\t%s' % (current_word, current_count / counter)
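
Before submitting anything, the mapper / sort / reducer pipeline can also be simulated as ordinary functions. This is my own sketch of the same logic, not part of the streaming job itself:

```python
from itertools import groupby

def map_line(line):
    """Mapper: emit a (first letter, word length) pair for every word."""
    return [(word[0], len(word)) for word in line.split()]

def reduce_pairs(pairs):
    """Reducer: average the lengths seen for each first letter.

    Mirrors Hadoop streaming by sorting, then grouping by key."""
    results = {}
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        lengths = [length for _, length in group]
        results[key] = sum(lengths) / len(lengths)
    return results

pairs = map_line("Now is definitely the time")
print(reduce_pairs(pairs))
# {'N': 3.0, 'd': 10.0, 'i': 2.0, 't': 3.5}
```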

You can test your code, before involving Hadoop, with:

echo "Now is definitely the time" | /usr/local/hadoop/tmp/ | sort | /usr/local/hadoop/tmp/

And if you are happy with the result, submit it to Hadoop with:

hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /usr/local/hadoop/tmp/ -mapper /usr/local/hadoop/tmp/  -file /usr/local/hadoop/tmp/ -reducer /usr/local/hadoop/tmp/ -input /user/hduser/avg/text.txt -output /user/hduser/pythonAvgTest

This command just rolls off the keyboard, doesn’t it 🙂

Note: I uploaded the file text.txt, which contained the test text.

I hope this proves useful.

Further analysis of the scraped BBC data

May 4, 2011

Further to my previous post, I have been mining the data I collected to see if I could find anything of interest, and to evaluate the data itself.

One of the things that interested me was to look at the placement of a story compared to its position in the top ten over time. So I set my program to scrape the data every hour from 9AM through to 9PM; I didn’t want too much data to start with. I then spent a while writing SQL statements to capture the statistics I was looking for over the time period. Below you can see a simple line graph (it is in logarithmic scale):

I produced this using Open Office.

I find it quite interesting that the third position on the indexes doesn’t get utilised so much. You could also suggest that as the day gets later, more people start reading other stories, i.e. features, second, and other stories. However, one can’t jump to any conclusions, for two reasons which jump out at me:

1 – This sample set is way too small!
2 – I think I need more detail; it would be interesting to know (for example) which positions of the features and analysis stories are being selected. The same goes for the other stories.

I decided I would cobble together a simple heat map of the index, aggregated over the time period, to see the “behaviour”. Feel free to take a look; personally I feel it reinforces my previous points (I used JavaFX Script 1.3 via NetBeans to produce this):

I think I might go back to the scraper program and see if I can expand the data gathered.

Scraping stats off of the BBC News website

March 28, 2011

A little project I have started, based on an interesting thought I had a while ago. A friend of mine commented, some time ago, that he uses the Top 10 Most Read as a form of navigation around the BBC News website. However, I typically go to the main page, glance over the stories, and then head straight for the business news index (aka page); only after reading these stories would I consider utilising the Top 10 Most Read.

Ever since he said it, and since my time working at the Beeb (albeit only for a short while), I have always wondered about the correlation between the stories in the Top 10 and the position of those stories on the indexes of the BBC News website. As this question has been burning a hole in my brain, I decided to try to address it. Being a mere mortal, I don’t have access to the statistics of the BBC News website, so I decided I should try to source them myself.

After doing a bit of research I designed a very simple Java application to parse each of the main indexes on the BBC News website (that is: Main, World, UK, England, Northern Ireland, Scotland, Wales, Business, Politics, Health, Education, Sci/Environment, Technology, and Entertainment and Arts), and then for each index ascertain details of the story in each position, i.e.:

Top Story, Second Story, Third Story, Other Stories, Features and Analysis.

For each of these I would get the URL and headline, allowing me to save this, along with the index and position of the story, to my database for further analysis.

With all this complete, all that was left to do was get the Top 10 Most Read. I decided I would do this by loading the Top Story on the Main index and then retrieving the Top 10 and storing it to the database (so reflecting how my friend would start browsing the site).

To complete this task I used HtmlCleaner, an open source parser written in Java; it is extremely easy to use and learn. Then, with all my testing complete, I scheduled the job to run at the top of each hour for 12 hours to give me my first dataset (12 Top Tens, and 12 instances of where each story sits on the indexes previously listed).
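
My scraper was written in Java with HtmlCleaner, but the idea translates to any language. As an illustration, here is a rough Python sketch using the standard library’s html.parser; the markup it expects (an anchor with a "story" class) is invented for the example, not the BBC’s actual page structure:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect (url, headline) pairs from anchors marked with a
    hypothetical 'story' class -- the real page markup differs."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "story" in attrs.get("class", ""):
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.stories.append((self._href, "".join(self._text).strip()))
            self._href = None

sample = '<div><a class="story" href="/news/1">Top story headline</a></div>'
parser = HeadlineParser()
parser.feed(sample)
print(parser.stories)
```

The same pattern, pointed at each index page in turn, yields the URL and headline pairs described above.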

Now things get really exciting (this is data from the 6th of March; yes, it is a while ago, but news is always changing and I have been rather busy). Here are my initial stats:

Regarding the placing of stories: each index in the time period had a top, second and third story, so the average is 1. However, features and analysis averaged 6.7546 stories per index, and other stories came in at 9.4249.

With this in mind, have a look at the distribution of story placing and where it featured in the top ten over said period:

As you can see, Other Stories was in the top ten for every position across the time period.

Now, what about the indexes?

Now, I always expected the Main index (i.e. the front page) to be the clear leader. It is the leader, but joint leader with Politics and UK (which is good).

Finally, let’s look at the index and position of stories:

It is interesting that Top Story is not the clear leader on each index, although there are X other stories and Y feature stories, so this increases its advantage in this graph. I will deal with this when I get time to perform further analysis.

Finally, a very basic attempt at a word cloud from the headlines for the time period:

Data Scientist, and the Strata conference

December 20, 2010

Just a quick one: I came across an advert for the Strata Conference earlier today on the O’Reilly site. This is a conference I would dearly love to go to; however, it is a bit too expensive for me at the moment (including travel).

In the meantime I came across a couple of very interesting resources related to this conference:

There is a nice article, What is Data Science, which gives you a good overview of the field and what people are doing. I had read bits of it before, but other bits not. As ever I felt quite inspired; more new things to learn!!


Machine Learning: A Love Story, by Hilary Mason. It is an hour long, but it just flies by. I really enjoyed it (she should turn it into a book); her blog can be found here. Also, she is attending Strata.

Like I say, just a quick one; I just wanted to share.