Archive for the ‘Data Science’ Category

Hadoop – examples of Java and Python Map and Reduce for the average word length

September 24, 2011

It’s been a while since my last post, which is a little disappointing. On the up side, the reason for my absence has not been because I am slothful, it is because I have started on a new project and I am deep in the world of ‘Big Data’ !!!!

Over the past three/four months I have been getting to grips with DB2, initially I couldn’t stand it but it is growing on me now. Especially when I switch back to MS SQL Server… But that is another conversation.

Excitingly on the horizon, in fact fast approaching, is Hadoop. So I have started playing with this technology, โ€œjoinedโ€ the Hadoop user group (well when I say join, I have started attending the meetings) and quite enjoying it all.

I thought I would share some of what I have learned so far, in the hope others will benefit. I am working through multiple exercises but I liked this one because it was fairly straight forward and yet showed you how to write your own Mapper and Reducer in Java and then in Python!

In the first instance I found an amazing blog / tutorial on setting up a Hadoop server and writing a mapper and reducer in Python. Both amazingly useful posts, really appreciate the publication of them.

Ok, lets get on with it. A simple exercise โ€“ the problem:

Write a MapReduce job that reads any Text input and computes the average length of all words that start with each character.

The example, we will use is:

Now is definitely the time

With the output looking like:

N 3
d 10
I 2
t 3.5

So I will provide you with two solutions for this, one in Java (the native language) and the other in Python.

I used Netbeans to write my Java code, in the first instance lets look at the main section, it is basically the same as the WordCount main you will see everywhere. In this case, yes I did hardcode the values โ€“ just for the sake of ease.

public static void main(String[] args)
        JobClient client = new JobClient();

        JobConf conf = new JobConf(AvgLen.class);

        FileInputFormat.addInputPath(conf, new Path("avg/text.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("avgout"));




        catch (Exception e)

One thing to point out here is the use of the DoubleWritable class, because we need to use doubes, or longs I suppose, as part of our calculation rather than ints.

Now we have the Mapper:

public class AvgLenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, DoubleWritable>
    private DoubleWritable oned = new DoubleWritable(1.0);
    private Text word = new Text();

    public void map(LongWritable l, Text t, OutputCollector o, Reporter r) throws IOException
        String line = t.toString();

        int counter = 0;
        for (int i = 0; i < line.length(); i++)

            if (line.charAt(i) == ' ')
                word.set((line.substring(counter, i)).substring(0,1));
                counter = i+1;
                o.collect(word, oned);


        o.collect(word, oned);

Here we just collect the first letter of each word and the length of the word in our OutputCollector. The contents of this would be:

N 3
d 10
I 2
t 3
t 4

This is ready to be reduced. So lets look at the reducer:

public class AvgLenReducer extends MapReduceBase
        implements Reducer<Text, DoubleWritable, Text, DoubleWritable>
    public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) throws IOException
        double sum = 0;
        double counter = 0.0;

            DoubleWritable value = (DoubleWritable);
            sum += value.get();
        output.collect(key, new DoubleWritable((sum/counter)));

Again, notice the use of DoubleWritable over IntWritable. Here we reduce down the resultset from the mapper. Pretty simple really until we get to t which would involve 7/2 which = 3.5

With this done, and compiled. We just execute it with

hadoop jar AvgLen a b

then we look at our output with:

hadoop fs -cat avgout/part-00000


So lets do the same but with Python!

Our Mapper:

#!/usr/bin/env python

import sys

for line in sys.stdin:

	line = line.strip()

	words = line.split()

	for word in words:

		print '%s\t%s' % (word[0:1], len(word))

Our Reducer:

#!/usr/bin/env python

from operator import itemgetter

import sys

current_word = None

current_count = 0.0

counter = 1.0

word = None

for line in sys.stdin:


	line = line.strip()

	word, count = line.split('\t', 1)


		count = int(count)

	except ValueError:


	if current_word == word:

		current_count += count

		counter = counter + 1


		current_count = current_count / counter

		if current_word:

			print '%s\t%s' % (current_word, current_count)

		current_count = count

		current_word = word

		counter = 1.0

if current_word == word:

	current_count = current_count / counter

	print '%s\t%s' % (current_word, current_count)

You can test your code, before involving hadoop with:

echo "Now is definitely the time" | /usr/local/hadoop/tmp/ | sort | /usr/local/hadoop/tmp/

And if you are happy with the result, submit it to Hadoop with:

hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /usr/local/hadoop/tmp/ -mapper /usr/local/hadoop/tmp/  -file /usr/local/hadoop/tmp/ -reducer /usr/local/hadoop/tmp/ -input /user/hduser/avg/text.txt -output /user/hduser/pythonAvgTest

This is command just rolls off the keyboard doesn’t it ๐Ÿ™‚

Note I uploaded the file text.txt which contained the test text.

I hope this proves useful.

Scraping stats off of the BBC News website

March 28, 2011

A little project I have started, based on an interesting thought I had a while ago. A friend of mine commented, sometime ago, that he uses the Top 10 most Read as a form of navigation around the BBC news website. However, I typically go to the main page, glance over the stories and then head straight for the business news index (aka page), and only after reading these stories would I consider utilising the Top 10 most read.

Ever since he said it, and post my working at the Beeb (albeit only for a short while), I have always wondered about the correlation between the stories in the Top 10 and the position of those stories on indexes on the BBC News Website. As this question has been burning a hole in my brain I decided to try and address the question. Being a mere mortal, I don’t have access to the statistics of the BBC news website, so I decided I should try and source them myself.

After doing a bit of research I designed a very simple Java application to parse each of the main indexes on the BBC New website (that is: Main, World, UK, England, Northern Ireland, Scotland, Wales, Business, Politics, Health, Education, Sci/Environment, Technology, Entertainment and Arts), and then for each index I would ascertain details of each story in each position, i.e.:

Top Story, Second Story, Third Story, Other Stories, Features and Analysis.

For each of these, I would get the URL and Headline so allowing me to save this, along with the index and position of the story, to my database for further analysis.

With all this complete, all it left me to do was get the Top 10 most read. I decided I would do this by loading the Top Story on the Main index and then retrieving the Top 10 and storing this to the database (so reflecting how my friend would start browsing the site).

To complete this task I used HtmlCleaner, an open source parser written in Java. It is extremely easy to use and learn. Then, with all my testing complete. I scheduled the job to run at the top of each hour for 12 hours to give me my first dataset (12 Top Tens, and 12 instances of where each story is on the indexes previously listed).

Now things get really exciting (this is data from 6th March, yes it is a while ago but news is always changing and I have been rather busy), here are my initial stats:

Regarding the placing of stories, each index in the time period had a top, second and third story, so the average is 1. However the features and analysis averaged at 6.7546 stories per index, and other stories came in at 9.4249.

With this in mind, have a look at the distribution of story placing and where it featured in the top ten over said period:

As you can see, Other Stories was in the top ten for every position across the time period.

Now, what about the indexes?

Now, I always expected the Main index (i.e. the front page) to be the clear leader; It is the leader, but joint leader with Politics and UK (which is good).

Finally lets look at the index and position of stories:

It is interesting that Top Story is not the clear lead on each index, although there are X other stories and Y feature stories so this increases it advantage in this graph. I will deal with this when I get time to perform further analysis.

Finally, a very basic attempt at a word cloud from the headlines for the time period:

Data Scientist, and the Strata conference

December 20, 2010

Just a quick one, came across an advert for the Strata Conference earlier today on the O’Reilly site. This is a conference I would dearly love to go to, however it is a bit too expensive for me at the moment (including travel).

In the mean time I came across a couple of very interesting resources related to this conference:

There is a nice article What is Data Science, which gives you a good overview of the field and what people are doing. I have read bits before, but other bits not. As ever I felt quite inspired, more new things to learn!!


Machine Learning: A Love Story – By Hilary Mason. It is an hour long, but it just flies by. I really enjoyed it (she should turn it into a book) – her blog can be found here. Also, she is attending Strata.

Like I say, just a quick one. I just wanted to share.