Archive for the ‘Data’ Category

Hadoop – examples of Java and Python Map and Reduce for the average word length

September 24, 2011

It’s been a while since my last post, which is a little disappointing. On the up side, my absence is not down to sloth: I have started on a new project and am now deep in the world of ‘Big Data’!

Over the past three or four months I have been getting to grips with DB2. Initially I couldn’t stand it, but it is growing on me now, especially when I switch back to MS SQL Server… But that is another conversation.

Excitingly on the horizon, in fact fast approaching, is Hadoop. So I have started playing with the technology, “joined” the Hadoop user group (well, when I say joined, I have started attending the meetings) and am quite enjoying it all.

I thought I would share some of what I have learned so far, in the hope others will benefit. I am working through multiple exercises, but I liked this one because it is fairly straightforward and yet shows you how to write your own Mapper and Reducer in Java and then in Python!

In the first instance I found an excellent pair of blog tutorials, one on setting up a Hadoop server and one on writing a mapper and reducer in Python. Both are hugely useful posts, and I really appreciate them being published.

OK, let’s get on with it. A simple exercise – the problem:

Write a MapReduce job that reads any Text input and computes the average length of all words that start with each character.

The example we will use is:

Now is definitely the time

With the output looking like:

N 3
d 10
i 2
t 3.5

So I will provide two solutions, one in Java (Hadoop’s native language) and the other in Python.

I used NetBeans to write my Java code. In the first instance, let’s look at the main section; it is basically the same as the WordCount main you will see everywhere. In this case, yes, I did hardcode the input and output paths – just for the sake of ease.

public static void main(String[] args)
    {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(AvgLen.class);

        // The reducer emits (first letter, average length), so the output value
        // type is DoubleWritable rather than the usual IntWritable.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DoubleWritable.class);

        // Input and output paths are hardcoded purely for convenience.
        FileInputFormat.addInputPath(conf, new Path("avg/text.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("avgout"));

        conf.setMapperClass(AvgLenMapper.class);
        conf.setReducerClass(AvgLenReducer.class);
        // Reusing the reducer as a combiner is a shortcut: averaging is not
        // associative, so strictly speaking this line should be removed (or the
        // combiner rewritten to emit sums and counts).
        conf.setCombinerClass(AvgLenReducer.class);

        client.setConf(conf);

        try
        {
            JobClient.runJob(conf);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        System.out.println("Done!");
    }

One thing to point out here is the use of the DoubleWritable class: the averages can be fractional (3.5, for example), so we need doubles rather than ints as part of our calculation.

Now we have the Mapper:

public class AvgLenMapper extends MapReduceBase
        implements Mapper&lt;LongWritable, Text, Text, DoubleWritable&gt;
{
    // Reused writables: the current word's length and its first letter.
    private DoubleWritable oned = new DoubleWritable(1.0);
    private Text word = new Text();

    public void map(LongWritable l, Text t,
                    OutputCollector&lt;Text, DoubleWritable&gt; o,
                    Reporter r) throws IOException
    {
        String line = t.toString();

        // 'counter' marks the start index of the current word.
        int counter = 0;

        for (int i = 0; i < line.length(); i++)
        {
            if (line.charAt(i) == ' ')
            {
                // Emit (first letter, word length) for the word that just
                // ended, skipping empty tokens caused by repeated spaces.
                if (i > counter)
                {
                    word.set(line.substring(counter, counter + 1));
                    oned.set(i - counter);
                    o.collect(word, oned);
                }
                counter = i + 1;
            }
        }

        // Don't forget the final word on the line (no trailing space).
        if (counter < line.length())
        {
            word.set(line.substring(counter, counter + 1));
            oned.set(line.length() - counter);
            o.collect(word, oned);
        }
    }
}

Here we just collect the first letter of each word and the length of the word in our OutputCollector. Sorted by key, the contents would be:

N 3
d 10
i 2
t 3
t 4

This is ready to be reduced. So let’s look at the reducer:

public class AvgLenReducer extends MapReduceBase
        implements Reducer&lt;Text, DoubleWritable, Text, DoubleWritable&gt;
{
    public void reduce(Text key, Iterator&lt;DoubleWritable&gt; values,
                       OutputCollector&lt;Text, DoubleWritable&gt; output,
                       Reporter reporter) throws IOException
    {
        double sum = 0.0;      // total length of all words starting with this letter
        double counter = 0.0;  // how many words start with this letter

        while (values.hasNext())
        {
            sum += values.next().get();
            counter++;
        }

        // Emit the average word length for this starting letter.
        output.collect(key, new DoubleWritable(sum / counter));
    }
}

Again, notice the use of DoubleWritable rather than IntWritable. Here we reduce the result set from the mapper. It is pretty simple really, until we get to ‘t’, which involves (3 + 4) / 2 = 3.5.
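One step that is easy to forget: the input path in main is an HDFS path, so the test file has to be uploaded before the job will run. Assuming text.txt sits in your local working directory, something along these lines should do it (the relative path resolves under your HDFS home directory):

hadoop fs -mkdir avg
hadoop fs -put text.txt avg/text.txt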

With this done and compiled, we just execute it (the two trailing arguments are ignored, since the paths are hardcoded) with:

hadoop jar AvgLen a b

then we look at our output with:

hadoop fs -cat avgout/part-00000

Nice.

So let’s do the same, but with Python!

Our Mapper:

#!/usr/bin/env python

import sys

# Read lines from standard input and, for every word, emit a tab-separated
# pair of (first letter, word length) for the reducer to consume.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word[0:1], len(word))

Our Reducer:

#!/usr/bin/env python

import sys

# Hadoop streaming sorts the mapper output by key before it reaches us,
# so all of the lengths for a given first letter arrive together.
current_word = None
current_count = 0.0
counter = 1.0
word = None

for line in sys.stdin:
    line = line.strip()

    # Each input line is "<first letter>\t<word length>".
    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        # Silently skip lines where the count is not a number.
        continue

    if current_word == word:
        current_count += count
        counter = counter + 1
    else:
        # The key has changed: emit the average for the previous letter.
        current_count = current_count / counter
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
        counter = 1.0

# Emit the average for the final letter.
if current_word == word:
    current_count = current_count / counter
    print '%s\t%s' % (current_word, current_count)

You can test your code, before involving Hadoop, with:

echo "Now is definitely the time" | /usr/local/hadoop/tmp/MyAvgMapper.py | sort | /usr/local/hadoop/tmp/MyAvgReducer.py

And if you are happy with the result, submit it to Hadoop with:

hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /usr/local/hadoop/tmp/MyAvgMapper.py -mapper /usr/local/hadoop/tmp/MyAvgMapper.py  -file /usr/local/hadoop/tmp/MyAvgReducer.py -reducer /usr/local/hadoop/tmp/MyAvgReducer.py -input /user/hduser/avg/text.txt -output /user/hduser/pythonAvgTest

This command just rolls off the keyboard, doesn’t it 🙂

Note that I had already uploaded the file text.txt, which contains the test text, to /user/hduser/avg/text.txt in HDFS.

I hope this proves useful.

Further analysis of the scraped BBC data

May 4, 2011

Further to my previous post, I have been mining the data I collected to see if I could find anything of interest, and to evaluate the quality of what I have gathered.

One of the things that interested me was to look at the placement of each story compared with its position in the top ten over time. So I set my program off scraping the data every hour from 9AM through to 9PM – I didn’t want too much data to start with. I then spent a while writing SQL statements to capture the statistics I was looking for over the time period, and below you can see a simple line graph (on a logarithmic scale):

I produced this graph using OpenOffice.

I find it quite interesting that the third position on the indexes doesn’t get used so much. You could also suggest that, as the day gets later, more people start reading the other placements, i.e. Features and Analysis, Second Story and Other Stories. However, one can’t jump to any conclusions, for two reasons which stand out to me:

1 – This sample set is way too small!
2 – I think I need more detail; it would be interesting to know (for example) which positions within Features and Analysis are being selected. The same goes for Other Stories.

I decided I would cobble together a simple heat map of the indexes, aggregated over the time period, to see the “behaviour”. Feel free to take a look; personally I feel it reinforces my previous points (I used JavaFX Script 1.3 via NetBeans to produce this):

I think I might go back to the scraper program and see if I can expand the data gathered.

Scraping stats off of the BBC News website

March 28, 2011

This is a little project I have started, based on an interesting thought I had a while ago. A friend of mine commented, some time ago, that he uses the Top 10 Most Read as a form of navigation around the BBC News website. However, I typically go to the main page, glance over the stories and then head straight for the Business news index (aka page), and only after reading those stories would I consider using the Top 10 Most Read.

Ever since he said it, and post my working at the Beeb (albeit only for a short while), I have always wondered about the correlation between the stories in the Top 10 and the positions of those stories on the indexes of the BBC News website. As this question has been burning a hole in my brain, I decided to try to address it. Being a mere mortal, I don’t have access to the BBC News website’s statistics, so I decided to try to source them myself.

After doing a bit of research I designed a very simple Java application to parse each of the main indexes on the BBC News website (that is: Main, World, UK, England, Northern Ireland, Scotland, Wales, Business, Politics, Health, Education, Sci/Environment, Technology, Entertainment and Arts), and then for each index I would ascertain details of each story in each position, i.e.:

Top Story, Second Story, Third Story, Other Stories, Features and Analysis.

For each of these I would get the URL and headline, allowing me to save them, along with the index and the position of the story, to my database for further analysis.

With all this complete, all that was left to do was to get the Top 10 Most Read. I decided I would do this by loading the Top Story on the Main index, then retrieving the Top 10 and storing it in the database (reflecting how my friend would start browsing the site).

To complete this task I used HtmlCleaner, an open-source HTML parser written in Java; it is extremely easy to use and learn. Then, with all my testing complete, I scheduled the job to run at the top of each hour for 12 hours to give me my first dataset (12 Top Tens, and 12 snapshots of where each story sat on the indexes listed above).
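To give a flavour of how little code HtmlCleaner needs, here is a minimal sketch along the lines of what the scraper does – not my actual program. The URL and the crude “looks like a story link” filter are illustrative assumptions; the real scraper matched the specific markup of each index and position.

import java.net.URL;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class IndexScrapeSketch
{
    public static void main(String[] args) throws Exception
    {
        // Parse one index page straight from the web into a DOM-like tree.
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(new URL("http://www.bbc.co.uk/news/business/"));

        // Walk every anchor on the page and keep the ones that look like stories.
        for (TagNode a : root.getElementsByName("a", true))
        {
            String href = a.getAttributeByName("href");
            String headline = a.getText().toString().trim();

            if (href != null && href.contains("/news/") && headline.length() > 0)
            {
                // In the real program this pair, plus the index name and the
                // story's position, is what gets written to the database.
                System.out.println(headline + "\t" + href);
            }
        }
    }
}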

Now things get really exciting (this is data from 6th March – yes, it is a while ago, but news is always changing and I have been rather busy). Here are my initial stats:

Regarding the placing of stories, each index in the time period had a top, second and third story, so the average for each of those is 1. However, Features and Analysis averaged 6.7546 stories per index, and Other Stories came in at 9.4249.

With this in mind, have a look at the distribution of story placing and where it featured in the top ten over said period:

As you can see, Other Stories was in the top ten for every position across the time period.

Now, what about the indexes?

Now, I always expected the Main index (i.e. the front page) to be the clear leader; it is a leader, but jointly with Politics and UK (which is good).

Next, let’s look at the index and position of stories:

It is interesting that Top Story is not the clear leader on each index, although there are X Other Stories and Y Features and Analysis stories per index, which increases their advantage in this graph. I will deal with this when I get time to perform further analysis.

Finally, a very basic attempt at a word cloud from the headlines for the time period:

Data Scientist, and the Strata conference

December 20, 2010

Just a quick one: I came across an advert for the Strata Conference earlier today on the O’Reilly site. This is a conference I would dearly love to go to; however, it is a bit too expensive for me at the moment (including travel).

In the meantime I came across a couple of very interesting resources related to this conference:

There is a nice article, What is Data Science?, which gives a good overview of the field and what people are doing. I had read bits of it before, but not others. As ever, I felt quite inspired – more new things to learn!

AND….

Machine Learning: A Love Story – a talk by Hilary Mason. It is an hour long, but it just flies by. I really enjoyed it (she should turn it into a book) – her blog can be found here. Also, she is attending Strata.

Like I say, just a quick one. I just wanted to share.

A treemap representing the amount spent on Housing benefit

November 14, 2010

A colleague and I recently requested “Housing benefit cost data for every local authority within the UK (England, Scotland, Wales and NI), including the amount per local authority broken down into local authority and private properties per local authority. Specifically the maximum, minimum, mean, median, quartile 1, quartile 3, 10th percentile and 90th percentile for each local authority and private properties within each local authority.” via a Freedom of Information request to the Department for Work and Pensions.

Of course, housing benefit – well, all benefits – is in the news right now, so this is almost timely (maybe a few weeks out, but I have been really busy with my day job!).

The response to our request was “As you have specifically requested cost information by the mean, median, quartile 1, quartile 3, 10th percentile and 90th percentile we have interrogated our databases and produced the attached analysis which is based on weekly award of Housing Benefit at March 2010. Further information on Housing Benefit is available on the DWP website at: http://research.dwp.gov.uk/asd/hbctb.asp . In addition, Housing Benefit expenditure by local authority is available at: http://research.dwp.gov.uk/asd/asd4/expenditure.asp

The information you requested for maximum and minimum is being withheld as it falls under the exemption in section 40 of the Freedom of Information Act. This exemption covers personal information about a third party and if provided could lead to a reasonable chance of individual claimants and their families being identified. This would breach the families’ right to privacy contrary to the Data Protection Act. ”

We thought it might be interesting to view this data as a treemap; it gives a good overview of the data, allowing you to explore different areas and see which have the highest and lowest claims at each level. Each box represents the average weekly amount of housing benefit for the specified area, in sterling.

As before, I produced the visualisation in Processing, utilising the functionality I learnt about in Ben Fry’s book Visualizing Data. You can see my other example of a treemap here.
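For anyone curious about the basic idea behind a treemap, here is a small, self-contained sketch – plain Java rather than my Processing code, and with made-up figures rather than the DWP data – of the simplest slice-and-dice layout, where each box’s size is proportional to its value.

import java.util.LinkedHashMap;
import java.util.Map;

public class TreemapSketch
{
    public static void main(String[] args)
    {
        // Illustrative values only: average weekly award (in sterling) for a
        // handful of made-up areas, standing in for the real FOI data.
        Map<String, Double> awards = new LinkedHashMap<String, Double>();
        awards.put("Area A", 95.0);
        awards.put("Area B", 140.0);
        awards.put("Area C", 70.0);
        awards.put("Area D", 115.0);

        double total = 0.0;
        for (double v : awards.values())
        {
            total += v;
        }

        // Slice-and-dice layout: carve an 800x600 canvas into vertical strips
        // whose widths are proportional to each area's share of the total.
        double canvasWidth = 800.0, canvasHeight = 600.0, x = 0.0;
        for (Map.Entry<String, Double> e : awards.entrySet())
        {
            double w = canvasWidth * (e.getValue() / total);
            System.out.printf("%s: x=%.1f width=%.1f height=%.1f%n",
                    e.getKey(), x, w, canvasHeight);
            x += w;
        }
    }
}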