Posts Tagged ‘statistics’

Scraping stats off of the BBC News website

March 28, 2011

This is a little project I have started, based on a thought I had a while ago. A friend of mine commented, some time ago, that he uses the Top 10 Most Read as a form of navigation around the BBC News website. I, however, typically go to the main page, glance over the stories and then head straight for the business news index (aka page); only after reading those stories would I consider using the Top 10 Most Read.

Ever since he said it, and after working at the Beeb (albeit only for a short while), I have wondered about the correlation between the stories in the Top 10 and the position of those stories on the indexes of the BBC News website. As this question has been burning a hole in my brain, I decided to try to answer it. Being a mere mortal, I don’t have access to the statistics of the BBC News website, so I decided to try to source them myself.

After doing a bit of research I designed a very simple Java application to parse each of the main indexes on the BBC News website (that is: Main, World, UK, England, Northern Ireland, Scotland, Wales, Business, Politics, Health, Education, Sci/Environment, Technology, Entertainment and Arts). For each index it would ascertain the details of the story in each position, i.e.:

Top Story, Second Story, Third Story, Other Stories, Features and Analysis.

For each of these, I would get the URL and headline, allowing me to save them, along with the index and position of the story, to my database for further analysis.

With all this complete, all that was left to do was get the Top 10 Most Read. I decided to do this by loading the Top Story on the Main index, retrieving the Top 10 from that page and storing it in the database (so reflecting how my friend would start browsing the site).

To complete this task I used HtmlCleaner, an open source HTML parser written in Java. It is extremely easy to use and learn. Then, with all my testing complete, I scheduled the job to run at the top of each hour for 12 hours to give me my first dataset (12 Top Tens, and 12 instances of where each story is on the indexes previously listed).
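The hourly schedule described above could be set up in plain Java along the following lines. This is only a sketch, not the original job: the class name `HourlyScrapeSchedule`, its methods and the `scrape` task are hypothetical, and it assumes a standard `ScheduledExecutorService` rather than whatever scheduler was actually used.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs a scrape task at the top of each hour.
class HourlyScrapeSchedule {

    // Time from 'now' until the next top of the hour (strictly after 'now').
    static Duration delayUntilNextHour(LocalDateTime now) {
        LocalDateTime nextHour = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        return Duration.between(now, nextHour);
    }

    // Runs 'scrape' once an hour, starting at the next top of the hour,
    // and shuts the executor down after 'runs' executions.
    static ScheduledExecutorService schedule(Runnable scrape, int runs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger remaining = new AtomicInteger(runs);
        long initialDelayMs = delayUntilNextHour(LocalDateTime.now()).toMillis();
        executor.scheduleAtFixedRate(() -> {
            try {
                scrape.run();
            } finally {
                if (remaining.decrementAndGet() <= 0) {
                    executor.shutdown(); // cancels any further hourly runs
                }
            }
        }, initialDelayMs, TimeUnit.HOURS.toMillis(1), TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

Calling `HourlyScrapeSchedule.schedule(task, 12)` would then give 12 hourly snapshots, matching the size of the first dataset.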

Now things get really exciting. This is data from 6th March (yes, it is a while ago, but news is always changing and I have been rather busy). Here are my initial stats:

Regarding the placing of stories, each index had exactly one top, second and third story throughout the time period, so the average for each of those positions is 1. However, Features and Analysis averaged 6.7546 stories per index, and Other Stories came in at 9.4249.
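For anyone wanting to reproduce averages like these, here is a rough sketch of the calculation. It assumes each stored row records the index, the position label, the hourly run number and the URL; the `Placement` class and its field names are made up for illustration and are not the actual database schema.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the stored rows and the per-index average.
class PlacementStats {

    // One row per story sighting: which index, which slot, and which hourly run.
    static final class Placement {
        final String index;
        final String position;
        final int run;
        final String url;

        Placement(String index, String position, int run, String url) {
            this.index = index;
            this.position = position;
            this.run = run;
            this.url = url;
        }
    }

    // Average number of stories found in 'position' per index per hourly run.
    static double averagePerIndex(List<Placement> rows, String position) {
        Set<String> snapshots = new HashSet<>(); // distinct (index, run) pairs
        int matching = 0;
        for (Placement p : rows) {
            snapshots.add(p.index + "#" + p.run);
            if (p.position.equals(position)) {
                matching++;
            }
        }
        return snapshots.isEmpty() ? 0.0 : (double) matching / snapshots.size();
    }
}
```

With one Top Story on every index in every run, `averagePerIndex(rows, "Top Story")` comes out at exactly 1, while positions that hold several stories give a larger figure.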

With this in mind, have a look at the distribution of story placing and where it featured in the top ten over said period:

As you can see, Other Stories featured at every position of the top ten across the time period.
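The tally behind a chart like this boils down to matching each Top 10 URL against the placements recorded in the same run. The sketch below shows one way to do that in memory; the class `TopTenPlacement` and the shape of the `positionsByUrl` map are assumptions for illustration, and the real analysis may well have been done in the database instead.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tally of where Top 10 stories were placed on the indexes.
class TopTenPlacement {

    // For each position label, counts how many sightings of Top 10 stories
    // were recorded in that position. 'positionsByUrl' maps a story URL to
    // the position labels it held across the indexes in the same run.
    static Map<String, Integer> countByPosition(List<String> topTenUrls,
                                                Map<String, List<String>> positionsByUrl) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : topTenUrls) {
            List<String> positions =
                positionsByUrl.getOrDefault(url, Collections.<String>emptyList());
            for (String position : positions) {
                counts.merge(position, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A Top 10 story that does not appear on any parsed index simply contributes nothing to the counts.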

Now, what about the indexes?

I always expected the Main index (i.e. the front page) to be the clear leader. It is the leader, but only jointly with Politics and UK (which is good).

Finally, let's look at the index and position of stories:

It is interesting that Top Story is not the clear leader on each index, although there are X other stories and Y feature stories, which increases their advantage in this graph. I will deal with this when I get time to perform further analysis.

Finally, a very basic attempt at a word cloud from the headlines for the time period:
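The counting step behind a basic word cloud can be sketched as follows. This is an illustration only: the class `HeadlineWords`, the tokenising rule and the stop-word set are all assumptions, and it does not cover the rendering of the cloud itself.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical word-frequency count over stored headlines.
class HeadlineWords {

    // Lower-cases each headline, splits on anything that is not a letter,
    // digit or apostrophe, and counts the remaining words. Words in
    // 'stopWords' (e.g. "the", "to") are skipped.
    static Map<String, Integer> wordCounts(List<String> headlines, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String headline : headlines) {
            for (String word : headline.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
                if (word.isEmpty() || stopWords.contains(word)) {
                    continue;
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The resulting counts can then be used to scale the size of each word in the cloud.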

Statistics on the similarity algorithms

July 27, 2010

So, just under a month ago I posted the following on similarity algorithms.

I have found it quite interesting to see which have been viewed, and the percentage share. Initially Damerau-Levenshtein was the hot favourite, but then N-gram started coming through the ranks and is now the clear leader with a 39% share of the hits!

I thought I would put together a simple histogram of the percentage share of hits for each similarity posting:

I'm disappointed that my favourite, Markov chains, is languishing so far back…
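For reference, the percentage share of hits is just each post's hit count divided by the total. A small sketch of that arithmetic, with a crude text bar for a histogram, might look like this; the class `HitShare` is hypothetical and any hit counts passed in are the caller's own, not figures from this blog.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for turning raw hit counts into percentage shares.
class HitShare {

    // Returns each post's share of the total hits, as a percentage,
    // keeping the iteration order of the input map.
    static Map<String, Double> percentageShare(Map<String, Integer> hits) {
        long total = 0;
        for (int h : hits.values()) {
            total += h;
        }
        Map<String, Double> share = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            share.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return share;
    }

    // Text bar for a simple histogram: one '#' per 2 percentage points.
    static String bar(double percent) {
        int length = (int) Math.round(percent / 2.0);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append('#');
        }
        return sb.toString();
    }
}
```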