Get started with screenscraping using Google Chrome’s Scraper extension

How do you get information from a website into an Excel spreadsheet? The answer is screenscraping. There are a number of tools and platforms (such as OutWit Hub, Google Docs and ScraperWiki) that help you do this, but none of them are – in my opinion – as easy to use as the Google Chrome extension Scraper, which has become one of my absolute favourite data tools.

What is a screenscraper?

I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreds, thousands or even more pages, it can be an incredibly powerful tool.

In its simplest form, the one we will look at in this blog post, it gathers information from a single webpage.

Google Chrome’s Scraper

Scraper is a Google Chrome extension that can be installed for free from the Chrome Web Store.

[Image: the Scraper extension in the Chrome Web Store]

If you have installed the extension correctly, you should see the option “Scrape similar” when you right-click any element on a webpage.

The Task: Scraping the contact details of all Swedish MPs

[Image: the list of Swedish MPs]

This is the site we’ll be working with: a list of all Swedish MPs, including their contact details. Start by right-clicking the name of any person and choosing Scrape similar. This should open the following window.

Understanding XPaths

At w3schools you’ll find a broader introduction to XPaths.

Before we move on to the actual scrape, let me briefly introduce XPaths. XPath is a language for finding information in an XML structure, for example an HTML file. It is a way to select tags (or rather “nodes”) of interest. In this case we use XPaths to define which parts of the webpage we want to collect.

A typical XPath might look something like this:

//div[@id="content"]/table[1]/tr

Which in plain English translates to:

// - Search the whole document...
div[@id="content"] - ...for the div tag with the id "content".
table[1] - In that div, select the first table.
tr - And in that table, grab all rows.
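If you want to test an XPath outside the browser, here is a minimal Ruby sketch using the Nokogiri gem (the HTML snippet is made up for illustration). One caveat: Chrome’s element inspector shows a tbody element inside tables even when the raw HTML has none, so an XPath copied from the browser sometimes needs a small adjustment.

    require 'nokogiri'

    html = <<~HTML
      <div id="content">
        <table>
          <tr><td>Row 1</td></tr>
          <tr><td>Row 2</td></tr>
        </table>
      </div>
    HTML

    doc = Nokogiri::HTML(html)
    # The same XPath as above: every row of the first table in the "content" div
    doc.xpath('//div[@id="content"]/table[1]/tr').each do |row|
      puts row.text.strip
    end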

Over to Scraper then. I’m given the following suggested XPath:

//section[1]/div/div/div/dl/dt/a

The results look pretty good, but it seems we only get names starting with an A. And we would also like to collect the phone numbers and party names. So let’s go back to the webpage and look at the HTML structure.

Right-click one of the MPs and choose Inspect element. We can see that each alphabetical list is contained in a section tag with the class “grid_6 alpha omega searchresult container clist”.

If we open the section tag we find the list of MPs in div tags.

We will do this scrape in two steps. Step one is to select the tags containing all information about the MPs with one XPath. Step two is to pick the specific pieces of data that we are interested in (name, e-mail, phone number, party) and place them in columns.

Writing our XPaths

In step one we want to get as deep into the HTML structure as possible without losing any of the elements we are interested in. Hover over the tags in the Elements window to see which tags correspond to which elements on the page.

In our case this is the last tag that contains all the data we are looking for:

//section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl

Click Scrape to test run the XPath. It should give you a list that looks something like this.

Scroll down the list to make sure it has 349 rows. That is the number of MPs in the Swedish parliament. The second step is to split this data into columns. Go back to the webpage and inspect the HTML code.

I have highlighted the parts that we want to extract. Grab them with the following XPaths:

name: dt/a
party: dd[1]
region: dd[2]/span[1]
seat: dd[2]/span[2]
phone: dd[3]
e-mail: dd[4]/span/a

Insert these paths in the Columns field and click Scrape to run the scraper.
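Scraper does all of this in the browser, but the same two-step logic carries over directly to code. As a rough illustration, here is a minimal Ruby sketch of the same scrape using the Nokogiri and open-uri libraries (the URL is a placeholder, since the address isn’t spelled out here, and a reasonably modern Ruby is assumed):

    require 'open-uri'
    require 'nokogiri'

    # Placeholder address; replace with the actual URL of the MP list page
    url = 'http://example.org/mp-list'
    doc = Nokogiri::HTML(URI.open(url))

    # Step one: one dl node per MP
    mps = doc.xpath('//section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl')

    # Step two: pick the columns out of each dl node
    mps.each do |dl|
      name   = dl.at_xpath('dt/a')&.text
      party  = dl.at_xpath('dd[1]')&.text
      region = dl.at_xpath('dd[2]/span[1]')&.text
      seat   = dl.at_xpath('dd[2]/span[2]')&.text
      phone  = dl.at_xpath('dd[3]')&.text
      email  = dl.at_xpath('dd[4]/span/a')&.text
      puts [name, party, region, seat, phone, email].join("\t")
    end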

Click Export to Google Docs to get the data into a spreadsheet. Here is my output.

Now that wasn’t so difficult, was it?

Taking it to the next level

Congratulations. You’ve now learnt the basics of screenscraping. Scraper is a very useful tool, but it also has its limitations. Most importantly, you can only scrape one page at a time. If you need to collect data from several pages, Scraper is no longer very efficient.

To take your screenscraping to the next level you need to learn a bit of programming. But fear not. It is not rocket science. If you understand XPaths you’ve come a long way.
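To give you a taste of what that next step looks like, here is a minimal Ruby sketch that walks through a series of paginated result pages and applies an XPath to each one (the URL pattern and the XPath are made up for illustration):

    require 'open-uri'
    require 'nokogiri'

    # Hypothetical paginated listing: ?page=1, ?page=2, ...
    base = 'http://example.org/results?page=%d'

    (1..10).each do |page|
      doc = Nokogiri::HTML(URI.open(format(base, page)))
      doc.xpath('//table[1]/tr').each do |row|
        puts row.text.strip
      end
      sleep 1  # be polite to the server
    end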



Finding the optimal marathon strategy in data

After half a year of training I ran my third marathon this weekend in Stockholm. It was a great race and I managed to take three minutes off my PB and finish in 3.10.

However, the race got me thinking: what is the optimal strategy for a marathon race? That is, how fast should I run in the beginning? I turned to data to look for the answer.

The Data

Stockholm Marathon gathers up to 20 000 runners every year. For each runner split times are recorded every 5 km. So there is plenty of data to work with here.

To get the data I wrote a Ruby script that scans the results of 2010 and records the results of all the (male) runners that participated in both 2010 and 2009. The scrape gave me a dataset of 3965 runners. For each runner I stored the variables age, finish time and time at 10 km.

To make the two races comparable I multiplied the 2009 times by 0.967, the ratio between the average finish times of the two years: the average time in 2010 was only 4.04.44, compared to 4.12.58 in 2009 (14684 / 15178 seconds ≈ 0.967). Presumably the conditions were better in 2010.

The Analysis

We assume that each runner is equally good in 2009 and 2010. This means that a good run can be defined as a run where the time is better than in the other year. If runner A does 4.00 in 2009 and 3.45 in 2010, he has performed better in 2010. But the question is: did the starting pace affect the result?

Let’s start by splitting the runners into three groups: first the ones that improved their times by at least 10 percent in 2010 (green), second the ones that did close to the same result in 2009 and 2010 (blue), and third the ones that did at least 10 percent worse in 2010 (red).

Fear not if you don’t get this picture immediately. I’ll try to explain:

  • The y-axis is the relative pace at 10 km. Values above 0 mean that the starting pace was slower than the average pace for the whole race; values below 0 mean that it was faster.
  • The x-axis is the finish time for 2010 divided by the finish time for 2009. A value below 1 means that the time in 2010 was better than the time in 2009.
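In code, the two axes and the three groups boil down to a few lines. Here is a minimal Ruby sketch of the calculation (the field names and sample numbers are my own; times are in seconds, and the 0.967 adjustment and the 10 percent thresholds are the ones described above):

    # Times in seconds
    Runner = Struct.new(:t10k_2010, :finish_2010, :finish_2009)

    def classify(r)
      finish_2009_adj = r.finish_2009 * 0.967     # adjust for the easier 2010 conditions
      x = r.finish_2010 / finish_2009_adj         # x-axis: below 1 means 2010 was better
      even_10k = r.finish_2010 * (10.0 / 42.195)  # the 10 km split at a perfectly even pace
      y = (r.t10k_2010 - even_10k) / even_10k     # y-axis: relative starting pace
      group = if    x <= 0.9 then :green          # at least 10 percent faster in 2010
              elsif x >= 1.1 then :red            # at least 10 percent slower in 2010
              else  :blue                         # roughly the same result
              end
      [x, y, group]
    end

    # A runner who finished 2010 in 4.00 (14400 s) after a 55-minute first 10 km
    p classify(Runner.new(3300.0, 14400.0, 15000.0))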

What does the graph tell us? As you can see, the green runners (the ones that improved their times in 2010 compared to 2009) start the race at almost the same pace as they keep for the rest of the race. In other words: the starting pace is almost the same as the average pace for the whole 42 km.

The red runners (the ones that did worse in 2010 than 2009), on the other hand, don’t manage to keep up the speed. Their first 10 km are considerably faster than the average speed for the rest of the race.

How big is the difference? An average green runner (that is heading for a time of 4.00) does the first 10 km 20 seconds faster than his average speed for the whole race. The blue runner 36 seconds faster. The red runner 1.24 minutes faster.

The Conclusion

What is the optimal marathon strategy then? It should come as no surprise to anyone who has ever run 42 km that the race only really starts after 30 km. The difference between the runners that managed to improve their times between 2009 and 2010 and the ones that didn’t is that the latter group was not able to keep a steady pace throughout the race.

Myself, I started at an optimistic pace of 4.15-4.25 min/km and ended up around 4.45-4.55 min/km. Not a perfect way to set up a race according to this analysis, but good enough for a PB.

A Sidenote

So, we now know that a steady pace is what you should aim for when you run a marathon. Starting too fast is a risky strategy. As we could see in the previous graph many of the fastest starters in 2010 ended up performing poorly. But what about the fast starters in 2009? Did they learn from their mistakes?

Actually, yes. In this next graph I have singled out the 10 percent of runners that started fastest in 2009 and compared their results in 2009 and 2010.


The setup is the same as in the previous graph: the y-axis is the relative starting pace, the x-axis the finish time compared to the other year (for 2010: 2010/2009, for 2009: 2009/2010). The blue dots represent 2009, the green 2010.

Two conclusions can be made:

  • The majority of the runners started slower in 2010 (or they were able to keep the pace better).
  • Most of the runners improved their times (apparently they had learnt something).

Want to explore the data yourself?


Finland on Facebook – according to candidates

With whom should Finland be friends on Facebook? Helsingin Sanomat asked this question of all the candidates in the parliamentary elections. I screenscraped the 1747 answers to see what they thought. If the candidates got to choose, Finland’s Facebook profile would look something like this:

Russia – 1057 friends in common
Sweden – 885 friends in common
Estonia – 595 friends in common
Norway – 591 friends in common
Germany – 373 friends in common
USA – 219 friends in common
Denmark – 145 friends in common
China – 119 friends in common
Cuba – 71 friends in common
India – 61 friends in common

So Russia is apparently our best friend. Or at least that is what we want them to believe. Cuba ends up surprisingly high, but that is largely because of the Communist Party, which is still keeping it real.

You’ll find the data on Google Docs if you want to examine it yourself.


Vaalikone of Yle scraped and ready to download

The public broadcaster Yle published its voting advice application (vaalikone) last week (in Finnish and Swedish only, which is quite a shame for a public broadcaster – should we not encourage new citizens to take part in politics?). I took the chance to practise my screenscraping skills. You’ll find the result here:

1585 candidates answered the 35 questions, which makes for a pretty interesting set of data. A first analysis and visualization of one of the questions is coming up shortly.

A few remarks:

  • Questions 31-33 have been left out, because they were different in every district and therefore not comparable.
  • Question 34 is multiple choice and therefore listed in several columns.
  • Questions and answers are listed in the second sheet of the spreadsheet in Google Docs.

Enjoy!


Edit: The dataset has been updated with a new scrape from 24.3.2011.


Vaalikone visualization: Why the big parties lose voters

Elections are coming up here in Finland and the first voting advice applications (vaalikoneet) are just being opened. This is a bit like Christmas if you are interested in data: hundreds of candidates give their views on political issues and at the same time create awesome data material.

Unfortunately the media houses have not yet learned to see the possibilities of open data. At least I have never seen anyone share the raw data from these voting advice applications publicly. But with some web scraping skills the information can be yours anyway.

Kepa, the Service Centre for Development Cooperation in Finland, has a nice little voting advice application focusing on foreign policy – migration, foreign aid, Nato, climate change, peacekeeping and so on. Scraping the site wasn’t too difficult. My main issue was the umlauts (åäö) in the URLs, but after a few hours of reading discussion boards and tutorials I figured it out.
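For anyone hitting the same wall, the usual fix is to percent-encode the non-ASCII characters before building the URL. A minimal Ruby sketch (the example URL is made up):

    require 'cgi'

    # Percent-encode non-ASCII characters as UTF-8 before putting them in a URL
    CGI.escape('åäö')  # => "%C3%A5%C3%A4%C3%B6"

    # Hypothetical example of building a query URL with an umlaut in it
    url = "http://example.org/candidates?name=#{CGI.escape('Päivi')}"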

The scrape resulted in a dataset with the answers of 1045 candidates to 19 questions. I then grouped some of the questions that related to each other into four thematic indexes: migration, military interventionism, climate change and free trade. As migration is one of the hottest issues in these elections, I chose to look more closely at this one.

I spent about half a day trying to put together a visualization in Google’s new Public Data Explorer. I think this service could potentially become a really powerful tool. However, at present it is very difficult to upload and use your own datasets. You have to define the visualization manually in XML and, despite a pretty straightforward tutorial, I couldn’t get my data uploaded.

Instead I turned to Many Eyes, one of the best tools around for online data visualization. Many Eyes rendered this very interesting graph (click to open and explore):

Finnish parties on migration

How to read this graph: the x-axis is an index based on questions one and two in the poll (should Finland allow more immigrants and refugees?). On the right side you have liberal, pro-immigration candidates; on the left side, conservative ones.

For the past couple of months everyone has been talking about the progress of the True Finns (Perussuomalaiset), a right-wing populist party that is said to be gaining disappointed voters from the traditional parties. People basically feel that the big parties all say the same thing; the True Finns provide an alternative. This data shows that this is more or less true. The candidates of the three biggest parties – the Coalition Party (Kokoomus), the Centre Party (Keskusta) and the Social Democrats – more or less share opinions (or lack of opinions) on migration. There is hardly even a difference between Kokoomus and SDP (select each of the parties in the left-side menu to explore the difference)!

I don’t have time to look any deeper into this dataset now, but I will later. There are plenty of things to explore here. Do we see the same lack of differences in other questions? Is there a difference between old and young candidates? Do the different regions differ?

I’ll leave you with the link to the complete spreadsheet on Google Docs if you want to use the data yourself. You can also use my data on Many Eyes to build your own visualization.


Twitter proves: winter is good

I’ve been getting a lot of complaints on my Facebook feed lately about the weather just because of a few cold days. For some reason people expect February to be a spring month nowadays. I’d say global warming isn’t quite there yet.

I wanted to find out what the (Swedish) Twitter community has had to say about the weather over the last couple of days. For me (a person who enjoys a good cold and snowy winter) the results were positive:

“Wonderful”, “beautiful” and “fantastic” are some of the words Twitter users have used to describe the weather last couple of days. I think they have been right.

How did I do this? It was quite simple actually.

  • I started off with a simple scrape of a search for “väder” on search.twitter.com (Swedish only, and excluding posts containing “darth” – the search engine does not see the difference between “väder” and “vader”). Again, ScraperWiki provided a good base for the script; a sketch of the query follows after this list.
  • I then pasted all the scraped tweets into Wordle and removed a bunch of irrelevant small words and some verbs.
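For the curious, here is a minimal Ruby sketch of roughly what such a query looked like against the old search.twitter.com JSON endpoint (long since retired, so treat it purely as an illustration of the idea):

    require 'open-uri'
    require 'json'
    require 'cgi'

    # Exclude "darth" so Vader quotes don't pollute the weather search
    query = CGI.escape('väder -darth')
    # The old v1 search endpoint; retired years ago, shown only for illustration
    url = "http://search.twitter.com/search.json?q=#{query}&lang=sv&rpp=100"

    JSON.parse(URI.open(url).read)['results'].each do |tweet|
      puts tweet['text']
    end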

Easy as that.


Mapping Ratata: Who’s Hot?

I wanted to play around in Gephi a bit more after my previous post about visualizing my social network on Facebook. So for my second project I turned my eyes to Ratata, a Swedish blog community in Finland with just over 1200 bloggers. A friend of mine, Poppe (also on Ratata), has been talking about analyzing the Swedish blogosphere. I hope he doesn’t mind me “borrowing” the idea.

I have almost no prior programming experience, but for some time now I have been trying to learn more about screenscraping. Guided mostly by Dan Nguyen’s brilliant tutorial on coding for journalists, I have started to find my way around Ruby. ScraperWiki also provides good guidance for those of us who still mostly do copy-paste programming.

After two days of trial and error I managed to put together a script that extracts all the links to fellow Ratata blogs from all 1207 blogs (a sketch of the idea follows below). That gave me a dataset of almost 2000 connections (due to some technical issues I had to exclude a couple of blogs). I obviously wanted to find out who is most popular. That is, who gets the most in-links? This is the result (click for a full-scale pdf):

The size depends on the number of in-links. Karin, one of the founders of the blog community, is perhaps not too surprisingly number one, with 70 other Ratata bloggers linking to her, followed by Mysfabon (43) and Kisimyran (37). You’ll also notice that the gap between the haves and the have-nots is big when it comes to links. The core of the map is surrounded by a cloud of unconnected blogs (and shattered dreams of blogger fame, perhaps?).
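For the record, the crawl itself boils down to something like this minimal Ruby sketch (the blog addresses and the link pattern here are illustrative; the real script walked all 1207 front pages):

    require 'open-uri'
    require 'nokogiri'

    # Illustrative subset; the real crawl covered all 1207 blogs
    blogs = %w[karin.ratata.fi mysfabo.ratata.fi kisimyran.ratata.fi]

    edges = []
    blogs.each do |blog|
      begin
        doc = Nokogiri::HTML(URI.open("http://#{blog}/"))
      rescue StandardError
        next  # a couple of blogs failed to load and were excluded
      end
      doc.xpath('//a/@href').each do |href|
        # Keep only links that point to other Ratata blogs
        if href.value =~ %r{https?://([\w-]+\.ratata\.fi)} && $1 != blog
          edges << [blog, $1]
        end
      end
    end

    edges.uniq.each { |from, to| puts "#{from} -> #{to}" }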

I’ve uploaded the Gephi file if you want to take a closer look at the dataset yourself.

Here is the complete top ten:

Blog – In-links
karin.ratata.fi 70
mysfabo.ratata.fi 43
kisimyran.ratata.fi 37
soxxy.ratata.fi 33
smulansliv.ratata.fi 33
smilla.ratata.fi 32
klaffen.ratata.fi 31
svensken.ratata.fi 30
skinka.ratata.fi 28
madison.ratata.fi 27
