Get started with screenscraping using Google Chrome’s Scraper extension

Posted: October 12, 2012 | Author: Jens Finnäs | Filed under: Tutorials | Tags: google chrome, parliament, politics, scraper, screen scraping, tutorial |45 Comments

How do you get information from a website into an Excel spreadsheet? The answer is screenscraping. There are a number of tools and platforms (such as OutWit Hub, Google Docs and ScraperWiki) that help you do this, but none of them is, in my opinion, as easy to use as the Google Chrome extension Scraper, which has become one of my absolute favourite data tools.

What is a screenscraper?

I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreds, thousands or even more pages, it can be an incredibly powerful tool.

In its simplest form, the one that we will look at in this blog post, it gathers information from a single webpage.

Google Chrome’s Scraper

Scraper is a Google Chrome extension that can be installed for free from the Chrome Web Store.

Now if you installed the extension correctly you should be able to see the option “Scrape similar” if you right-click any element on a webpage.

The Task: Scraping the contact details of all Swedish MPs

This is the site we’ll be working with, a list of all Swedish MPs, including their contact details. Start by right-clicking the name of any person and choose Scrape similar. This should open the following window.

Understanding XPaths

At w3schools you’ll find a broader introduction to XPaths.

Before we move on to the actual scrape, let me briefly introduce XPaths. XPath is a language for finding information in an XML structure, for example an HTML file. It is a way to select tags (or rather “nodes”) of interest. In this case we use XPaths to define which parts of the webpage we want to collect.

A typical XPath might look something like this:

//div[@id="content"]/table[1]/tr

Which in plain English translates to:

// - Search the whole document...
div[@id="content"] - ...for the div tag with the id "content".
table[1] -  Select the first table.
tr - And in that table, grab all rows.
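
The same XPath can be tried outside the browser. Below is a minimal sketch using Python's standard library, assuming a small, well-formed sample document that was invented for illustration. Note that `xml.etree.ElementTree` only supports a subset of XPath and expects paths to start with `.`, so the expression is written as `.//div[...]` rather than `//div[...]`.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, only meant to illustrate the XPath above.
HTML = """
<html><body>
  <div id="menu">
    <table><tr><td>ignored</td></tr></table>
  </div>
  <div id="content">
    <table>
      <tr><td>row 1</td></tr>
      <tr><td>row 2</td></tr>
    </table>
    <table>
      <tr><td>second table</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(HTML.strip())

# Search the whole document for the div with id "content",
# take its first table and grab all rows in that table.
rows = root.findall(".//div[@id='content']/table[1]/tr")
row_texts = [row.find("td").text for row in rows]

print(row_texts)  # ['row 1', 'row 2']
```

Only the two rows of the first table inside the "content" div are returned; the table in the other div and the second table are skipped.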

Over to Scraper then. I’m given the following suggested XPath:

//section[1]/div/div/div/dl/dt/a

The results look pretty good, but it seems we only get names starting with an A. We would also like to collect phone numbers and party names. So let’s go back to the webpage and look at the HTML structure.

Right-click one of the MPs and choose Inspect element. We can see that each alphabetical list is contained in a section tag with the class “grid_6 alpha omega searchresult container clist”.

And if we open the section tag we find the list of MPs in div tags.

We will do this scrape in two steps. Step one is to select the tags containing all information about the MPs with one XPath. Step two is to pick the specific pieces of data that we are interested in (name, e-mail, phone number, party) and place them in columns.

Writing our XPaths

In step one we want to get as deep into the HTML structure as possible without losing any of the elements we are interested in. Hover over the tags in the Elements window to see which tags correspond to which elements on the page.

In our case this is the last tag that contains all the data we are looking for:

//section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl

Click Scrape to test run the XPath. It should give you a list that looks something like this.
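
To see why the suggested XPath only returned names starting with A while the class-based one returns everybody, here is a small sketch in Python. It assumes an invented page with three alphabetical sections and made-up names; only the class name and the XPaths are taken from this post.

```python
import xml.etree.ElementTree as ET

SECTION_CLASS = "grid_6 alpha omega searchresult container clist"


def make_section(names):
    """Build one alphabetical section shaped like the XPaths in this post."""
    items = "".join(f"<dl><dt><a>{name}</a></dt></dl>" for name in names)
    return (
        f'<section class="{SECTION_CLASS}">'
        f"<div><div><div>{items}</div></div></div>"
        "</section>"
    )


# Invented sample page: one section per letter.
PAGE = (
    "<html><body>"
    + make_section(["Adams", "Alm"])
    + make_section(["Berg"])
    + make_section(["Carlsson", "Cedar"])
    + "</body></html>"
)

root = ET.fromstring(PAGE)

# The suggested XPath: section[1] matches the first section only.
suggested = root.findall(".//section[1]/div/div/div/dl/dt/a")
suggested_names = [a.text for a in suggested]

# The class-based XPath: matches every section with that class.
by_class = root.findall(
    f".//section[@class='{SECTION_CLASS}']/div/div/div/dl"
)

print(suggested_names)  # ['Adams', 'Alm']
print(len(by_class))    # 5
```

Counting the matches this way is the scripted equivalent of scrolling down the result list to check the number of rows.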

Scroll down the list to make sure it has 349 rows. That is the number of MPs in the Swedish parliament. The second step is to split this data into columns. Go back to the webpage and inspect the HTML code.

I have highlighted the parts that we want to extract. Grab them with the following XPaths:

name: dt/a
party: dd[1]
region: dd[2]/span[1]
seat: dd[2]/span[2]
phone: dd[3]
e-mail: dd[4]/span/a

Insert these paths in the Columns field and click Scrape to run the scraper.
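
The two steps (one XPath for the rows, one XPath per column) can also be sketched in Python with the standard library. The sample markup, names, phone numbers and addresses below are invented to fit the XPaths above and are not the real page; the CSV output stands in for the spreadsheet export.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample markup shaped to fit the XPaths in this post.
PAGE = """
<html><body>
<section class="grid_6 alpha omega searchresult container clist">
  <div><div><div>
    <dl>
      <dt><a href="#">Andersson, Anna</a></dt>
      <dd>Party A</dd>
      <dd><span>Region One</span><span>12</span></dd>
      <dd>08-000 00 01</dd>
      <dd><span><a href="#">anna@example.org</a></span></dd>
    </dl>
  </div></div></div>
</section>
<section class="grid_6 alpha omega searchresult container clist">
  <div><div><div>
    <dl>
      <dt><a href="#">Berg, Bo</a></dt>
      <dd>Party B</dd>
      <dd><span>Region Two</span><span>34</span></dd>
      <dd>08-000 00 02</dd>
      <dd><span><a href="#">bo@example.org</a></span></dd>
    </dl>
  </div></div></div>
</section>
</body></html>
"""

# Step one: the XPath that selects one element per MP.
ROW_XPATH = (
    ".//section[@class='grid_6 alpha omega searchresult container clist']"
    "/div/div/div/dl"
)

# Step two: column XPaths, relative to each row element.
COLUMNS = {
    "name": "dt/a",
    "party": "dd[1]",
    "region": "dd[2]/span[1]",
    "seat": "dd[2]/span[2]",
    "phone": "dd[3]",
    "e-mail": "dd[4]/span/a",
}


def scrape(page):
    """Return one dict per row, with one key per column."""
    root = ET.fromstring(page.strip())
    records = []
    for row in root.findall(ROW_XPATH):
        record = {}
        for column, xpath in COLUMNS.items():
            node = row.find(xpath)
            # Missing elements become empty cells instead of errors.
            record[column] = (node.text or "").strip() if node is not None else ""
        records.append(record)
    return records


def to_csv(records):
    """Serialise the records as CSV text, ready for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(COLUMNS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = scrape(PAGE)
print(to_csv(records))
```

Keeping the column paths relative to the row element is what keeps each person's name, party and phone number on the same line, even when a field is missing for someone.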

Click Export to Google Docs to get the data into a spreadsheet. Here is my output.

Now that wasn’t so difficult, was it?

Taking it to the next level

Congratulations. You’ve now learnt the basics of screenscraping. Scraper is a very useful tool, but it also has its limitations. Most importantly you are only able to scrape one page at a time. If you need to collect data from several pages Scraper is no longer very efficient.

To take your screenscraping to the next level you need to learn a bit of programming. But fear not. It is not rocket science. If you understand XPaths you’ve come a long way.
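
As a rough idea of what that programming looks like, here is a minimal sketch in Python that applies one XPath to several pages in a loop. The URLs and page contents are invented and served from a dictionary so the example runs offline. It also assumes the pages are well-formed XML; many real web pages are not, and would need a more forgiving HTML parser than the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its text (assumes UTF-8 encoding)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def scrape_pages(urls, xpath, fetch_page=fetch):
    """Apply one XPath to every page and collect (url, text) pairs."""
    results = []
    for url in urls:
        root = ET.fromstring(fetch_page(url))
        for node in root.findall(xpath):
            results.append((url, (node.text or "").strip()))
    return results


# Offline demo: hypothetical URLs mapped to invented page contents.
FAKE_SITE = {
    "http://example.org/list?page=1":
        "<html><body><ul><li>Alpha</li><li>Beta</li></ul></body></html>",
    "http://example.org/list?page=2":
        "<html><body><ul><li>Gamma</li></ul></body></html>",
}

rows = scrape_pages(FAKE_SITE, ".//li", fetch_page=FAKE_SITE.get)
for url, text in rows:
    print(url, text)
```

Passing the download function in as a parameter makes it possible to test the loop without touching the network; leaving out `fetch_page` would make it download the real URLs instead.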

Here is how you continue from here:

Read Scraping for journalists by Paul Bradshaw. This e-book introduces scraping to non-programmers using tools such as Google Docs, OutWit Hub and ScraperWiki.
Learn Ruby. Dan Nguyen shows you how with his Bastards Book of Ruby and Coding for Journalists 101.
Or learn Python. One way to get started is the P2PU course Python for Journalists.

45 Comments on “Get started with screenscraping using Google Chrome’s Scraper extension”

Dokumentation: Slidesen, klippen, tweetsen « Fajk says:

November 26, 2012 at 23:12

[…] Blogg: Get started with screenscraping using Google Chrome’s Scraper extension […]

Reply
#Tip: Scrape web pages using this Chrome extension | Editors Blog | Journalism.co.uk says:

June 28, 2013 at 10:10

[…] Jens Finnäs has provided a tutorial on his Dataist blog. He explains how he used the extension to scrape the contact details of all Swedish […]

Reply
DreamHost Discount says:

July 11, 2013 at 08:17

This scraper plugin is really cool. I didn’t know that this kind of plugin really exists on the earth. Thanks for giving such detailed instructions. I am gonna install it right now.

Reply
Anton says:

July 11, 2013 at 09:59

This plugin is not so bad, but http://convextra.com can do the same faster in 1 click..

Reply
Tools, Slides and Links from NICAR13 // Ricochet by Chrys Wu says:

September 16, 2013 at 07:51

[…] • Scrape screen scraper Chrome extension. Journalist Jens Finnäs wrote a tutorial for it on Dataists. • Time Flow by Martin Wattenberg & Fernanda Viegas • Stately – a symbol font […]

Reply
Craig Gooden says:

September 26, 2013 at 14:55

I was wondering if anyone can help me – I’m very new to Google data scraping (and loving it!). I need to get data from a website (1000s of pages) – the information I’m after is in the source code – XX12345 01/2013 – what I need for each URL is the ‘XX12345 01/2013. I have a list of the URLs in Excel / CSV format. Is there any automated way of doing it? Something like the Google Spreadsheet function importHTML. Any help appreciated.

Reply
- Craig Gooden says:
  
  September 26, 2013 at 14:57
  
  Sorry, post has removed my code… I need to find the text in the code…
  XX12345 01/2013
  the bit I need is XX12345 01/2013
  
  Reply
Craig Gooden says:

September 26, 2013 at 14:58

ah….

__XX12345 01/2013__

Reply
Craig Gooden says:

September 26, 2013 at 14:59

I give up!!!!

XX12345 01/2013

Reply
Craig Gooden says:

September 26, 2013 at 15:00

Basically it’s the text from within , XX12345 01/2013 , < , / , p

Reply
Craig Gooden says:

September 26, 2013 at 15:07

Ok, this site basically strips out the code I want to show you – I’m trying to get the text which is after the phrase ” dateCode ” on every webpage. I have a list of URLs in Excel/CSV. Can I use Google spreadsheet function like importHTML?

Reply
solarscourge says:

December 6, 2013 at 10:56

You should try GrabzIt’s on-line screen scraping tool, which offers more advanced features and is much more customizable: http://grabz.it/scraper

Reply
john bole says:

January 11, 2014 at 05:52

hello I was trying to scrape the business name, address, phone and URL from this list. The problem is that the scraper first needs to enter the hyperlink for each company and scrape info from second page. Can anybody help?

Below is the website I need to get the info as you can see there are 37 pages

http://ces14.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1#GotoResults

thanks

Reply
Khloe says:

March 1, 2014 at 00:19

excellent points altogether, you just won a logo new reader. What could you recommend about your submit that you simply made some days ago? Any positive?

Reply
john bole says:

March 1, 2014 at 22:03

? what are you talking about ?

Reply
iwebscraping says:

April 9, 2014 at 11:01

Nice Post, Thanks for explaining screen scraper and how to use Google chrome’s scraper by step by step method.

Reply
affordable online marketing says:

April 23, 2014 at 04:44

I needed to thank you for this excellent read!!
I definitely loved every bit of it. I have you book-marked to check out new things you post…

Reply
Noelia says:

August 16, 2014 at 06:33

I’m gone to inform my little brother, that he should alo visit this webpage on regular basis to get updated from newest news.

Reply
Serge says:

September 21, 2014 at 22:31

Hi,

I don’t see the Scrape similar when I right click.

Any idea why?

I’m using chrome on win 8.

Thanks

Reply
- Serge says:
  
  September 21, 2014 at 22:41
  
  Everything is fine. I had the wrong app installed.
  
  Thanks
  
  Reply
naehmaschine says:

September 23, 2014 at 00:57

It’s truly a nice and useful piece of information.
I am glad that you just shared this useful information with us.
Please keep us up to date like this. Thanks for sharing.

Reply
Non-Programmers Guide to Scraping Data | zenagiwa says:

October 11, 2014 at 17:23

[…] https://dataist.wordpress.com/2012/10/12/get-started-with-screenscraping-using-google-chromes-scraper… […]

Reply
hack moviestarplanet dansk says:

December 17, 2014 at 03:15

fantastic points altogether, you simply received a new reader.
What might you recommend in regards to your publish that
yoou simply made a few days ago? Any certain?

Reply
Leilani says:

December 24, 2014 at 04:08

Quality content is the secret to interest the visitors to pay a visit the website,
that’s what this site is providing.

Reply
Lamont Smith says:

December 31, 2014 at 12:15

My business has grown a lot just because it gets the appearance by the help of yellow pages scraper. Tremendous work.
For More Info : https://www.youtube.com/watch?v=WPKnSFIXKB0

Reply
https://www.boisecashloan.com says:

January 1, 2015 at 09:56

Undeniably consider that that you stated. Your favourite reason appeared to be at the web
the simplest thing to have in mind of. I say to you, I certainly get
annoyed at the same time as people think about worries that they plainly do
not understand about. You managed to hit
the nail upon the top and outlined out the whole thing with no need side effect , other folks can take a signal.
Will probably be back to get more. Thank you

Reply
Abrahem Aarde says:

January 6, 2015 at 07:26

Huge effort have been done on this blog. I appreciate it. It helps me a lot.
For More : http://youtu.be/5Hv1JXY2wL4

Reply
How To Gift Card Churn Like A Pro – Finding The Best Rate – Chasing The Points says:

January 22, 2015 at 12:00

[…] you need Google Chrome, then I highly encourage you read this article and download the Scraper extension for Google […]

Reply
alex says:

April 21, 2015 at 22:39

Hi! Great post and of very good quality, congratulations as it was very clear and very easy to follow. I do have a question. Once exported to excel, is there a way to make it update every “x” time or every time changes in the original website so we don’t have to check every time if the data matches?

And second of all, I do have the Xpath, but the last “container” “li” has several “li” how we could create a loop to take all the “li/@class’ X’, as they all have the same name and it only takes one. Also, to take un “container” upper isn’t an option as the last container has another “class” as well and I’m only interested in this last one.

Also to consider that the number of “li” (rows) changes with time.

Thank you in advance and great post 🙂

Reply
smoky quartz meaning says:

July 1, 2015 at 06:51

It’s hard to come by well-informed people about this topic,
however, you seem like you know what you’re talking about!

Thanks

Reply
Quick EAD changes with Notepad ++ and Scraper – Scott Louis Ziegler says:

December 23, 2016 at 23:10

[…] this tutorial from a few years ago, I used the Scraper Chrome extension to select an added name. I modified the XPath as needed (see […]

Reply
xe o tô cũ says:

February 10, 2017 at 10:00

If some one wants to be updated with most up-to-date technologies afterward
he must be go to see this site and be up to date every day.

Reply
24H mỹ phẩm says:

February 20, 2017 at 04:25

Good write-up. I definitely love this site. Keep it up!

Reply
Nisha Puri says:

December 11, 2017 at 15:01

Thanks for sharing useful information ..

Is there is any Google map scraper tool finding prospects and leads for your business needs by collecting various Information through Intelligence on prospective customers by utilizing advance web research and generating there interest in the offerings, as per Business Specifications.

Reply