Tutorial: Using Google Refine to clean mortgage data

The Swedish newspaper Svenska Dagbladet got a lot of attention last week for its succesful call to homeowners to share their interest rate of their mortgage loans. About 20 000 people from around the country has answered the call giving the paper – and us! (thanks for sharing the data) – some great insight into how the housing market works.

For a data nerd the dataset is also interesting from another perspective. It offers an excellent playground for aspiring datajournos that want to discover the power of Google Refine.

The data

SvD uses a Google Docs form to collect the data. The data is gathered in a public spreadsheet. As you will quickly notice the data is pretty messy. The size of the loan is written in different forms, bank names are spelled differently and so on.

Some good cleaning needs to be done. Enter Google Refine.

Open Refine, click Create Project and chose to import data from Google Docs. The url of the file can be found if you inspect the html code of this page and locate the iframe containing the spreadsheet:

https://docs.google.com/a/schibsted.se/spreadsheet/pub?key=0AksbqKrp3WArdEwwYktyel9nNWhBeDBnalpfaGhHZ1E&output=html

The import might take a couple of minutes as we are working with almost 20.000 rows of data (maybe even more by now).

Step 1: Correct misspelled bank names

Lets start with the names of the banks. If you browse the list you will soon see how the same bank could be spelled in a lot of different ways. Take Danske Bank for example:

  • Dansk Bank: 1
  • Danska Bank: 2
  • Danska bank: 1
  • danske: 8
  • Danske: 23
  • DANSKE BANK: 3
  • danske bank: 95
  • Danske bank: 142
  • Danske BAnk: 1
  • Danske Bank: 439
  • Danske bank: 1
  • Danske Bank (ÖEB): 1
  • Danske Banke: 1

We want to merge all these spellings into one single category. Doing this is simple in Google Refine. Click the arrow in the header of the column Bank? and choose Facet > Text facet.

You’ll see a list of 257 banks. Click Cluster to start merging. I’m not going to be able to give you a detailed account of what the different methods technically do, but if you play around with the settings you should be able to reduce the number of banks to about 100. Go through the list manually to do additional merges by changing the names yourself (move the mouse over the bank you want to rename and click edit).

Step 2: Turn strings into values

The interest rates (Ränta) has typically been entered with a comma as the decimal separator. Therefore Refine thinks it’s a string and not a number. Fixing this is pretty easy though. Click Ränta? > Edit Cells > Transform.

We are going to write a simple replace() function to do this:

replace(value, “,” , “.” )

That is: take the value in the cell and replace the comma with a dot. And to make it a number we put it all inside a toNumber() function:

toNumber(replace(value,”,”,”.”))

However, not everyone has entered the values in this format. If you click Ränta? > Facet > Numeric Facet you’ll see that there are almost 1500 non-numeric values. Most of these cells contain dots and have therefore wrongly been formated as dates.

Fortunately we can rescue these entries. Select the cells by clicking the non-numeric checkbox in the left-side panel. Then click Ränta? > Edit cells > Transform. First we are going to use the split() function separate the values:

split(value,”.”)

I value like 03.55.00 is transformed into an array: [ “03”, “55”, “00” ]. Secondly we want to take the first and the second value in this array and make a value out of them:

split(value,”.”)[0]+”.”+split(value,”.”)[1]

But the cell is still a string, so we need to use the toNumber() function once again to make it a number that we can use in calculations:

toNumber(split(value,”.”)[0]+”.”+split(value,”.”)[1])

And bam…

…we are down to 72 non-numeric values. And that will be good enough for now.

Step 3: Unify “1,9 mkr”, “3000000” and “2,270 mkr”

The real mess in this dataset is the price column (Hur stort är ditt lån?). Some state the size of the loan in millions, others thousands. Some use “kr”, others don’t. We want to try to unify these values.

Our first objective is to remove all letters. But before we do that we will change all the “t” to “000” to get rid of that category. You should know how by now:

replace(value, “t”, “000 “)

Now let’s get rid of the letters. Go to the transform window (Hur stort är ditt lån? > Edit cells > Transform). Once again we are going to let the replace() function does the work, but this time using a regular expression:

replace(value,/[A-Ö a-ö]/,””)

That is: take any character from A-Ö and repalce it with nothing. The following step should be a piece of cake by now. Replace the commas with dots and make numbers out of the strings:

toNumber(replace(value, “,”, “.”))

Lets examine what the dataset looks like by now. Click Hur stort är ditt lån? > Facet > Numeric faset. There are still a few hundred non-numeric values, some of which could be turned into numbers in the same way as we did with the interest rate, but lets leave those for now and instead focus on the 13.500 values that we got.

These values come in two formats: 1.1 (million) and 1100000. We want to use the latter in order to have the size of the loan in actual kronor. And we do this with an if() statement and some simple algebra:

  • If the value is less than 100, multiply by 1.000.000 (assuming that no one took a loan of more than 100 million kronor).
  • Else, do nothing.

Which translates into this in GREL:

if(value<100,value*1000000,value)

Step 4: Export and conclude

Export data as csv or any other format and open it in Excel or Google Docs. You’ll find my output here, uploaded on Google Docs.

The dataset is still not perfect. You will see a couple of odd outliers. But it is in much better conditions than when we started and we are able to draw some pretty solid conclusions. A pivot table report is a good way to sum up the results:

Bank Interest rate Number of replies
SEB 3.44 2701
Danske Bank 3.67 582
Sparbanken Nord 3.82 10
Skandiabanken 3.82 359
Nordea 3.87 1764
Handelsbanken 3.88 2884
Bättre Bolån 3.91 15
Swedbank 3.91 3132
ICA 3.93 10
Landshypotek 3.95 24
Icabanken 3.96 24
Sparbanken 1826 3.98 13
Länsförsäkringar 4 803
SBAB 4.03 673
Ålandsbanken 4.05 10
Sparbanken Öresund 4.07 113

SEB seems to be your bank of choice if your just looking for a good interest rate, in other words. But this is just the beginning. With the clean dataset in hand we can start to ask questions like: how does the size of the mortgage and the interest rate correlate? What is the relationship between interest rate and area of residence? And so on.

Good luck.

 


Combining D3 and Raphael to make a network graph

During the past week I have been working on a visualization for Sveriges Radio about Melodifestivalen, the Swedish qualification for the Eurovision Song Contest.

Every year there is a HUGE fuzz about this show over here in Sweden. I wanted to explore the songwriters in the competition from a dataist perspective. Who are the guys behind the scene?

If you follow Melodifestivalen a few years you will notice how many names occur year after year. By linking every songwriter to the years when they contributed I came up with this network graph.

In making this graph I managed to draw several quite interesting conclusions, for example that there are by far more men than women among the songwriters. And that there is a small elite of songwriters that does particularly well in the competition almost every year.

But this is not what I wanted to blog about today, but rather about the making of this visualization.

D3+Raphael=true

I have really come to like the Raphael.js library, but unfortunately it does not provide the same robust support for advanced data visualizations (for example network graphs) as its big brother D3.js. D3 on the other hand lacks Raphael’s broad browser compability, which is important when you are working with a public broadcaster like Sveriges Radio. So what if you could combine the two?

D3 has a really powerful library for making network graphs, or force-directed layouts. I used this library to make the foundation for the graph (take a look at the draft here). I won’t go into details about the code. The bulk is borrowed from this Stack Overflow thread.

The problem with force-directed layouts in D3 is that they quickly tend to become very burdensome for the browser. The user will have to wait for the graph to equilibrate. And that can take some time if you have 100+ nodes. But since I in this case only needed  a static layout I might as well have the computer do all those calculations in advance.

This is the idea: Raphael doesn’t have a built-in way to draw force-directed layouts, instead I take the svg-output from D3 and continue building my visualization (interactivity etc.) on top of that in Raphael. In brief, this is how I went about:

  • I started by copying the svg code from in Firebug (inspect the element and click Copy SVG) and pasted it into an empty document and saved it as an xml-file.
  • Iterated the nodes (circles) in the file and extracted the coordinates (cx,cy). I did this in Ruby using the Hpricot gem.
  • Saved the coordinates and the radius as Javascript objects: id:{ cx: 12.34, cy: 43.21, r: 5}
  • Here is the simple piece of code:
    doc = Hpricot(open("mf-graph.svg"))
     doc.search("//circle").each do |node|
       x = (node.attributes["cx"].to_f*100).round.to_f / 100 # I round the nodes to two decimals to reduce the size of the file.
       y = (node.attributes["cy"].to_f*100).round.to_f / 100
       r = (node.attributes["r"].to_f*100).round.to_f / 100
       id = node.attributes["id"]
       puts "#{id}: {x: #{x}, y: #{y}, r: #{r} },"
     end

With the coordinates of the nodes in hand it was easy to rebuild the graph in Raphael. This way I managed to vastly reduce the loading time and make it more cross-browser friendly. Here is the result once again:


Interactive: Haavisto’s great challenge

The first round of the presidential election in Finland was held on Sunday. It was a super-exciting race with Sauli Niinistö coming out on top, but with Pekka Haavisto of the Green Party as the great surprise. Haavisto finished second, just before the grand ol’ man of the Center Party, Paavo Väyrynen.

Haavisto is having great momentum and is quickly rising in the polls. But can he really take on Sauli Niinistö who has been the favorite for years already?

With this interactive visualization I will show that Pekka Haavisto face a great challenge in the second round. In the first round he got 570.000 votes. Niinistö got twice as much. 1,4 million voters will have to find a new candidate in the second round. Haavisto will have to get about 70 percent of those votes, which won’t be easy considering he is the liberal alternative of the two finalists and a lot of the undecided voters are conservatives.

Anyway here is the visualization. It lets you drag and drop the votes of the candidates that didn’t make it to the second round. Hopefully it gives you an idea of the effort that Haavisto will have to go through to stand a chance. But who knows? He has surprised us once already.

Open interactive visualization in new window.


Pimp your Google Map

Today I’m having my first visualization published at Helsingin Sanomat. A map of all the discharges in Helsinki during one night (21-22.12) showing reported crimes and accidents. Check it out:

Go to HS.fi.

It’s a pretty basic visualizations. I got a bunch of addresses that I geocoded using the Yahoos place finder API and projected on Google Maps. I’m a huge fan of a lot of the Google tools out there. Most notably the Docs platform, which step by step is phasing out my decency on MS Office (at least MS Word).

However, I have been quite sceptical towards Google Maps. Mostly because of its – in lack of a better word – ugliness. We’ve seen enough of that pale blue-green-yellow layout.

The good news is that Google has made it possible to easily style your maps using a simple online interface. In this example I’ve just inverted the lightness, added a pink hue and reduced the saturation of the water.

The options are endless. You can easily spend hours playing around with the different settings.

Not too many developers and designers seem to have found this tool yet, but my prediction for 2012 is that there will be a lot more styled Google Maps. And why not a portal with open map skins for anyone to use? I can’t find anything like that at the moment.


Interactive: The 100 richest people in Finland

November is the big gossip fest in Finland. Every year in the beginning of the month the tax records from last year are published. In other words: you get to know who made the most money.

Every year the Finnish media outlets do a very conventional presentation of this material. Page after page of lists of top-earners. Rarely does anyone do anything more creative with the data.

I gave it a shot. This is what came out:

Open the interactive visualization in new window.

How?

This is my first visualization in Raphael.js. Previously I have been working with D3 and Protovis, but the weak browser support of these two libraries is becoming a growing concern. Especially when one tries to do sell the work. However, I have found Raphael to be very useful and somehow more intuitive than D3.

The idea for this presentation came from the super-visualization, The Sexperience, by British Channel 4, a survey about the sex life of ordinary Brits (don’t worry, you can open it at work as well). I think the geniality behind this setup is that you can follow the respondents in the quiz from question to question, which gives the user the possibility to explore the relation between different questions instead of just looking at one question at a time. What are for example the sexual preferences of the people who lost their virginity late?

To some extent my presentation of the 100 top earners let you do the same thing. You can select the persons you are interested in and follow them through the presentation. This is a potential of the modern web that I think we will see much more of in the future.


One month Wall Street occupation mapped

For a month now we have been getting news about the Occupy movement that started on Wall Street in the beginning of October. There has been some arguing about the size of this movement. Guardian has made and interesting attempt to answer the question using crowdsourcing. I took a different approach.

The protest are coordinated at the site meetup.com. Here you find a complete list of the 2 506 occupy communities. I wrote a Ruby scraper that goes through this list and gathers information about all the meetups that has been arranged so far (more than 4 000 in a month).

I used the D3.js library to visualize the the list of meetups. This is the result (opens in new window):

The movement clearly peaked on Octboer 15th with meetups in around 600 different locations around the world. Protestors have continued to rally on Saturdays, but not with the same intensity.

Note that there is a number of protests that are missing here. I had some technical difficulties geocoding special characters (using the Yahoo Place Finder API), but that should not distort the picture of how the movement has developed. I didn’t have time to resolve the problem at the moment, but if someone knows how to get the API to understand odd characters such as ä, é and ü I’d appreciate the assistance.


Animation: World terrorism 2004-2011

After the terror attacks of nine-eleven the USA set out to fight terrorism. It has been a succesful quest in the sense that the Americans themselves have not been hit by terrorist since – but others have. According to statistics from the American Worldwide Incident Tracking System 37,798 lethal attacks have been carried out since 2004 killing 174,547. That’s a lot of nine-elevens.

Since the WITS provides such easily accessible data it would be a shame not to do something with it. So I did and this is what I ended up with (click to open in new window):

A few words about how I did this visualization.

The data

The basic data was really easy to gather here. I just filtered the attacks with ten or more casualties and downloaded the spreadsheet from WITS. The challenge was to geocode the places. I hadn’t done this before.

I wrote a Ruby script that called the Yahoo Place Finder API to transform the place names to longitudes and latitudes. For some reason a few locations got completely wrong coordinates (I started to wonder when the USA was suddenly hit by major attacks that I had never heard of). These were filtered away.

The visualization

This job provided two new challenges. One, working with dates. Two, working with maps. Just as the last time I used the JavaScript library d3.js to put the visualization together.

For the map I used the provided Albers example as a base script. With some assistance from this thread on Google groups I managed to figure out how to make a map in d3 (my heureka moment was when I realized that you can modify d3.geo.js to center the world map wherever you want).

Getting a hold of the dates in JavaScript became much easier with the date.js library. Highly recommended.

Final thoughts

A lot could have been done to polish the animation. One could have added some sort of timeline with key events, graphs and so on. But I think this is a pretty neat base for visualizing, lets says, earthquakes of other catastrophes. And you gotta like a viz on black.