Combining D3 and Raphael to make a network graph

During the past week I have been working on a visualization for Sveriges Radio about Melodifestivalen, the Swedish qualification for the Eurovision Song Contest.

Every year there is a HUGE fuss about this show over here in Sweden. I wanted to explore the songwriters in the competition from a dataist perspective. Who are the people behind the scenes?

If you follow Melodifestivalen for a few years you will notice how many names recur year after year. By linking every songwriter to the years when they contributed I came up with this network graph.

In making this graph I managed to draw several quite interesting conclusions, for example that there are far more men than women among the songwriters, and that there is a small elite of songwriters that performs particularly well in the competition almost every year.

But that is not what I wanted to blog about today. Rather, I want to write about the making of this visualization.


I have really come to like the Raphael.js library, but unfortunately it does not provide the same robust support for advanced data visualizations (for example network graphs) as its big brother D3.js. D3, on the other hand, lacks Raphael’s broad browser compatibility, which is important when you are working with a public broadcaster like Sveriges Radio. So what if you could combine the two?

D3 has a really powerful layout module for making network graphs, so-called force-directed layouts. I used it to build the foundation for the graph (take a look at the draft here). I won’t go into detail about the code; the bulk is borrowed from this Stack Overflow thread.

The problem with force-directed layouts in D3 is that they quickly become very burdensome for the browser. The user has to wait for the graph to reach equilibrium, and that can take some time if you have 100+ nodes. But since I only needed a static layout in this case, I might as well have the computer do all those calculations in advance.

This is the idea: Raphael doesn’t have a built-in way to draw force-directed layouts, so instead I take the SVG output from D3 and continue building my visualization (interactivity etc.) on top of that in Raphael. In brief, this is how I went about it:

  • I started by copying the SVG code in Firebug (inspect the element and click Copy SVG), pasted it into an empty document and saved it as an XML file.
  • Iterated over the nodes (circles) in the file and extracted the coordinates (cx, cy). I did this in Ruby using the Hpricot gem.
  • Saved the coordinates and the radius as Javascript objects: id:{ cx: 12.34, cy: 43.21, r: 5}
  • Here is the simple piece of code:
    require 'hpricot'
    require 'open-uri'

    doc = Hpricot(open("mf-graph.svg"))"//circle").each do |node|
      # Round the coordinates to two decimals to reduce the size of the file.
      x = (node.attributes["cx"].to_f * 100).round / 100.0
      y = (node.attributes["cy"].to_f * 100).round / 100.0
      r = (node.attributes["r"].to_f * 100).round / 100.0
      id = node.attributes["id"]
      puts "#{id}: {x: #{x}, y: #{y}, r: #{r} },"

With the coordinates of the nodes in hand it was easy to rebuild the graph in Raphael. This way I managed to vastly reduce the loading time and make the graph more cross-browser friendly. Here is the result once again:


Tutorial: How to extract street coordinates from Open Street Map geodata

I’ve spent almost a year learning about data-driven journalism and tools for analyzing and visualizing data. I have now become confident enough to think that I might even be able to teach someone else something. So here goes: my first tutorial.

The task

Earlier this fall Helsingin Sanomat published a huge dump of price data from Oikotie, a Finnish marketplace for apartments. I had an idea to build a kind of heat map where every street would be colored based on the average price of its apartments.

With the JavaScript library Polymaps you can easily make stylish web maps. The problem is that you need an overlay GeoJSON layer with the colored streets. Finnish authorities do not – yet! – provide open street-level geodata. Fortunately Open Street Map does.

From .shp to .geojson

The raw data from Open Street Map is downloadable in shapefile format. So in my case I downloaded the shapefile package for Finland and opened it in Quantum GIS (Layer > Add vector layer). This is what the finland_highway.shp file looks like.

This is A LOT of geodata, but in this case I’m only interested in the Helsinki region. So I zoom in on Helsinki and select, roughly, the streets that I’m interested in using the lasso tool (select object tool).

To export the selected part of the map to the GeoJSON format that Polymaps can read, choose Layer > Save Selection as vector file and GeoJSON as your format. Save! Done!

Filtering the GeoJSON file

We’ve got our GeoJSON file. Now there is just one problem: it is huge, 18 MB! There are a lot of streets in there that we don’t need, so we want to filter them out. This will require some programming skills. I turn to Ruby.

This is the structure of an object in the GeoJSON file:

{ "type": "Feature", "properties": { "TYPE": "cycleway", "NAME": "", "ONEWAY": "", "LANES": 0.000000 }, "geometry": { "type": "LineString", "coordinates": [ [ 24.773350, 60.203288 ], [ 24.774540, 60.203008 ], [ 24.777840, 60.202300 ], [ 24.781013, 60.201565 ], [ 24.781098, 60.201546 ], [ 24.782735, 60.201199 ], [ 24.784300, 60.201045 ], [ 24.785846, 60.201085 ], [ 24.787381, 60.201133 ], [ 24.787812, 60.201169 ], [ 24.788101, 60.201207 ], [ 24.797454, 60.201623 ], [ 24.797636, 60.201620 ], [ 24.799625, 60.201405 ], [ 24.801848, 60.201089 ] ] } }

This street apparently does not have a name, but the others do, which means I can extract the streets that I’m interested in based on their names.

In another array I list the streets that I want to be included in the visualization. Like this:

streets = [
# and so on...

I now want to tell the computer to iterate through the GeoJSON file and extract the streets that are included in the streets array. In practice I approach it the other way around: I check which streets in the GeoJSON file are not included in the array and remove them.

This is the code:

require 'json'

def process(data)
  json = JSON.parse(data)

  #-- STEP 1: Go through the geojson file and add the index numbers ("i") of the street names that are not found in the array "streets" to a new array ("del") ---
  i = 0
  del = []

  json["features"].each do |a|
    unless $streets.include? a["properties"]["NAME"]
      del << i
    i += 1

  #-- STEP 2: Iterate through the del array from the back and remove the streets with the corresponding index numbers in the geojson data ---
  del.reverse.each do |d|

  #-- Open a new json file and save the filtered geojson ---
  File.open("hki.json", 'a') { |f| f.write(JSON.generate(json)) }
In this case data is the GeoJSON file read as a string, and $streets the array of the selected streets. And voilà: you’ve got yourself a new GeoJSON file. In my case I managed to shrink it down to 1.6 MB.

The visualization

I now have what I wanted from the beginning: the geographical coordinates of the streets that I want to plot, which means I’m halfway to making my visualization.

I won’t go into detail on how the actual visualization was put together. The short version is that I used this pavement quality example as a base script and made some small modifications. The price data is then picked from a separate file. This is the result, the housing prices in Helsinki, street by street:

Open the full map in new window.

Not too shabby, right? I managed to sell this visualization to Hufvudstadsbladet which now runs it on their website.



One month of Wall Street occupation mapped

For a month now we have been getting news about the Occupy movement that started on Wall Street in mid-September. There has been some arguing about the size of this movement. The Guardian has made an interesting attempt to answer the question using crowdsourcing. I took a different approach.

The protests are coordinated at the site. Here you find a complete list of the 2 506 occupy communities. I wrote a Ruby scraper that goes through this list and gathers information about all the meetups that have been arranged so far (more than 4 000 in a month).

I used the D3.js library to visualize the list of meetups. This is the result (opens in new window):

The movement clearly peaked on October 15th with meetups in around 600 different locations around the world. Protestors have continued to rally on Saturdays, but not with the same intensity.

Note that a number of protests are missing here. I had some technical difficulties geocoding special characters (using the Yahoo Place Finder API), but that should not distort the picture of how the movement has developed. I haven’t had time to resolve the problem yet, but if someone knows how to get the API to understand characters such as ä, é and ü I’d appreciate the assistance.

Finding the optimal marathon strategy in data

After half a year of training I ran my third marathon this weekend in Stockholm. It was a great race and I managed to take three minutes off my own PB, finishing in 3.10.

However, the race got me thinking: what is the optimal strategy for a marathon race? That is, how fast should I run in the beginning? I turned to data to look for the answer.

The Data

Stockholm Marathon gathers up to 20 000 runners every year. For each runner split times are recorded every 5 km. So there is plenty of data to work with here.

To get the data I wrote a Ruby script that scans the results of 2010 and records the results of all the (male) runners that participated in both 2010 and 2009. The scrape gave me a dataset of 3965 runners. For each runner I stored the variables age, finish time and time at 10 km.

To make the two races comparable I multiplied the times of 2009 by 0.967, as it turned out that the average time in 2010 was only 4.04.44, compared to 4.12.58 in 2009. Presumably the conditions were better in 2010.
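The factor is simply the ratio between the two years' average finish times. A quick sketch of that arithmetic:

```ruby
# Convert a h.mm.ss time to seconds.
def to_seconds(time)
  h, m, s = time.split(".").map(&:to_i)
  h * 3600 + m * 60 + s
end

avg_2010 = to_seconds("4.04.44")   # average finish time 2010
avg_2009 = to_seconds("4.12.58")   # average finish time 2009

factor = (avg_2010.to_f / avg_2009).round(3)
puts factor  # => 0.967
```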

The Analysis

We assume that all runners are equally good in 2009 and 2010. This means that a good run can be defined as a run where the time is better than the other year’s. If runner A does 4.00 in 2009 and 3.45 in 2010 he has performed better in 2010. But the question is: did the starting pace affect the result?

Let’s start by splitting the runners into three groups: firstly the ones that improved their times by at least 10 percent in 2010 (green), secondly the ones that did close to the same result in 2009 and 2010 (blue), and thirdly the ones that did at least 10 percent worse in 2010 (red).
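The grouping can be expressed as a simple threshold on the ratio between the two (normalized) finish times. A minimal sketch, with times given in seconds:

```ruby
# Classify a runner by comparing the normalized 2010 time to the 2009 time.
def group(time_2009, time_2010)
  ratio = time_2010.to_f / time_2009
  if ratio <= 0.9
    :green  # improved by at least 10 percent
  elsif ratio >= 1.1
    :red    # did at least 10 percent worse
  else
    :blue   # close to the same result
  end
end

puts group(14400, 12600)  # 4.00.00 in 2009, 3.30.00 in 2010 => green
```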

Fear not if you don’t get this picture immediately. I’ll try to explain:

  • The y-axis is the relative pace at 10 km. Values above 0 mean that the starting pace was slower than the average pace for the whole race; values below 0 mean that the starting pace was faster than the average pace for the whole race.
  • The x-axis is the finish time for 2010 divided by the finish time for 2009. If it is less than 1, the time in 2010 was better than the time in 2009.
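For one runner, the y-value can be computed roughly like this (a sketch under my own assumptions: times in seconds, the 10 km pace divided by the overall pace, minus one):

```ruby
# Relative starting pace: > 0 means a slower start than the overall
# average pace, < 0 a faster start. Times are in seconds.
def relative_start_pace(time_10k, finish_time)
  pace_10k = time_10k / 10.0        # pace (s/km) for the first 10 km
  pace_avg = finish_time / 42.195   # pace (s/km) for the whole race
  (pace_10k / pace_avg - 1).round(3)
end

# A runner finishing in 4.00.00 (14 400 s) who passed 10 km after 3 200 s:
puts relative_start_pace(3200, 14400)  # negative: a fast start
```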

What does the graph tell us? As you can see, the green runners (the ones that improved their times in 2010 compared to 2009) start the race at almost the same pace as they keep for the rest of the race. In other words: the starting pace is almost the same as the average pace for the whole 42 km.

The red runners (the ones that did worse in 2010 than in 2009), on the other hand, don’t manage to keep up the speed. Their first 10 km are considerably faster than their average speed for the rest of the race.

How big is the difference? An average green runner (heading for a time of 4.00) does the first 10 km 20 seconds faster than his average speed for the whole race. The blue runner: 36 seconds faster. The red runner: 1 minute 24 seconds faster.

The Conclusion

What is the optimal marathon strategy then? It should come as no surprise to anyone that has ever run 42 km that the race only really starts after 30 km. The difference between the runners that managed to improve their times between 2009 and 2010 and the ones that didn’t is that the latter group was not able to keep a steady pace throughout the race.

Myself, I started at an optimistic pace of 4.15–4.25 min/km and ended up around 4.45–4.55 min/km. Not a perfect way to set up a race according to this analysis, but good enough for a PB.

A Sidenote

So, we now know that a steady pace is what you should aim for when you run a marathon. Starting too fast is a risky strategy. As we could see in the previous graph many of the fastest starters in 2010 ended up performing poorly. But what about the fast starters in 2009? Did they learn from their mistakes?

Actually, yes. In this next graph I have singled out the 10 percent of the runners that started fastest in 2009 and compared their results in 2009 and 2010.

The setup is the same as in the previous graph: the y-axis is the relative starting pace, the x-axis the finish time compared to the other year (for 2010: 2010/2009; for 2009: 2009/2010). The blue dots represent 2009, the green 2010.

Two conclusions can be made:

  • The majority of the runners started slower in 2010 (or they were able to keep the pace better).
  • Most of the runners improved their times (apparently they had learnt something).

Want to explore the data yourself?

Twitter proofs: winter is good

I’ve been getting a lot of complaints on my Facebook feed lately about the weather just because of a few cold days. For some reason people expect February to be a spring month nowadays. I’d say global warming isn’t quite there yet.

I wanted to find out what the (Swedish) Twitter community has had to say about the weather the last couple of days. For me (a person who enjoys a good cold and snowy winter) the results were positive:

“Wonderful”, “beautiful” and “fantastic” are some of the words Twitter users have used to describe the weather the last couple of days. I think they have been right.

How did I do this? It was quite simple actually.

  • I started off with a simple scrape of a Twitter search for “väder” (only in Swedish and excluding posts containing “darth” – the search engine does not see the difference between “väder” and “vader”). Again, ScraperWiki provided a good base for the script.
  • I then pasted all the scraped tweets into Wordle and removed a bunch of irrelevant small words and some verbs.

Easy as that.

Mapping Ratata: Who’s Hot?

I wanted to play around in Gephi a bit more after my previous post about visualizing my social network on Facebook. So for my second project I turned my eyes to Ratata, a Swedish blog community in Finland with just over 1200 bloggers. A friend of mine, Poppe (also on Ratata), has been talking about analyzing the Swedish blogosphere. I hope he doesn’t mind me “borrowing” the idea.

I have almost no prior programming experience, but for some time now I have been trying to learn more about screen scraping. Guided mostly by Dan Nguyen’s brilliant tutorial on coding for journalists I have started to know my way around Ruby. ScraperWiki also provides good guidance for those of us who still mostly do copy-paste programming.

After two days of trial and error I managed to put together a script that extracts all the links to fellow Ratata blogs from all the 1207 blogs. That gave me a data set of almost 2000 connections (due to some technical issues I had to exclude a couple of blogs). I obviously wanted to find out who is most popular. That is, who gets the most in-links? This is the result (click for full scale pdf):

The size depends on the number of in-links. Karin, one of the founders of the blog community, is perhaps not too surprisingly number one, with 70 other Ratata bloggers linking to her, followed by Mysfabon (43) and Kisimyran (37).
You’ll also notice that the gap between the haves and the have-nots is big when it comes to links. The core of the map is surrounded by a cloud of unconnected blogs (and shattered dreams of blogger fame perhaps?).

I’ve uploaded the Gephi file if you want to take a closer look at the dataset yourself.

Here is the complete top ten:

Blog	Links
Karin	70
Mysfabon	43
Kisimyran	37
	33
	33
	32
	31
	30
	28
	27