Combining D3 and Raphael to make a network graph

During the past week I have been working on a visualization for Sveriges Radio about Melodifestivalen, the Swedish qualification for the Eurovision Song Contest.

Every year there is a HUGE fuss about this show here in Sweden. I wanted to explore the songwriters in the competition from a dataist perspective. Who are the people behind the scenes?

If you follow Melodifestivalen for a few years you will notice how many names recur year after year. By linking every songwriter to the years they contributed, I came up with this network graph.

In making this graph I was able to draw several quite interesting conclusions: for example, that there are far more men than women among the songwriters, and that a small elite of songwriters does particularly well in the competition almost every year.

But that is not what I wanted to blog about today. Today is about the making of this visualization.

D3+Raphael=true

I have really come to like the Raphael.js library, but unfortunately it does not provide the same robust support for advanced data visualizations (for example network graphs) as its big brother D3.js. D3, on the other hand, lacks Raphael’s broad browser compatibility, which is important when you are working with a public broadcaster like Sveriges Radio. So what if you could combine the two?

D3 has really powerful support for making network graphs, or force-directed layouts. I used it to build the foundation for the graph (take a look at the draft here). I won’t go into detail about the code; the bulk of it is borrowed from this Stack Overflow thread.

The problem with force-directed layouts in D3 is that they quickly become very burdensome for the browser. The user has to wait for the graph to equilibrate, and that can take some time if you have 100+ nodes. But since I only needed a static layout in this case, I might as well have the computer do all those calculations in advance.

This is the idea: Raphael doesn’t have a built-in way to draw force-directed layouts, so instead I take the SVG output from D3 and continue building my visualization (interactivity and so on) on top of that in Raphael. In brief, this is how I went about it:

  • I started by copying the SVG code in Firebug (inspect the element and click Copy SVG), pasted it into an empty document and saved it as an XML file.
  • Iterated over the nodes (circles) in the file and extracted the coordinates (cx, cy). I did this in Ruby using the Hpricot gem.
  • Saved the coordinates and the radius as JavaScript objects: id: { cx: 12.34, cy: 43.21, r: 5 }
  • Here is the simple piece of code:
    require 'hpricot'

    doc = Hpricot(open("mf-graph.svg"))
    doc.search("//circle").each do |node|
      # Round the values to two decimals to reduce the size of the output file.
      x  = (node.attributes["cx"].to_f * 100).round.to_f / 100
      y  = (node.attributes["cy"].to_f * 100).round.to_f / 100
      r  = (node.attributes["r"].to_f * 100).round.to_f / 100
      id = node.attributes["id"]
      puts "#{id}: {x: #{x}, y: #{y}, r: #{r} },"
    end

With the coordinates of the nodes in hand it was easy to rebuild the graph in Raphael. This way I managed to vastly reduce the loading time and make it more cross-browser friendly. Here is the result once again:


Interactive: Haavisto’s great challenge

The first round of the presidential election in Finland was held on Sunday. It was a super-exciting race, with Sauli Niinistö coming out on top and Pekka Haavisto of the Green Party as the big surprise. Haavisto finished second, just ahead of the grand old man of the Center Party, Paavo Väyrynen.

Haavisto has great momentum and is rising quickly in the polls. But can he really take on Sauli Niinistö, who has been the favorite for years?

With this interactive visualization I want to show that Pekka Haavisto faces a great challenge in the second round. In the first round he got 570,000 votes; Niinistö got twice as many. 1.4 million voters will have to find a new candidate in the second round, and Haavisto will have to get about 70 percent of those votes. That won’t be easy, considering he is the liberal alternative of the two finalists and many of the now undecided voters are conservatives.
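
A rough back-of-the-envelope check of that 70 percent figure, using the rounded numbers above and assuming every first-round voter turns out again in the second round:

haavisto = 570_000
niinisto = 2 * haavisto      # "twice as many", roughly 1.14 million
freed    = 1_400_000         # voters whose candidate was knocked out

total  = haavisto + niinisto + freed
needed = total / 2.0 - haavisto    # votes Haavisto still needs for half of the total
share  = needed / freed

puts "Haavisto needs about #{(share * 100).round} percent of the freed votes"
# => about 70 percent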

Anyway here is the visualization. It lets you drag and drop the votes of the candidates that didn’t make it to the second round. Hopefully it gives you an idea of the effort that Haavisto will have to go through to stand a chance. But who knows? He has surprised us once already.

Open interactive visualization in new window.


Interactive: The 100 richest people in Finland

November is the big gossip fest in Finland. Every year, at the beginning of the month, the previous year’s tax records are published. In other words: you get to know who made the most money.

Every year the Finnish media outlets do a very conventional presentation of this material. Page after page of lists of top-earners. Rarely does anyone do anything more creative with the data.

I gave it a shot. This is what came out:

Open the interactive visualization in new window.

How?

This is my first visualization in Raphael.js. Previously I have worked with D3 and Protovis, but the weak browser support of those two libraries is a growing concern, especially when you are trying to sell your work. I have found Raphael to be very useful and somehow more intuitive than D3.

The idea for this presentation came from The Sexperience, a superb visualization by Britain’s Channel 4 based on a survey about the sex lives of ordinary Brits (don’t worry, you can open it at work as well). The brilliance of that setup is that you can follow the respondents from question to question, which lets the user explore the relationship between different questions instead of just looking at one question at a time. What are, for example, the sexual preferences of the people who lost their virginity late?

To some extent my presentation of the 100 top earners lets you do the same thing. You can select the people you are interested in and follow them through the presentation. This is a capability of the modern web that I think we will see much more of in the future.


Tutorial: How to extract street coordinates from Open Street Map geodata

I’ve spent almost a year learning about data-driven journalism and tools for analyzing and visualizing data. I have now become confident enough to think that I might even be able to teach someone else something. So here goes: my first tutorial.

The task

Earlier this fall Helsingin Sanomat published a huge dump of price data from Oikotie, a Finnish marketplace for apartments. I had the idea of building a kind of heat map where every street is colored based on the average price of its apartments.

With the JavaScript library Polymaps you can easily make stylish web maps. The problem is that you need an overlay GeoJSON layer with the colored streets. Finnish authorities do not – yet! – provide open street-level geodata. Fortunately Open Street Map does.

From .shp to .geojson

The raw data from Open Street Map is downloadable in shapefile format. In my case I downloaded the shapefile package for Finland and opened it in Quantum GIS (Layer > Add vector layer). This is what the finland_highway.shp file looks like.

This is A LOT of geodata, but in this case I’m only interested in the Helsinki region. So I zoom in on Helsinki and select, roughly, the streets that I’m interested in using the lasso tool (the select object tool).

To export the selected part of the map to the GeoJSON format that Polymaps can read, choose Layer > Save Selection as vector file and pick GeoJSON as the format. Save! Done!

Filtering the GeoJSON file

We’ve got our GeoJSON file. Now there is just one problem: it is huge, 18 MB! There are a lot of streets in here that we don’t need, so we want to filter them out. This requires some programming. I turn to Ruby.

This is the structure of an object in the GeoJSON file:

{ "type": "Feature", "properties": { "TYPE": "cycleway", "NAME": "", "ONEWAY": "", "LANES": 0.000000 }, "geometry": { "type": "LineString", "coordinates": [ [ 24.773350, 60.203288 ], [ 24.774540, 60.203008 ], [ 24.777840, 60.202300 ], [ 24.781013, 60.201565 ], [ 24.781098, 60.201546 ], [ 24.782735, 60.201199 ], [ 24.784300, 60.201045 ], [ 24.785846, 60.201085 ], [ 24.787381, 60.201133 ], [ 24.787812, 60.201169 ], [ 24.788101, 60.201207 ], [ 24.797454, 60.201623 ], [ 24.797636, 60.201620 ], [ 24.799625, 60.201405 ], [ 24.801848, 60.201089 ] ] } }

This particular street apparently does not have a name, but most of the others do, which means I can extract the streets I’m interested in based on their names.

In a separate array I list the streets that I want included in the visualization, like this:

$streets = [
  'Mannerheimintie',
  'Hämeentie',
  'Ulvilantie'
  # and so on...
]

I now want the computer to iterate through the GeoJSON file and extract the streets that are included in the $streets array. In practice I approach it the other way around: I check which streets in the GeoJSON file are not included in the array and remove them.

This is the code:

require 'json'

def process(data)
  json = JSON.parse(data)

  #-- STEP 1: Go through the GeoJSON and collect the index numbers ("i") of the
  #-- features whose street names are not found in the $streets array into a new array ("del")
  i = 0
  del = []

  json["features"].each do |feature|
    unless $streets.include? feature["properties"]["NAME"]
      del.push(i)
    end
    i += 1
  end

  #-- STEP 2: Walk through the del array from the back and remove the features
  #-- with the corresponding index numbers from the GeoJSON data ---
  del.reverse.each do |d|
    json["features"].delete_at(d)
  end

  #-- Write the filtered GeoJSON to a new file ("w" so reruns overwrite it) ---
  File.open("hki.json", "w") { |f| f.write(JSON.generate(json)) }
end

Here data is the contents of the GeoJSON file and $streets is the array of selected streets. And voilà: you’ve got yourself a new GeoJSON file. In my case I managed to shrink it down to 1.6 MB.
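
For completeness, this is roughly how the script is then run (the input file name here is just a placeholder for whatever you saved your selection as in Quantum GIS):

# Feed the exported GeoJSON to the filter; hki.json comes out the other end.
process(File.read("helsinki_selection.geojson"))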

The visualization

I now have what I wanted from the beginning: the geographical coordinates of the streets I want to plot, which means I’m halfway to my visualization.

I won’t go into detail on how the actual visualization was put together. The short version is that I used this pavement quality example as a base script and made some small modifications. The price data is then read in from a separate file. This is the result, housing prices in Helsinki street by street:

Open the full map in new window.

Not too shabby, right? I managed to sell this visualization to Hufvudstadsbladet, which now runs it on its website.


One month Wall Street occupation mapped

For a month now we have been getting news about the Occupy movement that started on Wall Street at the beginning of October. There has been some arguing about the size of this movement. The Guardian has made an interesting attempt to answer the question using crowdsourcing. I took a different approach.

The protests are coordinated through the site meetup.com, where you find a complete list of the 2 506 Occupy communities. I wrote a Ruby scraper that goes through this list and gathers information about all the meetups that have been arranged so far (more than 4 000 in a month).
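
I won’t publish the whole scraper, but the core of it looks roughly like this. The URL and the CSS selectors below are placeholders, since the real ones depend on how meetup.com structures its listing pages:

require 'open-uri'
require 'nokogiri'

# Hypothetical sketch: fetch one listing page and pull out the meetups.
page = Nokogiri::HTML(URI.open("http://www.meetup.com/occupytogether/"))

meetups = page.css(".meetup-item").map do |item|
  {
    :community => item.css(".community-name").text.strip,
    :city      => item.css(".city").text.strip,
    :date      => item.css(".date").text.strip
  }
end

puts "Found #{meetups.length} meetups on this page"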

I used the D3.js library to visualize the list of meetups. This is the result (opens in a new window):

The movement clearly peaked on October 15th, with meetups in around 600 locations around the world. Protesters have continued to rally on Saturdays, but not with the same intensity.

Note that a number of protests are missing here. I had some technical difficulties geocoding special characters (using the Yahoo Place Finder API), but that should not distort the picture of how the movement has developed. I haven’t had time to resolve the problem yet, but if someone knows how to get the API to understand characters such as ä, é and ü, I’d appreciate the assistance.


Animation: World terrorism 2004-2011

After the terror attacks of nine-eleven the USA set out to fight terrorism. It has been a successful quest in the sense that Americans themselves have not been hit by terrorists since – but others have. According to statistics from the American Worldwide Incidents Tracking System (WITS), 37,798 lethal attacks have been carried out since 2004, killing 174,547 people. That’s a lot of nine-elevens.

Since the WITS provides such easily accessible data it would be a shame not to do something with it. So I did and this is what I ended up with (click to open in new window):

A few words about how I did this visualization.

The data

The basic data was really easy to gather here. I just filtered the attacks with ten or more casualties and downloaded the spreadsheet from WITS. The challenge was to geocode the places. I hadn’t done this before.

I wrote a Ruby script that called the Yahoo Place Finder API to transform the place names into latitudes and longitudes. For some reason a few locations got completely wrong coordinates (I started to wonder when the USA had suddenly been hit by major attacks I had never heard of). These were filtered out.
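
The geocoding script was essentially a loop around a call like the one below. I am writing the request from memory, so treat the endpoint, parameters and response fields as an approximation of the Place Finder API rather than gospel:

require 'open-uri'
require 'cgi'
require 'json'

# Rough sketch of the geocoding call. Endpoint, parameters and response
# structure are reconstructed from memory and may not be exact.
def geocode(place, appid)
  url    = "http://where.yahooapis.com/geocode?q=#{CGI.escape(place)}&flags=J&appid=#{appid}"
  data   = JSON.parse(URI.open(url).read)
  result = data["ResultSet"]["Results"].first
  [result["latitude"].to_f, result["longitude"].to_f]
end

lat, lon = geocode("Baghdad, Iraq", "MY_APP_ID")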

The visualization

This job presented two new challenges: one, working with dates; two, working with maps. Just as last time, I used the JavaScript library d3.js to put the visualization together.

For the map I used the provided Albers example as a base script. With some assistance from this thread on Google Groups I managed to figure out how to make a map in d3 (my eureka moment was when I realized that you can modify d3.geo.js to center the world map wherever you want).

Getting a hold of the dates in JavaScript became much easier with the date.js library. Highly recommended.

Final thoughts

A lot could have been done to polish the animation. One could have added some sort of timeline with key events, graphs and so on. But I think this is a pretty neat base for visualizing, let’s say, earthquakes or other catastrophes. And you gotta like a viz on black.


Interactive: Athletics world record progression

The IAAF athletics world championships just came to an end with the one and only world record set by Jamaica in the short relay. This (the lack of world records) comes as no surprise. It is getting harder and harder to beat the old records, as the graph below shows.

Number of new world records per year.

More than 2 000 official IAAF world records have been set since the beginning of the 20th century. In other words: a very interesting set of data. Inspired by this visualization by The New York Times from 2008, I decided to do my own mashup with the data. This is the result (click to open in a new window):

Interactive visualization: click to open in new window.

The data

There were two challenges with this visualization: getting the data and visualizing it. It was surprisingly difficult to find world record data in an accessible format. Wikipedia provides some help, but the data contains plenty of holes. Instead I had to turn to the only thing the IAAF has to offer: a 700-page PDF with all the athletics statistics you can think of. The open data gospel has apparently not reached the IAAF quite yet.

On the other hand, this was an opportunity to practice some Excel formatting skills. Copy-pasting the data into Excel was easy; transforming it into readable columns and rows took some time. But I did it, and you’ll find the result in Google Docs. I didn’t figure out how to make Google Docs format seconds, tenths and hundredths correctly, but if you open the spreadsheet in Excel you should be able to get the correct times.

With the data in a tidy spreadsheet, I indexed all the results with 1951 as the base year (or the first recorded record for newer events) and manually added the newest records, such as the one set by the Jamaican relay team.
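
The indexing itself is simple arithmetic. A simplified sketch of the idea (which way you divide depends on whether a better record is a lower time or a longer distance):

# Express every record relative to the base-year record (index 100).
def index(value, base, lower_is_better)
  if lower_is_better
    100.0 * base / value    # running events: a lower time is better
  else
    100.0 * value / base    # jumps and throws: a longer mark is better
  end
end

index(9.58, 10.2, true)     # a 100 m time against a 10.2 s base record => ~106.5
index(8.95, 8.13, false)    # a long jump against an 8.13 m base record => ~110.1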

The visualization

For the first time I used the JavaScript library d3.js for a visualization. With my short Protovis background, d3.js was a charm to work with. The main advantages of d3.js compared to Protovis are that it provides much better animation support and makes it easier to interact with other elements on the page (such as div tags).

As a d3 n00b I used Jan Willem Tulp’s tutorial as a base script and built around that. The d3.js documentation is still not comprehensive, so for a beginner it takes some trial and error to make progress, but this is undoubtedly a very powerful library for making hand-crafted interactive visualizations.

All in all, a very educational process and a result that I’m quite content with.

Post scriptum

Do you, by the way, know which is the sixth greatest athletics nation of all time (measured in number of world records)? FINLAND! A bit hard to believe in a year like this, when none of our athletes made the top eight.

Country Number of records
USA 367
Soviet Union 199
East Germany 109
Great Britain 55
Germany 51
Finland 49
Poland 47
Australia 41
West Germany 39
Russia 36

The Finnish “immigration critics” blog network

In my previous post I mapped the network of anti-jihadist bloggers mentioned in the manifesto of Anders Behring Breivik. This afternoon I stumbled upon a tweet by Martti Tulenheimo requesting something similar on the Finnish blogosphere.

I had actually tried to do something like this a few days ago, but didn’t manage to write the script I wanted. The plan was to use Yahoo’s inbound link API and have it build the network automatically, but I couldn’t figure out how to include only links from the main page of a site (as Analyze Backlinks lets you do). So instead I took a more manual approach.

The method

As in the previous post, I used Analyze Backlinks to list incoming links. The backlink analysis was done on the following blogs:

Site Links
http://www.jussi-halla.com/scripta 93
http://kullervokalervonpoika.wordpress.com/ 38
http://octavius1.wordpress.com/ 28
http://www.hommaforum.org 23
http://yrjoperskeles.blogspot.com/ 8
http://marialohela.fi/blogi/ 7
http://izrailit.blogspot.com/ 6
http://vasarahammer.blogspot.com/ 6
http://turkkila.blogspot.com/ 5
http://tuuritapio.blogspot.com/ 3
http://nikopol2008.blogspot.com/ 2
http://reinoblog.blogspot.com/ 2
http://laivaontaynna.blogspot.com/ 1

These are the blogs listed as “critical voices” on the blog of Jussi Halla-aho (or “the master”, as he is referred to on the Homma discussion board) and they serve as a good starting point for this purpose.

I ran all of the blogs through Analyze Backlinks to get a network of 133 blogs. I have obviously not read all of these blogs to check whether it is correct to label them as immigration critics. The 133 blogs included in this network are merely sites that link to at least one of the blogs mentioned above.

The results have to be read with some caution. I am not sure how reliable Analyze Backlinks is. Their own disclaimer warns that the results may not be accurate.

The results

Again I used Gephi to draw the network and this is what came out (click to open as pdf):

Click to open as pdf.

The size of a site is determined by the number of inlinks, that is, the number of sites that link to the page (not only front-page links are counted here). A large number of inlinks indicates that the site is popular, so the big dots should be seen as key nodes in the Finnish immigration-critical blogosphere.

However, I am not quite sure about the quality of the inlink counts that Analyze Backlinks provides. I had a quick look at the numbers that Yahoo’s backlink API throws out and there seems to be a significant discrepancy.

So this analysis is far from perfect, but it’s a start, and it gives you a decent idea of the most important sites in the Finnish immigration-critical blogosphere. If you have thoughts on how the methodology could be improved, I would love to hear your comments.


Finding the optimal marathon strategy in data

After half a year of training I ran my third marathon this weekend, in Stockholm. It was a great race and I managed to take three minutes off my PB, finishing at 3:10.

However, the race got me thinking: what is the optimal strategy for a marathon race? That is, how fast should I run in the beginning? I turned to data to look for the answer.

The Data

Stockholm Marathon gathers up to 20 000 runners every year. For each runner split times are recorded every 5 km. So there is plenty of data to work with here.

To get the data I wrote a Ruby script that scans the results of 2010 and records the results of all the (male) runners that participated in both 2010 and 2009. The scrape gave me a dataset of 3965 runners. For each runner I stored the variables age, finish time and time at 10 km.

To make the two races comparable I multiplied the 2009 times by 0.967, since it turned out that the average time in 2010 was only 4:04:44, compared to 4:12:58 in 2009. Presumably the conditions were better in 2010.
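
Where the 0.967 comes from is simple arithmetic, just the ratio between the two years’ average finish times:

# The normalization factor is the ratio of the average finish times.
avg_2010 = 4 * 3600 +  4 * 60 + 44   # 4:04:44 => 14684 seconds
avg_2009 = 4 * 3600 + 12 * 60 + 58   # 4:12:58 => 15178 seconds

factor = avg_2010.to_f / avg_2009    # => 0.967...
# Every 2009 time is multiplied by this factor to make the two years comparable.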

The Analysis

We assume that every runner is equally good in 2009 and 2010. This means that a good run can be defined as a run where the time is better than in the other year. If runner A does 4:00 in 2009 and 3:45 in 2010, he has performed better in 2010. But the question is: did the starting pace affect the result?

Let’s start by splitting the runners into three groups: first, the ones that improved their times by at least 10 percent in 2010 (green); second, the ones that did roughly the same in 2009 and 2010 (blue); and third, the ones that did at least 10 percent worse in 2010 (red).

Fear not if you don’t get this picture immediately. I’ll try to explain:

  • The y-axis is the relative pace at 10 km. Values above 0 mean that the starting pace was slower than the runner’s average pace for the whole race; values below 0 mean that the starting pace was faster than the average pace.
  • The x-axis is the finish time for 2010 divided by the finish time for 2009. A value below 1 means that the 2010 time was better than the 2009 time. (A short code sketch of both values follows this list.)
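
For a single runner, the two plotted values are computed roughly like this. A small sketch of my own bookkeeping, with all times in seconds:

MARATHON_KM = 42.195

def relative_start_pace(time_10k, finish_time)
  # 10 km at the runner's average pace for the whole race
  expected_10k = finish_time * 10 / MARATHON_KM
  # > 0 means a slower start than the average pace, < 0 a faster start
  time_10k / expected_10k - 1
end

def year_ratio(finish_2010, finish_2009)
  # < 1 means the 2010 run was the better one
  finish_2010.to_f / finish_2009
end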

What does the graph tell us? As you can see, the green runners (the ones that improved their times in 2010 compared to 2009) start the race at almost the same pace as they keep for the rest of the race. In other words: their starting pace is almost the same as their average pace for the whole 42 km.

The red runners (the ones that did worse in 2010 than in 2009), on the other hand, don’t manage to keep up the speed. Their first 10 km is considerably faster than their average speed for the rest of the race.

How big is the difference? An average green runner (one heading for a finish time of 4:00) does the first 10 km 20 seconds faster than his average pace for the whole race would give. The average blue runner is 36 seconds faster, and the average red runner 1 minute 24 seconds faster.

The Conclusion

What is the optimal marathon strategy, then? It should come as no surprise to anyone who has ever run 42 km that the race only starts after 30 km. The difference between the runners that managed to improve their times between 2009 and 2010 and the ones that didn’t is that the latter group was not able to keep a steady pace throughout the race.

I myself started at an optimistic pace of 4:15–4:25 min/km and ended up around 4:45–4:55 min/km. Not a perfect way to set up a race according to this analysis, but good enough for a PB.

A Sidenote

So, we now know that a steady pace is what you should aim for when you run a marathon. Starting too fast is a risky strategy. As we could see in the previous graph many of the fastest starters in 2010 ended up performing poorly. But what about the fast starters in 2009? Did they learn from their mistakes?

Actually, yes. In this next graph I have singled out the 10 percent of runners that started fastest in 2009 and compared their results in 2009 and 2010.


The setup is the same as in the previous graph. The y-axis is the relative starting pace, the x-axis the finish time compared to the other year (for 2010: 2010/2009; for 2009: 2009/2010). The blue dots represent 2009, the green ones 2010.

Two conclusions can be drawn:

  • The majority of the runners started slower in 2010 (or they were able to keep the pace better).
  • Most of the runners improved their times (apparently they had learnt something).

Want to explore the data yourself?


Eurovision Song Contest voting data mapped

It is time for the European championship of neighbour voting again, that is, the Eurovision Song Contest. Last weekend I came across a great dataset with all the entries since 1998, including voting data. I wrote a Ruby script that reshaped the list of entries into nodes and links, which made it possible to do a network analysis (a rough sketch of that reshaping step follows the notes below). With a bit of Excel magic I managed to put together an interactive Protovis visualization (opens in a new window):

Click to open the interactive visualization.

A few things to note:

  • Displayed here are the links between countries that, on average, give each other the highest points.
  • I have filtered out links with a count of two or less. In other words: a country must have received points from another country at least three times to get a link. That means you won’t find a country like Cyprus in the network.
  • You need an updated browser to view the visualization.
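
The reshaping step in the Ruby script is conceptually simple. A rough sketch of it, where the row layout is an assumption about how the voting data is structured rather than the actual dataset:

# From flat voting rows to nodes and links.
# Assumes rows like { :year => 2004, :from => "Sweden", :to => "Norway", :points => 12 }.
def to_network(rows, min_count = 3)
  nodes = rows.flat_map { |r| [r[:from], r[:to]] }.uniq

  links = rows.group_by { |r| [r[:from], r[:to]] }.map do |(from, to), given|
    {
      :source => from,
      :target => to,
      :count  => given.length,
      :avg    => given.map { |r| r[:points] }.inject(:+) / given.length.to_f
    }
  end

  # Drop pairs that have exchanged points fewer than three times.
  [nodes, links.select { |l| l[:count] >= min_count }]
end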

Want to build your own visualization?