6 Degrees Of Wikipedia



Six Degrees of Wikipedia is a side project I've been sporadically hacking on over the past few years. It was an interesting technical challenge and it's fun to play with the end result. Following our example of “6 Degrees of Wikipedia” in lecture, construct a graph where vertices are wikipedia pages and edges represent links between wikipedia pages. Your graph should have at least 10 vertices; at least 1 cycle; and vertices for “Global positioning system”, “Equator” and “Coordinate system” https://en.wikipedia. Most of us are familiar with the concept of six degrees of separation - the idea is that anyone in the planet can be connected to anyone else in just six steps. So through just five other people, you're effectively connected to the Queen of England, Tom Cruise, or even a Mongolian sheep herder.

This may or may not have come up recently due to my circumstances; I don’t think me blogging about my efforts is going to significantly improve anyone else’s chances when it comes to the moment.

The idea in principal is dead simple: Given a page on Wikipedia how many steps/links/pages does it take to get to a given second page?

We don’t care how we get there, there is no need to record the route, just count how many steps. The only other rule is you have to go from A -> B, you can’t start at both ends and converge somewhere in the middle, i.e. A -> C <- B.

I really thought I’d have this nailed. I’ve done so much with Ruby and Mechanize in the past and this was an ideal fit. But I found this quite a humbling experience. For whatever reason (could have perhaps been the cat relentlessly pawing at the door) I lost my cool in the last thirty minutes of the ninety minutes allotted time, shut down a bit and stopped thinking/communicating out loud, and just couldn’t get my brain to work: I knew what I was trying to do, but couldn’t even communicate that to myself, let alone anyone else.

Here’s exactly what I had when time was up:

Wikipedia6 degrees of separation wikipedia game

Not pretty is it? Doesn’t work, at all. It’s more like a collection of random thoughts. Like I said: a humbling experience. But how it nagged me. I couldn’t leave it like that when I knew I had the right idea somewhere in my brain.

This is what I had a few hours later:

Which worked in principal, but was extremely slow and memory inefficient. I had to kill it before I got a result as it was using too much of the 1GB total I had on Linode.

You can see (maybe) what I was trying to with my initial attempt, i.e:

  1. Keeping track of the previous levels urls
  2. Keeping track of the current levels urls
  3. Keeping track of urls checked so time isn’t wasted checking them again

The current level becomes the next previous level when we reach the end, etc.

You can also see bits that I knew were a problem and that were inefficient, but that I didn’t really care about because I was sure I could figure them out later if need be (I.e. I didn’t see them as the main problem):

6 Degrees Of Separation Wikipedia Game

  • Filtering out irrelevant links. I.e. all the header and footer links on a Wikipedia page which aren’t anything to do with the article itself.
  • Checking all the links at the end of a step - slower than checking on the fly.

I decided to use Redis as a more memory efficient alternative to Ruby arrays for my final attempt for that day (by that point it was purely an effort for my own self-esteem). Redis is really awesome, by the way:

This was good enough to get it to run reasonably ok on the resources I had, but I knew improvements could still be made. That would have to wait for another day though as it was bedtime; I hadn’t worked on this constantly that evening, I’d had other stuff to do.

Next, I figured out what I’d said to myself in a comment and managed to used Redis’s built in databases for both the previous and current levels meaning I could easily switch between them. I also made it command line callable to make it a bit easier to use.

6 Degrees Of Wikipedia

The next attempt was to use threading. I realised that, since I would have concurrent threads going on, I couldn’t use multiple Redis databases as each thread could override the database selection made in another thread and things would get very messy. I decided to use Redis Sets instead in just one database. The mechanize bit became a separate function to make threading it easier - I.e. I could extend this to three instances/threads easily. I also added in better feedback (counting down urls remaining).

In making the threaded version I discovered the need for better error handling, e.g. checks for images and checks for a successful page load: On a single thread I never had failure of the program to work; However, after adding threads (before I put in the page load check) it would blitz through, but not always find the result. I.e. some page loads must have failed, but been lost in the begin/rescue. This approach made sure to re-add the page to the list to be checked if the page failed to load.

I also realised I had lied to myself when I’d commented “No way not to avoid an array at some point”.

However, GIL is still a thing so threading was not providing any performance benefits which prompted a fairly trivial switch from threads to processes instead. This represents the final state of the program; Obviously further improvements are possible, but I was happy enough with everything in principal.

I decided to chuck some resources at the processes version and see if I could speed things up. The original task I had been set was to go from the Wikipedia Ruby page to the David Bowie page. But whilst developing this program I checked for a known single step of Ruby to Metastable (The Ruby page links to Diamond, which links to Metastable).

On a two-core system:

On a four-core system:

So the additional processes are definitely helping. Mind you, doing six levels deep would be crazy on a single machine as it turns out David Bowie is only two levels deep and on a four-core with four processes this still took: 108m6.189/173m5.299/7m24.900.

Six Degrees of Wikipedia

In class, we have discussed about the six degrees of separation, the theory that everyone in the world is separated by a maximum of six connections. Recently, a friend introduced me to a game called “The Wiki Game”. Each round, you are given a start page and a goal page on Wikipedia. The challenge is to navigate from the start page to the goal page only by clicking links in the article. Game modes include competing to get to the destination page the fastest or with the fewest clicks. This game closely relates to the topic of “Six Degrees of Wikipedia” – studying the relationships between Wikipedia pages.

Degrees

What are some topics that have no obvious connections, but are separated by only a few links? For example, it only take two links to go from the article on xkcd (a popular webcomic) to the article about detergent (xkcd -> Golden Gate Park -> Detergent).

There is a hypothesis that if you click the first link in the main text of a Wikipedia article, and then repeat, you will eventually get to the article on Philosophy. As of May 2011, there are around 95% articles that follow this hypothesis. The remaining ones either lead to broken links, articles with no Wikipedia links or cycles. There is a website called Xefer that visually maps out the paths that leads you to Philosophy. The most prevalent theory explaining this phenomenon is that clicking the first link of a Wikipedia page usually moves up a classification chain, because the Wikipedia Style Guide suggests starting the lead section of the article by defining the topic of the article. Therefore, clicking on the first link leads to a broader subjects, such as Mathematics, Science, Language and eventually Philosophy.

6 Degrees Of Wikipedia

On the other hand, you may be interested in the distance between the two Wikipedia pages that are the furthest apart. However, the results are not as interesting as expected. They happen to be parts of the List of Minor Planets, in which each of the pages are only linked to the previous and next parts of the big list.

For a more hands on experience with the “Six Degrees of Wikipedia”, you may want to check out this assignment from a Social Networks class at Princeton.

6 Degrees Of Wikipedia Game

Sources:

Comments

Leave a Reply

Blogging Calendar

Longest 6 Degrees Of Wikipedia

October 2015
MTWTFSS
1234
567891011
12131415161718
19202122232425
262728293031