Wikipedia: All Roads Lead to ... Philosophy? (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 20 Nov 2021

Wikipedia: All Roads Lead to ... Philosophy?

At a recent LUG meeting, we were talking about various uses for web scraping, and someone brought up a Wikipedia game: start on any page, click on the first real link, then repeat on the page that comes up. The claim is that this chain always gets to Wikipedia's page on Philosophy.

We tried a few rounds, and sure enough, every page we tried did eventually get to Philosophy, usually via languages, which leads to communication, then discipline, action, intention, mental, thought, idea, and finally philosophy.

It's a perfect game for a discussion of scraping. It should be an easy exercise to write a scraper to do this, right?

Well, in that charming way of real-world programming, it's a trivial exercise to fetch and scrape a Wikipedia page and find the first <a href=""> ... but it quickly inflates as you try to define the real "first link". Obviously you don't want the links in the sidebar (the first link is always "/wiki/Wikipedia:Protection_policy#pending"), or disambiguation links, or links to numbered footnotes, or ... you get the idea.

Which makes it a perfect scraping exercise, because web scraping, in the real world, almost always becomes a question of which parts of the page are important, and how to find the most reliable way of identifying them.
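
To make the gap between "first <a href>" and "first real link" concrete, here's a minimal BeautifulSoup sketch. It is not the code from this post: the sample HTML is invented for illustration, and the filtering rules (only look inside paragraphs of the "mw-content-text" element, keep only "/wiki/" links, skip namespaced links containing a colon) are assumptions about what counts as "real", not a complete definition.

```python
from bs4 import BeautifulSoup

# Invented sample markup, loosely shaped like a Wikipedia page.
# The element ids and classes here are assumptions for illustration.
SAMPLE_HTML = """
<html><body>
<div id="mw-navigation">
  <a href="/wiki/Wikipedia:Protection_policy#pending">lock</a>
</div>
<div id="mw-content-text">
  <div class="hatnote">For other uses, see
    <a href="/wiki/Horse_(disambiguation)">Horse (disambiguation)</a>.</div>
  <p>The horse<sup><a href="#cite_note-1">[1]</a></sup> is a
     <a href="/wiki/Domestication">domesticated</a> mammal.</p>
</div>
</body></html>
"""

def naive_first_link(html):
    """Return the href of the very first <a> in the page, wanted or not."""
    soup = BeautifulSoup(html, "html.parser")
    a = soup.find("a", href=True)
    return a["href"] if a is not None else None

def first_article_link(html):
    """Return the first link that looks like an article link in body text."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="mw-content-text")
    if content is None:
        content = soup
    # Only look inside paragraphs, which skips sidebars and hatnotes.
    for para in content.find_all("p"):
        for a in para.find_all("a", href=True):
            href = a["href"]
            if not href.startswith("/wiki/"):
                continue  # footnote anchors, external links
            if ":" in href:
                continue  # namespaced pages such as Wikipedia: or File:
            return href
    return None

print(naive_first_link(SAMPLE_HTML))
print(first_article_link(SAMPLE_HTML))
```

On the sample, the naive version returns the protection-policy link, while the filtered version skips the navigation block, the disambiguation note, and the footnote anchor. Every one of those rules is a judgement call that real pages will eventually challenge.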

And it's almost always a value judgement. For instance, if you start on a page like "horse" or "dog", the first real link is "domesticated", which takes you to "Charles Darwin", and on his page, after his name is a list of honorifics: Charles Robert Darwin FRS FRGS FLS FZS. So is FRS the first real link, or should you skip the honorifics and go on to naturalist? I think most people would choose the latter; so you need to look at those honorifics and find something special about them that makes them skippable. Turns out they're inside a <span class="noexcerpt nowraplinks" ...>, so you can ignore them by skipping anything inside such a span.
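
Here's one way the "skip anything inside such a span" rule might look in BeautifulSoup. This is a sketch under assumptions: the HTML snippet is a simplified stand-in for the Darwin page, the hrefs in it are made up, and the helper names are hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the opening of the Darwin article.
# The hrefs are invented; only the span class comes from the post.
DARWIN_LIKE_HTML = """
<p><b>Charles Robert Darwin</b>
<span class="noexcerpt nowraplinks"><a href="/wiki/FRS">FRS</a>
<a href="/wiki/FRGS">FRGS</a></span>
was an English <a href="/wiki/Naturalist">naturalist</a>.</p>
"""

def inside_skipped_span(tag, skip_class="noexcerpt"):
    """True if any ancestor of tag is a <span> carrying skip_class."""
    for parent in tag.parents:
        if parent.name == "span" and skip_class in parent.get("class", []):
            return True
    return False

def first_link_skipping_honorifics(html):
    """Return the first href that is not nested in a skippable span."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        if not inside_skipped_span(a):
            return a["href"]
    return None

print(first_link_skipping_honorifics(DARWIN_LIKE_HTML))
```

Walking up through the parents rather than checking only the immediate parent means the rule still holds if the link is wrapped in extra markup inside the span.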

Anyway, it's a fun exercise, and a good introduction to scraping, and Python's BeautifulSoup module (which I was recommending to the folks at the LUG meeting) handles it quite well.

I won't go over the code, but if you want to see it, it's at

The guy who originally brought this up said that at one point he had written code to take the results and run them through graphviz, so you could see the structure of articles visually. That sounded really fun, but I haven't yet managed to persuade him to send me the code for the graphs. I'll keep trying.
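
In the meantime, here's a rough guess at what such a script could look like. It is not his code: it's a minimal sketch that turns a list of chains into Graphviz DOT text, with the function name and the sample chains (taken from the horse and dog example above) chosen for illustration.

```python
def chains_to_dot(chains):
    """Build a DOT digraph with one edge per distinct hop in the chains.

    Note: titles containing double quotes would need escaping;
    this sketch does not handle that.
    """
    edges = []
    seen = set()
    for chain in chains:
        # Pair each page with the page it led to.
        for src, dst in zip(chain, chain[1:]):
            if (src, dst) not in seen:
                seen.add((src, dst))
                edges.append((src, dst))
    lines = ["digraph wikipedia {"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

chains = [
    ["horse", "domesticated", "Charles Darwin"],
    ["dog", "domesticated", "Charles Darwin"],
]
print(chains_to_dot(chains))
```

Saving the output to a file such as chains.dot and running "dot -Tpng chains.dot -o chains.png" should render it. Because shared hops are emitted only once, chains that merge show up as branches joining a common trunk.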


This all happened maybe a month and a half ago. As far as I know, the original game dates back several years. But as I was writing up this article, I discovered something amusing: it's no longer true that Wikipedia articles lead to Philosophy!

For instance, here's one run:

$ horse
  0 horse         
  1 domesticated  
  2 Charles Darwin
  3 naturalist    
  4 organisms     
  5 biology       
  6 scientific    
  7 Latin         
  8 classical language
  9 language      
 10 communication 
 11 Latin         

$ "python language"
  0 python_language
  1 interpreted   
  2 computer science
  3 computation   
  4 calculation   
  5 arithmetical  
  6 Greek         
  7 Greek language
  8 Modern Greek  
  9 dialects      
 10 Latin         
 11 classical language
 12 language      
 13 communication 

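All articles still pass through the same classical language and communication articles they did before, but those articles no longer lead to philosophy. Instead, communication now links back to Latin, producing an infinite loop: Latin, classical language, language, communication, and back to Latin.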
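
A chain follower has to notice this kind of cycle or it will never stop. Here's a minimal sketch of one way to do that, assuming a hypothetical next_title callable that returns the first real link for a page. In the example it's backed by a dict copied from the horse run above, rather than by live scraping.

```python
def follow_chain(start, next_title, max_steps=100):
    """Follow first links from start until a page repeats.

    Returns (chain, loop_start), where loop_start is the index in chain
    at which the repeated page first appeared, or None if the chain hit
    a dead end or max_steps without repeating.
    """
    chain = [start]
    seen = {start}
    while len(chain) <= max_steps:
        nxt = next_title(chain[-1])
        if nxt is None:
            return chain, None  # dead end: no usable link
        chain.append(nxt)
        if nxt in seen:
            return chain, chain.index(nxt)  # found the loop
        seen.add(nxt)
    return chain, None

# Link texts copied from the horse run; not fetched live.
LINKS = {
    "horse": "domesticated",
    "domesticated": "Charles Darwin",
    "Charles Darwin": "naturalist",
    "naturalist": "organisms",
    "organisms": "biology",
    "biology": "scientific",
    "scientific": "Latin",
    "Latin": "classical language",
    "classical language": "language",
    "language": "communication",
    "communication": "Latin",
}

chain, loop_start = follow_chain("horse", LINKS.get)
for i, title in enumerate(chain):
    print(f"{i:3d} {title}")
if loop_start is not None:
    print("loop:", " -> ".join(chain[loop_start:]))
```

Keeping a set of pages already seen makes the repeat check cheap, and max_steps is a backstop in case the chain wanders for a long time without repeating.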

Some articles do still pass through philosophy on their way to communication:

$ arduino
  0 arduino       
  1 open-source hardware
  2 artifacts     
  3 use cases     
  4 software      
  5 engineering   
  6 scientific principles
  7 empirical     
  8 proposition   
  9 logic         
 10 truth         
 11 fact          
 12 experience    
 13 conscious     
 14 sentience     
 15 feelings      
 16 psychological states
 17 mind          
 18 thought       
 19 idea          
 20 philosophy    
 21 Greek         
 22 Modern Greek  
 23 dialects      
 24 Latin         
 25 classical language
 26 language      
 27 communication 

It's fun how this changed in only a couple of months after apparently being stable for years. And it's good to know that all Wikipedia paths now lead, not to Philosophy, but to Communication. Somehow that fits my world view a little better.

[ 19:31 Nov 20, 2021 ]