Wikipedia: All Roads Lead to ... Philosophy? (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 20 Nov 2021

At a recent LUG meeting, we were talking about various uses for web scraping, and someone brought up a Wikipedia game: start on any page, click on the first real link, then repeat on the page that comes up. The claim is that this chain always gets to Wikipedia's page on Philosophy.

We tried a few rounds, and sure enough, every page we tried eventually got to Philosophy, usually via languages, which leads to communication, then discipline, action, intention, mental, thought, idea, and finally philosophy.

It's a perfect game for a discussion of scraping. It should be an easy exercise to write a scraper to do this, right?

Well, in that charming way of real-world programming, it's a trivial exercise to fetch and scrape a Wikipedia page and find the first <a href=""> ... but it quickly inflates as you try to define the real "first link". Obviously you don't want the links in the sidebar (the first link is always "/wiki/Wikipedia:Protection_policy#pending"), or disambiguation links, or links to numbered footnotes, or ... you get the idea.
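To make the gap between "first link" and "first real link" concrete, here's a minimal sketch using BeautifulSoup. It is not the code from wikifollow.py. The HTML is a made-up, heavily simplified stand-in for a Wikipedia page, and the id and class names (mw-content-text, hatnote, reference) are my assumptions about Wikipedia's markup, which could change at any time.

```python
from bs4 import BeautifulSoup

# Made-up, simplified stand-in for a Wikipedia page (assumption, not real markup).
SAMPLE = """
<div id="mw-navigation"><a href="/wiki/Main_Page">Main page</a></div>
<div id="mw-content-text">
  <div class="hatnote">For other uses, see
    <a href="/wiki/Horse_(disambiguation)">Horse (disambiguation)</a>.</div>
  <p>The horse<sup class="reference"><a href="#cite_note-1">[1]</a></sup> is a
     <a href="/wiki/Domestication">domesticated</a> mammal.</p>
</div>
"""

def naive_first_link(html):
    """Return the href of the very first <a> on the page, wherever it is."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", href=True)
    return link["href"] if link else None

def first_paragraph_link(html):
    """Return the first /wiki/ link found inside a <p> in the content area."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="mw-content-text")  # assumed id of the article body
    if content is None:
        return None
    for para in content.find_all("p"):
        for link in para.find_all("a", href=True):
            # Footnote links look like "#cite_note-1", so this test skips them.
            if link["href"].startswith("/wiki/"):
                return link["href"]
    return None

print(naive_first_link(SAMPLE))      # /wiki/Main_Page
print(first_paragraph_link(SAMPLE))  # /wiki/Domestication
```

Looking only inside paragraphs skips the navigation and the disambiguation note in this sample, but it's only one of several plausible definitions of "first real link".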

Which makes it a perfect scraping exercise, because web scraping, in the real world, almost always becomes a question of which parts of the page are important, and how to find the most reliable way of identifying them.

And it's almost always a value judgement. For instance, if you start on a page like "horse" or "dog", the first real link is "domesticated", which takes you to "Charles Darwin", and on his page, after his name is a list of honorifics: Charles Robert Darwin FRS FRGS FLS FZS. So is FRS the first real link, or should you skip the honorifics and go on to naturalist? I think most people would choose the latter; so you need to look at those honorifics and find something special about them that makes them skippable. Turns out they're inside a <span class="noexcerpt nowraplinks" ...>, so you can ignore them by skipping anything inside such a span.
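Skipping those honorifics might look something like the following sketch. Again, this isn't necessarily how wikifollow.py does it. The HTML snippet is a made-up approximation of the start of the Darwin page, and the function name is just for illustration.

```python
from bs4 import BeautifulSoup

# Made-up approximation of the opening of the Charles Darwin article.
SAMPLE = """
<p><b>Charles Robert Darwin</b>
<span class="noexcerpt nowraplinks">
  <a href="/wiki/Fellow_of_the_Royal_Society">FRS</a>
  <a href="/wiki/Royal_Geographical_Society">FRGS</a>
</span>
was an English <a href="/wiki/Natural_history#Before_1900">naturalist</a>.</p>
"""

def first_link_outside_noexcerpt(html):
    """Return the first link that is not nested in a span.noexcerpt."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=True):
        # class_ matches any one of the span's classes, so a span with
        # class="noexcerpt nowraplinks" counts as a match.
        if link.find_parent("span", class_="noexcerpt") is None:
            return link["href"]
    return None

print(first_link_outside_noexcerpt(SAMPLE))  # /wiki/Natural_history#Before_1900
```

Every rule like this is tied to whatever markup Wikipedia happens to use today, so expect to revisit it.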

Anyway, it's a fun exercise, and a good introduction to scraping, and Python's BeautifulSoup module (which I was recommending to the folks at the LUG meeting) handles it quite well.

I won't go over the code, but if you want to see it, it's at wikifollow.py.

The guy who originally brought this up said that at one point he had written code to take the results and run them through graphviz, so you could see the structure of articles visually. That sounded really fun, but I haven't yet managed to persuade him to send me the code for the graphs. I'll keep trying.
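I haven't seen his code, so this is purely my own guess at the general idea: turn each chain of article titles into edges and write them out in Graphviz's DOT format. The function name and the sample chains are made up for illustration, and titles containing double quotes would need escaping.

```python
def chains_to_dot(chains):
    """Build DOT text for a directed graph from lists of article titles."""
    edges = []
    seen = set()
    for chain in chains:
        # Pair each title with the one that follows it in the chain.
        for src, dst in zip(chain, chain[1:]):
            if (src, dst) not in seen:  # draw shared edges only once
                seen.add((src, dst))
                edges.append((src, dst))
    lines = ["digraph wikifollow {"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

# Two hypothetical chains that merge at "Latin".
example = chains_to_dot([
    ["Latin", "Classical language", "Language"],
    ["Dialect", "Latin", "Classical language"],
])
print(example)
```

Saving that output to a file such as chains.dot and running "dot -Tpng chains.dot -o chains.png" should produce an image where chains that merge share nodes.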

Ch-ch-ch-changes

This all happened maybe a month and a half ago. As far as I know, the original game dates back several years. But as I was writing up this article, I discovered something amusing: it's no longer true that Wikipedia articles lead to Philosophy!

For instance, here's one run:

$ wikifollow.py horse
  0 horse                   https://en.wikipedia.org/wiki/horse
  1 domesticated            https://en.wikipedia.org/wiki/Domestication
  2 Charles Darwin          https://en.wikipedia.org/wiki/Charles_Darwin
  3 naturalist              https://en.wikipedia.org/wiki/Natural_history#Before_1900
  4 organisms               https://en.wikipedia.org/wiki/Organism
  5 biology                 https://en.wikipedia.org/wiki/Biology
  6  scientific             https://en.wikipedia.org/wiki/Science
  7 Latin                   https://en.wikipedia.org/wiki/Latin_language
  8 classical language      https://en.wikipedia.org/wiki/Classical_language
  9 language                https://en.wikipedia.org/wiki/Language
 10 communication           https://en.wikipedia.org/wiki/Communication
 11 Latin                   https://en.wikipedia.org/wiki/Latin
LOOP DETECTED!

$ wikifollow.py "python language"
  0 python_language         https://en.wikipedia.org/wiki/python_language
  1 interpreted             https://en.wikipedia.org/wiki/Interpreted_language
  2 computer science        https://en.wikipedia.org/wiki/Computer_science
  3 computation             https://en.wikipedia.org/wiki/Computation
  4 calculation             https://en.wikipedia.org/wiki/Calculation
  5 arithmetical            https://en.wikipedia.org/wiki/Arithmetic
  6 Greek                   https://en.wikipedia.org/wiki/Ancient_Greek
  7 Greek language          https://en.wikipedia.org/wiki/Greek_language
  8 Modern Greek            https://en.wikipedia.org/wiki/Modern_Greek
  9 dialects                https://en.wikipedia.org/wiki/Dialect
 10 Latin                   https://en.wikipedia.org/wiki/Latin
 11 classical language      https://en.wikipedia.org/wiki/Classical_language
 12 language                https://en.wikipedia.org/wiki/Language
 13 communication           https://en.wikipedia.org/wiki/Communication
LOOP DETECTED!

All the articles I tried still go through the same classical language and communication articles they did before, but those articles no longer lead to philosophy. Instead, they lead into an infinite loop.
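Detecting a loop like that only takes a set of pages already visited. Here's a minimal sketch, not the actual wikifollow.py logic: next_link is a hypothetical callback that returns the next page for a given page, and the toy table stands in for real fetching, using the loop from the runs above.

```python
def follow_chain(start, next_link, max_steps=100):
    """Follow next_link from start until a page repeats or the chain ends.

    Returns (chain, loop_page); loop_page is None if no loop was found.
    """
    chain = [start]
    seen = {start}
    current = start
    while len(chain) <= max_steps:
        nxt = next_link(current)
        if nxt is None:   # no usable link on this page
            return chain, None
        if nxt in seen:   # already visited: we are going in circles
            return chain, nxt
        chain.append(nxt)
        seen.add(nxt)
        current = nxt
    return chain, None    # gave up after max_steps

# Toy stand-in for Wikipedia, based on the runs shown above.
TOY_LINKS = {
    "Dialect": "Latin",
    "Latin": "Classical_language",
    "Classical_language": "Language",
    "Language": "Communication",
    "Communication": "Latin",
}

chain, loop_page = follow_chain("Dialect", TOY_LINKS.get)
print(chain)      # ['Dialect', 'Latin', 'Classical_language', 'Language', 'Communication']
print(loop_page)  # Latin
```

One caveat: a check keyed on URLs as written can miss redirects, which may be why Latin shows up twice in the horse run, once as Latin_language and once as Latin.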

Some articles do still pass through philosophy on their way to communication:

$ wikifollow.py arduino
  0 arduino                 https://en.wikipedia.org/wiki/arduino
  1 open-source hardware    https://en.wikipedia.org/wiki/Open-source_hardware
  2 artifacts               https://en.wikipedia.org/wiki/Artifact_(software_development)
  3 use cases               https://en.wikipedia.org/wiki/Use_case
  4 software                https://en.wikipedia.org/wiki/Software_engineering
  5 engineering             https://en.wikipedia.org/wiki/Engineering
  6 scientific principles   https://en.wikipedia.org/wiki/Scientific_method
  7 empirical               https://en.wikipedia.org/wiki/Empirical_evidence
  8 proposition             https://en.wikipedia.org/wiki/Proposition
  9 logic                   https://en.wikipedia.org/wiki/Logic
 10 truth                   https://en.wikipedia.org/wiki/Truth
 11 fact                    https://en.wikipedia.org/wiki/Fact
 12 experience              https://en.wikipedia.org/wiki/Experience
 13 conscious               https://en.wikipedia.org/wiki/Conscious
 14 sentience               https://en.wikipedia.org/wiki/Sentience
 15 feelings                https://en.wikipedia.org/wiki/Emotion
 16 psychological states    https://en.wikipedia.org/wiki/Mental_state
 17 mind                    https://en.wikipedia.org/wiki/Mind
 18 thought                 https://en.wikipedia.org/wiki/Thought
 19 idea                    https://en.wikipedia.org/wiki/Idea
 20 philosophy              https://en.wikipedia.org/wiki/Philosophy
 21 Greek                   https://en.wikipedia.org/wiki/Greek_language
 22 Modern Greek            https://en.wikipedia.org/wiki/Modern_Greek
 23 dialects                https://en.wikipedia.org/wiki/Dialect
 24 Latin                   https://en.wikipedia.org/wiki/Latin
 25 classical language      https://en.wikipedia.org/wiki/Classical_language
 26 language                https://en.wikipedia.org/wiki/Language
 27 communication           https://en.wikipedia.org/wiki/Communication
LOOP DETECTED!

It's fun how this changed in only a couple of months after apparently being stable for years. And it's good to know that all Wikipedia paths now lead, not to Philosophy, but to Communication. Somehow that fits my world view a little better.

