Wikipedia: All Roads Lead to ... Philosophy?
At a recent LUG meeting, we were talking about various uses for web scraping, and someone brought up a Wikipedia game: start on any page, click on the first real link, then repeat on the page that comes up. The claim is that this chain always gets to Wikipedia's page on Philosophy.
We tried a few rounds, and sure enough, every page we tried did eventually get to Philosophy, usually via language, which goes to communication, then to discipline, action, intention, mental, thought, idea, and finally philosophy.
It's a perfect game for a discussion of scraping. It should be an easy exercise to write a scraper to do this, right?
Well, in that charming way of real-world programming, it's a trivial exercise to fetch and scrape a Wikipedia page and find the first <a href=""> ... but it quickly inflates as you try to define the real "first link". Obviously you don't want the links in the sidebar (the first link is always "/wiki/Wikipedia:Protection_policy#pending"), or disambiguation links, or links to numbered footnotes, or ... you get the idea.
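To make the problem concrete, here is a minimal sketch of the naive approach: grab the first <a href> anywhere in the page. It is not the code from wikifollow.py. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, and the SAMPLE page and the names FirstLinkParser and naive_first_link are made up for illustration.

```python
from html.parser import HTMLParser


class FirstLinkParser(HTMLParser):
    """Remember the href of the first <a> tag seen anywhere in the page."""

    def __init__(self):
        super().__init__()
        self.first_href = None

    def handle_starttag(self, tag, attrs):
        # Only record the very first link that has an href attribute.
        if tag == "a" and self.first_href is None:
            href = dict(attrs).get("href")
            if href:
                self.first_href = href


def naive_first_link(html):
    """Return the first href in the document, wherever it appears."""
    parser = FirstLinkParser()
    parser.feed(html)
    return parser.first_href


# Hypothetical, heavily simplified stand-in for a Wikipedia page:
# a sidebar link comes before the article text.
SAMPLE = """
<div id="sidebar">
  <a href="/wiki/Wikipedia:Protection_policy#pending">lock</a>
</div>
<div id="content">
  <p>The horse is a <a href="/wiki/Domestication">domesticated</a> mammal.</p>
</div>
"""

print(naive_first_link(SAMPLE))
```

On this sample the naive version returns the protection policy link rather than the one in the article text, which is exactly why "first link" needs a more careful definition.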
Which makes it a perfect scraping exercise, because web scraping, in the real world, almost always becomes a question of which parts of the page are important, and how to find the most reliable way of identifying them.
And it's almost always a value judgement. For instance, if you start
on a page like "horse" or "dog", the first real link is "domesticated",
which takes you to "Charles Darwin", and on his page, after his name
is a list of honorifics: Charles Robert Darwin FRS FRGS FLS FZS.
So is FRS the first real link, or should you skip the honorifics
and go on to naturalist? I think most people would choose
the latter; so you need to look at those honorifics and find
something special about them that makes them skippable.
Turns out they're inside a <span class="noexcerpt nowraplinks" ...>, so you can ignore them by skipping anything inside such a span.
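Skipping everything inside such a span might look something like the sketch below. Again, this is an illustration using the standard library's html.parser, not the BeautifulSoup code in wikifollow.py. The class name, the function name, and the sample markup (including the honorific link targets) are assumptions; only the span's class and the naturalist link come from the page as described here.

```python
from html.parser import HTMLParser


class SkippingLinkParser(HTMLParser):
    """Find the first link that is not inside a span with a given class."""

    def __init__(self, skip_class="noexcerpt"):
        super().__init__()
        self.skip_class = skip_class
        self.skip_depth = 0      # > 0 while inside a skipped span
        self.first_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.skip_depth:
            # Inside a skipped span: count nested spans so we know
            # which closing </span> ends the skipped region.
            if tag == "span":
                self.skip_depth += 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and self.skip_class in classes:
            self.skip_depth = 1
            return
        if tag == "a" and self.first_href is None and attrs.get("href"):
            self.first_href = attrs["href"]

    def handle_endtag(self, tag):
        if self.skip_depth and tag == "span":
            self.skip_depth -= 1


def first_real_link(html, skip_class="noexcerpt"):
    parser = SkippingLinkParser(skip_class)
    parser.feed(html)
    return parser.first_href


# Simplified stand-in for the start of the Charles Darwin article.
# The honorific hrefs are placeholders, not the real link targets.
SAMPLE = (
    '<p><b>Charles Robert Darwin</b> '
    '<span class="noexcerpt nowraplinks">'
    '<a href="/wiki/FRS_placeholder">FRS</a> '
    '<span><a href="/wiki/FRGS_placeholder">FRGS</a></span>'
    '</span> was an English '
    '<a href="/wiki/Natural_history#Before_1900">naturalist</a>.</p>'
)

print(first_real_link(SAMPLE))
```

The depth counter is there because spans can nest: without it, the first inner </span> would end the skipped region too early.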
Anyway, it's a fun exercise, and a good introduction to scraping, and Python's BeautifulSoup module (which I was recommending to the folks at the LUG meeting) handles it quite well.
I won't go over the code, but if you want to see it, it's at wikifollow.py.
The guy who originally brought this up said that at one point he had written code to take the results and run them through graphviz, so you could see the structure of articles visually. That sounded really fun, but I haven't yet managed to persuade him to send me the code for the graphs. I'll keep trying.
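In the meantime, here is a rough guess at what the graphviz step could look like. It is not his code: chains_to_dot is a hypothetical helper that turns lists of page titles into DOT text, assuming the titles contain no double quotes. The two sample chains are just the start of the horse and dog paths mentioned above.

```python
def chains_to_dot(chains, graph_name="wikifollow"):
    """Turn lists of page titles into a Graphviz DOT digraph.

    Each chain is a list of titles in the order they were visited.
    Edges shared by several chains are only emitted once.
    Assumes titles contain no double-quote characters.
    """
    seen = set()
    lines = [f"digraph {graph_name} {{"]
    for chain in chains:
        # Pair each page with the page that follows it.
        for src, dst in zip(chain, chain[1:]):
            if (src, dst) not in seen:
                seen.add((src, dst))
                lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


chains = [
    ["Horse", "Domestication", "Charles_Darwin"],
    ["Dog", "Domestication", "Charles_Darwin"],
]
print(chains_to_dot(chains))
```

Saving that output to a file such as chains.dot and running "dot -Tpng chains.dot -o chains.png" should produce an image where the two chains merge at Domestication.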
Ch-ch-ch-changes
This all happened maybe a month and a half ago. As far as I know, the original game dates back several years. But as I was writing up this article, I discovered something amusing: it's no longer true that Wikipedia articles lead to Philosophy!
For instance, here's one run:
$ wikifollow.py horse
0 horse https://en.wikipedia.org/wiki/horse
1 domesticated https://en.wikipedia.org/wiki/Domestication
2 Charles Darwin https://en.wikipedia.org/wiki/Charles_Darwin
3 naturalist https://en.wikipedia.org/wiki/Natural_history#Before_1900
4 organisms https://en.wikipedia.org/wiki/Organism
5 biology https://en.wikipedia.org/wiki/Biology
6 scientific https://en.wikipedia.org/wiki/Science
7 Latin https://en.wikipedia.org/wiki/Latin_language
8 classical language https://en.wikipedia.org/wiki/Classical_language
9 language https://en.wikipedia.org/wiki/Language
10 communication https://en.wikipedia.org/wiki/Communication
11 Latin https://en.wikipedia.org/wiki/Latin
LOOP DETECTED!

$ wikifollow.py "python language"
0 python_language https://en.wikipedia.org/wiki/python_language
1 interpreted https://en.wikipedia.org/wiki/Interpreted_language
2 computer science https://en.wikipedia.org/wiki/Computer_science
3 computation https://en.wikipedia.org/wiki/Computation
4 calculation https://en.wikipedia.org/wiki/Calculation
5 arithmetical https://en.wikipedia.org/wiki/Arithmetic
6 Greek https://en.wikipedia.org/wiki/Ancient_Greek
7 Greek language https://en.wikipedia.org/wiki/Greek_language
8 Modern Greek https://en.wikipedia.org/wiki/Modern_Greek
9 dialects https://en.wikipedia.org/wiki/Dialect
10 Latin https://en.wikipedia.org/wiki/Latin
11 classical language https://en.wikipedia.org/wiki/Classical_language
12 language https://en.wikipedia.org/wiki/Language
13 communication https://en.wikipedia.org/wiki/Communication
LOOP DETECTED!
All the articles I tried still go through the same classical language and communication articles they did before, but those articles no longer lead to Philosophy; instead, they end in an infinite loop.
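Detecting a loop like this only takes a set of pages already visited. The sketch below is an assumption about how it could be done, not the code in wikifollow.py. follow_chain and its first_link argument are hypothetical names, and the LINKS table is a hand-copied stand-in for real page fetches. It also assumes every page has one canonical title, so "Latin" and "Latin_language" would already have been resolved to the same page.

```python
def follow_chain(start, first_link, max_steps=100):
    """Follow first links from start until a page repeats.

    first_link is any callable mapping a page title to the title of
    its first real link, or None if the page has no usable link.
    Returns (chain, status) where status is "loop", "dead end",
    or "gave up".
    """
    chain = [start]
    seen = {start}
    page = start
    while len(chain) <= max_steps:
        nxt = first_link(page)
        if nxt is None:
            return chain, "dead end"
        chain.append(nxt)
        if nxt in seen:
            # We have been here before: everything from the first
            # occurrence of nxt onward would repeat forever.
            return chain, "loop"
        seen.add(nxt)
        page = nxt
    return chain, "gave up"


# Stand-in for fetching pages, copied from the tail of the horse run.
LINKS = {
    "Biology": "Science",
    "Science": "Latin",
    "Latin": "Classical_language",
    "Classical_language": "Language",
    "Language": "Communication",
    "Communication": "Latin",
}

chain, status = follow_chain("Biology", LINKS.get)
print(" -> ".join(chain))
print(status)
```

The max_steps cap is a safety net for the case where the chain neither loops nor dead-ends within a reasonable number of pages.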
Some articles do still pass through philosophy on their way to communication:
$ wikifollow.py arduino
0 arduino https://en.wikipedia.org/wiki/arduino
1 open-source hardware https://en.wikipedia.org/wiki/Open-source_hardware
2 artifacts https://en.wikipedia.org/wiki/Artifact_(software_development)
3 use cases https://en.wikipedia.org/wiki/Use_case
4 software https://en.wikipedia.org/wiki/Software_engineering
5 engineering https://en.wikipedia.org/wiki/Engineering
6 scientific principles https://en.wikipedia.org/wiki/Scientific_method
7 empirical https://en.wikipedia.org/wiki/Empirical_evidence
8 proposition https://en.wikipedia.org/wiki/Proposition
9 logic https://en.wikipedia.org/wiki/Logic
10 truth https://en.wikipedia.org/wiki/Truth
11 fact https://en.wikipedia.org/wiki/Fact
12 experience https://en.wikipedia.org/wiki/Experience
13 conscious https://en.wikipedia.org/wiki/Conscious
14 sentience https://en.wikipedia.org/wiki/Sentience
15 feelings https://en.wikipedia.org/wiki/Emotion
16 psychological states https://en.wikipedia.org/wiki/Mental_state
17 mind https://en.wikipedia.org/wiki/Mind
18 thought https://en.wikipedia.org/wiki/Thought
19 idea https://en.wikipedia.org/wiki/Idea
20 philosophy https://en.wikipedia.org/wiki/Philosophy
21 Greek https://en.wikipedia.org/wiki/Greek_language
22 Modern Greek https://en.wikipedia.org/wiki/Modern_Greek
23 dialects https://en.wikipedia.org/wiki/Dialect
24 Latin https://en.wikipedia.org/wiki/Latin
25 classical language https://en.wikipedia.org/wiki/Classical_language
26 language https://en.wikipedia.org/wiki/Language
27 communication https://en.wikipedia.org/wiki/Communication
LOOP DETECTED!
It's fun how this changed in only a couple of months after apparently being stable for years. And it's good to know that all Wikipedia paths now lead, not to Philosophy, but to Communication. Somehow that fits my world view a little better.
[ 19:31 Nov 20, 2021 More programming | permalink to this entry | ]