Finding orphaned files on websites (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Tue, 21 Apr 2015

Finding orphaned files on websites

I recently took over a website that's been neglected for quite a while. As well as some bad links, I noticed a lot of old files, files that didn't seem to be referenced by any of the site's pages. Orphaned files.

So I went searching for a link checker that also finds orphans. I figured that would be easy. It's something every web site maintainer needs, right? I've gotten by without one for my own website, but I know there are some bad links and orphans there and I've often wanted a way to find them.

An intensive search turned up only one possibility: linklint, which has a -orphan flag. Great! But, well, not really: after a few hours of fiddling with options, I couldn't find any way to make it actually find orphans. Either you run it on a http:// URL, and it says it's searching for orphans but didn't find any (because it ignors any local directory you specify); or you can run it just on a local directory, in which case it finds a gazillion orphans that aren't actually orphans, because they're referenced by files generated with PHP or other web technology. Plus it flags all the bad links in all those supposed orphans, which get in the way of finding the real bad links you need to worry about.

I tried asking on a couple of technical mailing lists and IRC channels. I found a few people who had managed to use linklint, but only by spidering an entire website to local files (thus getting rid of any server side dependencies like PHP, CGI or SSI) and then running linklint on the local directory. I'm sure I could do that one time, for one website. But if it's that much hassle, there's not much chance I'll keep using to to keep websites maintained.

What I needed was a program that could look at a website and local directory at the same time, and compare them, flagging any file that isn't referenced by anything on the website. That sounded like it would be such a simple thing to write.

So, of course, I had to try it. This is a tool that needs to exist -- and if for some bizarre reason it doesn't exist already, I was going to remedy that.

Naturally, I found out that it wasn't quite as easy to write as it sounded. Reconciling a URL like "http://mysite.com/foo/bar.html" or "../asdf.html" with the corresponding path on disk turned out to have a lot of twists and turns.

But in the end I prevailed. I ended up with a script called weborphans (on github). Give it both a local directory for the files making up your website, and the URL of that website, for instance:

$ weborphans /var/www/ http://localhost/

It's still a little raw, certainly not perfect. But it's good enough that I was able to find the 10 bad links and 606 orphaned files on this website I inherited.

Tags: , ,
[ 14:55 Apr 21, 2015    More tech/web | permalink to this entry | comments ]
(Commenting requires Javascript from ShallowSky.com and Disqus.com, and a cookie from Disqus.com.)
blog comments powered by Disqus