The first thing I do is take the apache log and run webalizer on it, to give me a breakdown of some of the "top" lists.
Of course, I'm extremely interested in the user agent list: which browsers are being used most often? As of last month, the Shallowsky list still has MSIE 6.0 in the lead ... but it's not as big a lead as it used to be, at 56.04%. Mozilla 5.0 (which includes all Gecko- based browsers, as far as I know, including Mozilla, Firefox, Netscape 6 and 7, Camino, etc.) is second with 20.31%. Next are four search engine 'bots, and then we're into the single digit percentages with a couple of old IE versions and Opera.
AvantGo (they're still around?) is number 11 with 0.37% -- interesting. It looks like they're grabbing the Hitchhiker's Guide to the Moon; then there are a bunch of lines like:
sync37.avantgo.com - - [05/Apr/2006:14:29:25 -0700] "GET / HTTP/1.0" 200 4549 "http://www.nineplanets.org/" "Mozilla/4.0 (compatible; AvantGo 6.0; FreeBSD)"and I'm not sure how to read that (nineplanets.org is The Nine Planets, Bill Arnett's excellent and justifiably popular planetary site, and he and I have cross-links, but I'm not sure what that has to do with avantgo and my site). Not that it's a problem: of course, anyone is welcome to read my site on a PDA, via AvantGo or otherwise. I'm just curious.
Amusingly, the last user agent in the top fifteen is GIMP Layers, syndicating this blog.
Another interesting list is the search queries: what search terms did people use which led them to my site? Sometimes that's more interesting than other times: around Christmas, people were searching for "griffith park light show" and ending up at my lame collection of photos from a previous year's light show. I felt so sorry for them: Griffith Park never puts any information on the web so it's impossible to find out what hours and dates the light show will be open, so I know perfectly well why they were googling, and they certainly weren't getting any help from me. I would have put the information there if I'd known -- but I tried to find out and couldn't find it either.
But this month, no one is searching on anything unusual. The top searches leading to my site for the past two months are terms like birds, gimp plugins, linux powerpoint, mini laptops, debian chkconfig, san andreas fault, pandora, hummingbird pictures, fiat x1/9, jupiter's features, linux photo, and a rather large assortment of dirt bike queries. (I have very little dirt bike content on my site, but people must be desperate to find web pages on dirt bikes because those always show up very prominently in the search string list.)
Most popular pages are this blog (maybe just because of RSS readers), the Hitchhiker's Guide to the Moon, and bird photos, with an assortment of other pages covering software, linux tips, assorted photo collections, and, of course, dirt bikes.
That's most of what I can get from webalizer. Now it's time to look at the apache error logs. I have quite a few 404s (missing files). I can clean up some of the obvious ones, and others are coming from external sites I can't do anything about that for some reason link to filenames I deleted seven years ago; but how can I get a list of all the broken internal links on my site, so at least I can fix the errors that are my own fault?
Kathryn on Linuxchix pointed me to dead-links.com, a rather cool site. But it turns out it only looks for broken external links, not internal ones. That's useful, too, just not what I was after this time. Warning: if you try to save the page from firefox, it will start running all over again. You have to copy the content and paste it into a file if you want to save it.
But Kathryn and Val opined that wget was probably the way to go for finding internal links. Turns out wget has an option to delete each file after downloading it, so you can wget a whole site but not actually need to use the local space to duplicate the site. Use this command:
wget --recursive -nd -nv --delete-after --domains=domain.com http://domain.com/ | tee wget.out 2>&1
Now open the resulting file in an editor and search repeatedly for ERROR to find all the broken links. Unfortunately the errors are on a separate line from the filenames they reference, so you can't just use a grep. wget also gets some things wrong: for instance, it tries to download the .class file of a Java applet inside a .jar, then reports an error when the class doesn't exist. (--reject .class might help that.) Still, it's not hard to skip past these errors, and wget does seem to be a fairly good way of finding broken internal links.
There's one more check left to do in the access log. But that's a longer story, and a posting for another day.
[ 21:43 Apr 14, 2006 More tech/web | permalink to this entry | comments ]