Shallow Thoughts : : web

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Tue, 21 Apr 2015

Finding orphaned files on websites

I recently took over a website that's been neglected for quite a while. As well as some bad links, I noticed a lot of old files, files that didn't seem to be referenced by any of the site's pages. Orphaned files.

So I went searching for a link checker that also finds orphans. I figured that would be easy. It's something every web site maintainer needs, right? I've gotten by without one for my own website, but I know there are some bad links and orphans there and I've often wanted a way to find them.

An intensive search turned up only one possibility: linklint, which has a -orphan flag. Great! But, well, not really: after a few hours of fiddling with options, I couldn't find any way to make it actually find orphans. Either you run it on a http:// URL, and it says it's searching for orphans but didn't find any (because it ignors any local directory you specify); or you can run it just on a local directory, in which case it finds a gazillion orphans that aren't actually orphans, because they're referenced by files generated with PHP or other web technology. Plus it flags all the bad links in all those supposed orphans, which get in the way of finding the real bad links you need to worry about.

I tried asking on a couple of technical mailing lists and IRC channels. I found a few people who had managed to use linklint, but only by spidering an entire website to local files (thus getting rid of any server side dependencies like PHP, CGI or SSI) and then running linklint on the local directory. I'm sure I could do that one time, for one website. But if it's that much hassle, there's not much chance I'll keep using to to keep websites maintained.

What I needed was a program that could look at a website and local directory at the same time, and compare them, flagging any file that isn't referenced by anything on the website. That sounded like it would be such a simple thing to write.

So, of course, I had to try it. This is a tool that needs to exist -- and if for some bizarre reason it doesn't exist already, I was going to remedy that.

Naturally, I found out that it wasn't quite as easy to write as it sounded. Reconciling a URL like "http://mysite.com/foo/bar.html" or "../asdf.html" with the corresponding path on disk turned out to have a lot of twists and turns.

But in the end I prevailed. I ended up with a script called weborphans (on github). Give it both a local directory for the files making up your website, and the URL of that website, for instance:

$ weborphans /var/www/ http://localhost/

It's still a little raw, certainly not perfect. But it's good enough that I was able to find the 10 bad links and 606 orphaned files on this website I inherited.

Tags: , ,
[ 14:55 Apr 21, 2015    More tech/web | permalink to this entry | comments ]

Fri, 27 Mar 2015

Hide Google's begging (or any other web content) via a Firefox userContent trick

Lately, Google is wasting space at the top of every search with a begging plea to be my default search engine.

[Google begging: Switch your default search engine to Google] Google already is my default search engine -- that's how I got to that page. But if you don't have persistent Google cookies set, you have to see this begging every time you do a search. (Why they think pestering users is the way to get people to switch to them is beyond me.)

Fortunately, in Firefox you can hide the begging with a userContent trick. Find the chrome directory inside your Firefox profile, and edit userContent.css in that directory. (Create a new file with that name if you don't already have one.) Then add this:

#taw { display: none !important; }

Restart Firefox, do a Google search and the begs should be gone.

In case you have any similar pages where there's pointless content getting in the way and you want to hide it: what I did was to right-click inside the begging box and choose Inspect element. That brings up Firefox's DOM inspector. Mouse over various lines in the inspector and watch what gets highlighted in the browser window. Find the element that highlights everything you want to remove -- in this case, it's a div with id="taw". Then you can write CSS to address that: hide it, change its style or whatever you're trying to do.

You can even use Inspect element to remove elements immediately. That won't help you prevent them from showing up later, but it can be wonderful if you need to use a page that has an annoying blinking ad on it, or a mis-designed page that has images covering the content you're trying to read.

Tags: , ,
[ 08:17 Mar 27, 2015    More tech/web | permalink to this entry | comments ]

Sat, 14 Mar 2015

Making a customized Firefox search plug-in

It's getting so that I dread Firefox's roughly weekly "There's a new version -- do you want to upgrade?" With every new upgrade, another new crucial feature I use every day disappears and I have to spend hours looking for a workaround.

Last week, upgrading to Firefox 36.0.1, it was keyword search: the feature where, if I type something in the location bar that isn't a URL, Firefox would instead search using the search URL specified in the "keyword.URL" preference.

In my case, I use Google but I try to turn off the autocomplete feature, which I find it distracting and unhelpful when typing new search terms. (I say "try to" because complete=0 only works sporadically.) I also add the prefix allintext: to tell Google that I only want to see pages that contain my search term. (Why that isn't the default is anybody's guess.) So I set keyword.URL to: http://www.google.com/search?complete=0&q=allintext%3A+ (%3A is URL code for the colon character).

But after "up"grading to 36.0.1, search terms I typed in the location bar took me to Yahoo search. I guess Yahoo is paying Mozilla more than Google is now.

Now, Firefox has a Search tab under Edit->Preferences -- but that just gives you a list of standard search engines' default searches. It would let me use Google, but not with my preferred options.

If you follow the long discussions in bugzilla, there are a lot of people patting each other on the back about how much easier the preferences window is, with no discussion of how to specify custom searches except vague references to "search plugins". So how do these search plugins work, and how do you make one?

Fortunately a friend had a plugin installed, acquired from who knows where. It turns out that what you need is an XML file inside a directory called searchplugins in your profile directory. (If you're not sure where your profile lives, see Profiles - Where Firefox stores your bookmarks, passwords and other user data, or do a systemwide search for "prefs.js" or "search.json" or "cookies.sqlite" and it should lead you to your profile.)

Once you have one plugin installed, it's easy to edit it and modify it to do anything you want. The XML file looks roughly like this:

<SearchPlugin xmlns="http://www.mozilla.org/2006/browser/search/" xmlns:os="http://a9.com/-/spec/opensearch/1.1/">
<os:ShortName>MySearchPlugin</os:ShortName>
<os:Description>The search engine I prefer to use</os:Description>
<os:InputEncoding>UTF-8</os:InputEncoding>
<os:Image width="16" height="16">data:image/x-icon;base64,ICON GOES HERE</os:Image>
<SearchForm>http://www.google.com/</SearchForm>
<os:Url type="text/html" method="GET" template="https://www.google.com/search">
  <os:Param name="complete" value="0"/>
  <os:Param name="q" value="allintext: {searchTerms}"/>
  <!--os:Param name="hl" value="en"/-->
</os:Url>
</SearchPlugin>

There are four things you'll want to modify. First, and most important, os:Url and os:Param control the base URL of the search engine and the list of parameters it takes. {searchTerms} in one of those Param arguments will be replaced by whatever terms you're searching for. So <os:Param name="q" value="allintext: {searchTerms}"/> gives me that allintext: parameter I wanted.

(The other parameter I'm specifying, <os:Param name="complete" value="0"/>, used to make Google stop the irritating autocomplete every time you try to modify your search terms. Unfortunately, this has somehow stopped working at exactly the same time that I upgraded Firefox. I don't see how Firefox could be causing it, but the timing is suspicious. I haven't been able to figure out another way of getting rid of the autocomplete.)

Next, you'll want to give your plugin a ShortName and Description so you'll be able to recognize it and choose it in the preferences window.

Finally, you may want to modify the icon: I'll tell you how to do that in a moment.

Using your new search plugin

[Firefox search prefs]

You've made all your modifications and saved the file to something inside the searchplugins folder in your Firefox profile. How do you make it your default?

I restarted firefox to make sure it saw the new plugin, though that may not have been necessary. Then Edit->Preferences and click on the Search icon at the top. The menu near the top under Default search engine is what you want: your new plugin should show up there.

Modifying the icon

Finally, what about that icon?

In the plugin XML file I was copying, the icon line looked like:

<os:Image width="16"
height="16">data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAAAAAAA
... many more lines like this then ... ==</os:Image>
So how do I take that and make an image I can customize in GIMP?

I tried copying everything after "base64," and pasting it into a file, then opening it in GIMP. No luck. I tried base64 decoding it (you do this with base64 -d filename >outfilename) and reading it in with GIMP. Still no luck: "Unknown file type".

The method I found is roundabout, but works:

  1. Copy everything inside the tag: data:image/x-icon;base64,AA ... ==
  2. Paste that into Firefox's location bar and hit return. You'll see the icon from the search plugin you're modifying.
  3. Right-click on the image and choose Save image as...
  4. Save it to a file with the extension .ico -- GIMP won't open it without that extension.
  5. Open it in GIMP -- a 16x16 image -- and edit to your heart's content.
  6. File->Export as...
  7. Use the type "Microsoft Windows icon (*.ico)"
  8. Base64 encode the file you just saved, like this: base64 yourfile.ico >newfile
  9. Copy the contents of newfile and paste that into your os:Image line, replacing everything after data:image/x-icon;base64, and before </os:Image>

Whew! Lots of steps, but none of them are difficult. (Though if you're not on Linux and don't have the base64 command, you'll have to find some other way of encoding and decoding base64.)

But if you don't want to go through all the steps, you can download mine, with its lame yellow smiley icon, as a starting point: Google-clean plug-in.

Happy searching! See you when Firefox 36.0.2 comes out and they break some other important feature.

Tags: ,
[ 12:35 Mar 14, 2015    More tech/web | permalink to this entry | comments ]

Mon, 09 Feb 2015

Making flashblock work again; and why HTML5 video doesn't work in Firefox

Back in December, I wrote about Problems with Firefox 35's new deprecation of flash, and a partial solution for Debian. That worked to install a newer version of the flash plug-in on my Debian Linux machine; but it didn't fix the problem that the flashblock program no longer works properly on Firefox 35, so that clicking on the flashblock button does nothing at all.

A friend suggested that I try Firefox's built-in flash blocking. Go to Tools->Add-ons and click on Plug-ins if that isn't the default tab. Under Shockwave flash, choose Ask to Activate.

Unfortunately, the result of that is a link to click, which pops up a dialog that requires clicking a button to dismiss it -- a pointless and annoying extra step. And there's no way to enable flash for just the current page; once you've enabled it for a domain (like youtube), any flash from that domain will auto-play for the remainder of the Firefox session. Not what I wanted.

So I looked into whether there was a way to re-enable flashblock. It turns out I'm not the only one to have noticed the problem with it: the FlashBlock reviews page is full of recent entries from people saying it no longer works. Alas, flashblock seems to be orphaned; there's no comment about any of this on the main flashblock page, and the links on that page for discussions or bug reports go to a nonexistent mailing list.

But fortunately there's a comment partway down the reviews page from user "c627627" giving a fix.

Edit your chrome/userContent.css in your Firefox profile. If you're not sure where your profile lives, Mozilla has a poorly written page on it here, Profiles - Where Firefox stores your bookmarks, passwords and other user data, or do a systemwide search for "prefs.js" or "search.json" or "cookies.sqlite" and it will probably lead you to your profile.

Inside yourprofile/chrome/userContent.css (create it if it doesn't already exist), add these lines:

@namespace url(http://www.w3.org/1999/xhtml);
@-moz-document domain("youtube.com"){
#theater-background { display:none !important;}}

Now restart Firefox, and flashblock should work again, at least on YouTube. Hurray!

Wait, flash? What about HTML5 on YouTube?

Yes, I read that too. All the tech press sites were reporting week before last that YouTube was now streaming HTML5 by default.

Alas, not with Firefox. It works with most other browsers, but Firefox's HTML5 video support is too broken. And I guess it's a measure of Firefox's increasing irrelevance that almost none of the reportage two weeks ago even bothered to try it on Firefox before reporting that it worked everywhere.

It turns out that using HTML5 video on YouTube depends on something called Media Source Extensions (MSE). You can check your MSE support by going to YouTube's HTML5 info page. In Firefox 35, it's off by default.

You can enable MSE in Firefox by flipping the media.mediasource preference, but that's not enough; YouTube also wants "MSE & H2.64". Apparently if you care enough, you can set a new preference to enable MSE & H2.64 support on YouTube even though it's not supported by Firefox and is considered too buggy to enable.

If you search the web, you'll find lots of people talking about how HTML5 with MSE is enabled by default for Firefox 32 on youtube. But here we are at Firefox 35 and it requires jumping through hoops. What gives?

Well, it looks like they enabled it briefly, discovered it was too buggy and turned it back off again. I found bug 1129039: Disable MSE for Firefox 36, which seems an odd title considering that it's off in Firefox 35, but there you go.

Here is the dependency tree for the MSE tracking bug, 778617. Its dependency graph is even scarier. After taking a look at that, I switched my media.mediasource preference back off again. With a dependency tree like that, and nothing anywhere summarizing the current state of affairs ... I think I can live with flash. Especially now that I know how to get flashblock working.

Tags: ,
[ 17:08 Feb 09, 2015    More tech/web | permalink to this entry | comments ]

Thu, 30 Oct 2014

Simulating a web page timeout

Today dinner was a bit delayed because I got caught up dealing with an RSS feed that wasn't feeding. The website was down, and Python's urllib2, which I use in my "feedme" RSS fetcher, has an inordinately long timeout.

That certainly isn't the first time that's happened, but I'd like it to be the last. So I started to write code to set a shorter timeout, and realized: how does one test that? Of course, the offending site was working again by the time I finished eating dinner, went for a little walk then sat down to code.

I did a lot of web searching, hoping maybe someone had already set up a web service somewhere that times out for testing timeout code. No such luck. And discussions of how to set up such a site always seemed to center around installing elaborate heavyweight Java server-side packages. Surely there must be an easier way!

How about PHP? A web search for that wasn't helpful either. But I decided to try the simplest possible approach ... and it worked!

Just put something like this at the beginning of your HTML page (assuming, of course, your server has PHP enabled):

<?php sleep(500); ?>

Of course, you can adjust that 500 to be any delay you like.

Or you can even make the timeout adjustable, with a few more lines of code:

<?php
 if (isset($_GET['timeout']))
     sleep($_GET['timeout']);
 else
     sleep(500);
?>

Then surf to yourpage.php?timeout=6 and watch the page load after six seconds.

Simple once I thought of it, but it's still surprising no one had written it up as a cookbook formula. It certainly is handy. Now I just need to get some Python timeout-handling code working.

Tags: , , ,
[ 19:38 Oct 30, 2014    More tech/web | permalink to this entry | comments ]

Thu, 11 Sep 2014

Making emailed LinkedIn discussion thread links actually work

I don't use web forums, the kind you have to read online, because they don't scale. If you're only interested in one subject, then they work fine: you can keep a browser tab for your one or two web forums perenially open and hit reload every few hours to see what's new. If you're interested in twelve subjects, each of which has several different web forums devoted to it -- how could you possibly keep up with that? So I don't bother with forums unless they offer an email gateway, so they'll notify me by email when new discussions get started, without my needing to check all those web pages several times per day.

LinkedIn discussions mostly work like a web forum. But for a while, they had a reasonably usable email gateway. You could set a preference to be notified of each new conversation. You still had to click on the web link to read the conversation so far, but if you posted something, you'd get the rest of the discussion emailed to you as each message was posted. Not quite as good as a regular mailing list, but it worked pretty well. I used it for several years to keep up with the very active Toastmasters group discussions.

About a year ago, something broke in their software, and they lost the ability to send email for new conversations. I filed a trouble ticket, and got a note saying they were aware of the problem and working on it. I followed up three months later (by filing another ticket -- there's no way to add to an existing one) and got a response saying be patient, they were still working on it. 11 months later, I'm still being patient, but it's pretty clear they have no intention of ever fixing the problem.

Just recently I fiddled with something in my LinkedIn prefs, and started getting "Popular Discussions" emails every day or so. The featured "popular discussion" is always something stupid that I have no interest in, but it's followed by a section headed "Other Popular Discussions" that at least gives me some idea what's been posted in the last few days. Seemed like it might be worth clicking on the links even though it means I'd always be a few days late responding to any conversations.

Except -- none of the links work. They all go to a generic page with a red header saying "Sorry it seems there was a problem with the link you followed."

I'm reading the plaintext version of the mail they send out. I tried viewing the HTML part of the mail in a browser, and sure enough, those links worked. So I tried comparing the text links with the HTML:

Text version:
http://www.linkedin.com/e/v2?e=3x1l-hzwzd1q8-6f&amp;t=gde&amp;midToken=AQEqep2nxSZJIg&amp;ek=b2_anet_digest&amp;li=82&amp;m=group_discussions&amp;ts=textdisc-6&amp;itemID=5914453683503906819&amp;itemType=member&amp;anetID=98449
HTML version:
http://www.linkedin.com/e/v2?e=3x1l-hzwzd1q8-6f&t=gde&midToken=AQEqep2nxSZJIg&ek=b2_anet_digest&li=17&m=group_discussions&ts=grouppost-disc-6&itemID=5914453683503906819&itemType=member&anetID=98449

Well, that's clear as mud, isn't it?

HTML entity substitution

I pasted both links one on top of each other, to make it easier to compare them one at a time. That made it fairly easy to find the first difference:

Text version:
http://www.linkedin.com/e/v2?e=3x1l-hzwzd1q8-6f&amp;t=gde&amp;midToken= ...
HTML version:
http://www.linkedin.com/e/v2?e=3x1l-hzwzd1q8-6f&t=gde&midToken= ...

Time to die laughing: they're doing HTML entity substitution on the plaintext part of their email notifications, changing & to &amp; everywhere in the link.

If you take the link from the text email and replace &amp; with &, the link works, and takes you to the specific discussion.

Pagination

Except you can't actually read the discussion. I went to a discussion that had been open for 2 days and had 35 responses, and LinkedIn only showed four of them. I don't even know which four they are -- are they the first four, the last four, or some Facebook-style "four responses we thought you'd like". There's a button to click on to show the most recent entries, but then I only see a few of the most recent responses, still not the whole thread.

Hooray for the web -- of course, plenty of other people have had this problem too, and a little web searching unveiled a solution. Add a pagination token to the end of the URL that tells LinkedIn to show 1000 messages at once.

&count=1000&paginationToken=
It won't actually show 1000 (or all) responses -- but if you start at the beginning of the page and scroll down reading responses one by one, it will auto-load new batches. Yes, infinite scrolling pages can be annoying, but at least it's a way to read a LinkedIn conversation in order.

Making it automatic

Okay, now I know how to edit one of their URLs to make it work. Do I want to do that by hand any time I want to view a discussion? Noooo!

Time for a script! Since I'll be selecting the URLs from mutt, they'll be in the X PRIMARY clipboard. And unfortunately, mutt adds newlines so I might as well strip those as well as fixing the LinkedIn problems. (Firefox will strip newlines for me when I paste in a multi-line URL, but why rely on that?)

Here's the important part of the script:

import subprocess, gtk

primary = gtk.clipboard_get(gtk.gdk.SELECTION_PRIMARY)
if not primary.wait_is_text_available() :
    sys.exit(0)
link = primary.wait_for_text()
link = link.replace("\n", "").replace("&amp;", "&") + \
       "&count=1000&paginationToken="
subprocess.call(["firefox", "-new-tab", link])

And here's the full script: linkedinify on GitHub. I also added it to pyclip, the script I call from Openbox to open a URL in Firefox when I middle-click on the desktop.

Now I can finally go back to participating in those discussions.

Tags: , , ,
[ 13:10 Sep 11, 2014    More tech/web | permalink to this entry | comments ]

Sat, 21 Jun 2014

Mirror a website using lftp

I'm helping an organization with some website work. But I'm not the only one working on the website, and there's no version control. I wanted an easy way to make sure all my files were up-to-date before I start to work on one ... a way to mirror the website, or at least specific directories, to my local disk.

Normally I use rsync -av over ssh to mirror directories, but this website is on a server that only offers ftp access. I've been using ncftp to copy files up one by one, but although ncftp's manual says it has a mirror mode and I found a few web references to that, I couldn't find anything telling me how to activate it.

Making matters worse, there are some large files that I don't need to mirror. The first time I tried to use get * in ncftp to get one directory, it spent 15 minutes trying to download a huge powerpoint file, then stalled and lost the connection. There are some big .doc and .docx files, too. And ncftp doesn't seem to have a way to exclude specific files.

Enter lftp. It has a mirror mode (with documentation, even!) which includes a -X to exclude files matching specified patterns.

lftp includes a -e to pass commands -- like "mirror" -- to it on the command line. But the documentation doesn't say whether you can use more than one command at a time. So it seemed safer to start up an lftp session and pass a series of commands to it.

And that works nicely. Just set up the list of directories you want to mirror, and you can write a nice shell function you can put in your. .zshrc or .bashrc:

sitemirror() {
commands=""
for dir in thisdir thatdir theotherdir
do
  commands="$commands
mirror --only-newer -vvv -X '*.ppt' -X '*.doc*' -X '*.pdf' htdocs/$dir $HOME/web/webmirror/$dir"
done

echo Commands to be run:
echo $commands
echo

lftp <<EOF
open -u 'user,password' ftp.example.com
$commands
bye
EOF
}

Super easy -- all I do is type sitemirror and wait a little. Now I don't have any excuse for not being up to date.

Tags: , ,
[ 12:39 Jun 21, 2014    More tech/web | permalink to this entry | comments ]

Sat, 20 Jul 2013

Make Firefox warn you of specific types of links before you click

Sometimes when I middleclick on a Firefox link to open it in a new tab, I get an empty new tab. I hate that.

It happens most often on Javascript links. For instance, suppose a website offers a Help link next to the link I'm trying to use. I don't know what type of link it is; if it's a normal link, to an HTML page, then it may open in my current tab, overwriting the form I just spent five minutes filling out. So I want to middleclick it, so it will open in a new tab. On the other hand, if it's a Javascript link that pops up a new help window, middleclicking won't work at all; all it does is open an empty new tab, which I'll have to close.

A similar effect happens on PDF links; in that case, middleclicking gives me the "What do you want to do with this?" dialog but I also get a new tab that I have to close. (Though I'm not sure what happens with Firefox's new built-in PDF reader.)

Anyway, since there seems to be no way of making middleclick just do the sensible thing and open these links in a new tab like I asked, it, I can do something almost as good: a user stylesheet that warns me when I'm about to click on one of these special links. This rule changes the cursor to a crosshair, and turns the link bold with colors of red on yellow. Hard to miss!

I put this into userContent.css, inside the chrome directory inside my profile:

/*
 * Make it really obvious when links are javascript,
 * since middleclicking javascript links doesn't do anything
 * except open an empty new tab that then has to be closed.
 */
a:hover[href^="javascript"] {
  cursor: crosshair; font-weight: bold;
  text-decoration: blink;
  color: red; background-color: yellow
  !important
}

/*
 * And the same for PDFs, for the same reason.
 * Sadly, we can't catch all PDFs, just the ones where the actual
 * filename ends in .pdf.
 * Apparently there's no way to make a selector case insensitive,
 * so we have separate cases for .pdf and .PDFb
 */
a:hover[href$=".pdf"], a:hover[href$=".PDF"] {
  cursor: crosshair;
  color: red; background-color: yellow
  !important
}

In selectors, ^="javascript" means "starts with javascript", for links like javascript:do_something(). $=".pdf" means "ends with .pdf". If you want to match a string anywhere inside the href, *= means "contains".

What about that crosshair cursor? Here are some of the cursors you can use: Mozilla's cursor documentation page. Don't trust the images on that page -- hover over each cursor to see what your actual browser shows.

You can also warn about links that would open a new window or tab. If you prefer to keep control of that, rather than letting each web page designer decide for you where each link should open, you can control it with the browser.link.open newwindow preference. But whatever you do with that preference you can add a rule for a:hover[target="_blank"] to help you notice links that are likely to open in a new tab.

You can even make these special links blink, with text-decoration: blink. Assuming you're not a curmudgeon like I am who disables blinking entirely by setting the "browser.blink_allowed" preference to false.

Tags: , , , ,
[ 20:26 Jul 20, 2013    More tech/web | permalink to this entry | comments ]

Syndicated on:
LinuxChix Live
Ubuntu Women
Women in Free Software
Graphics Planet
DevChix
Ubuntu California
Planet Openbox
Devchix
Planet LCA2009

Friends' Blogs:
Morris "Mojo" Jones
Jane Houston Jones
Dan Heller
Long Live the Village Green
Ups & Downs
DailyBBG

Other Blogs of Interest:
DevChix
Scott Adams
Dave Barry
BoingBoing

Powered by PyBlosxom.