Someone forwarded me a message from the Albuquerque Journal. It was all about "New Mexico\222s schools".
Sigh. I thought I'd gotten all my Mutt charset problems fixed long ago. My system locale is set to en_US.UTF-8, and accented characters in Spanish and in people's names usually show up correctly. But I do see this every now and then.
When I see it, I usually assume it's a case of incorrect encoding: whoever sent it perhaps pasted characters from a Windows Word document or something, and their mailer didn't properly re-encode them into the charset they were using to send the message.
In this case, the message had
I suspect it came from a "Share this" link on the newspaper's website.
I used vim to look at the source of the message, and it had
Content-Type: text/plain; charset=iso-8859-1
For the bad characters, in vim I saw things like
New Mexico<92>s schools
I checked an old web page I'd bookmarked years ago that had a table of the iso-8859-1 characters, and sure enough, hex 0x92 was an apostrophe. What was wrong?
I got some help on the #mutt IRC channel, and, to make a long story short, that web table I was using was wrong. ISO-8859-1 doesn't define any printable characters in the range 0x80-0x9F, as you can see on Wikipedia's ISO/IEC 8859-1 page.
What was happening was that the page was really cp1252: that's where those extra characters come from, like hex 92/octal 222 for an apostrophe, or hex 96/octal 226 for a dash. (Nitpick: that's an en dash, but it was used in a context that called for an em dash; if someone is going to use something other than the plain old ASCII dash - you'd think they'd at least use the right one. Sheesh!)
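To see the difference between the two charsets concretely, here is a minimal Python sketch. It assumes Python 3's built-in codecs; the sample bytes are reconstructed from the article text quoted above, not taken from the actual message.

```python
# The byte sequence as it appeared in the message source: 0x92 is octal 222.
raw = b"New Mexico\x92s schools"

# Decoded as the charset the message claimed, 0x92 turns into U+0092,
# a C1 control character with no printable glyph.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as cp1252, the same byte is U+2019, a right single quotation mark.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(as_cp1252)

# The dash from the nitpick: 0x96 (octal 226) is U+2013, an en dash.
print(repr(b"\x96".decode("cp1252")))
```

Printing with repr() makes the otherwise invisible control character show up as an escape, much like vim's <92>.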
Anyway, the fix for this is to tell mutt that when it sees iso-8859-1, it should use cp1252 instead:
charset-hook iso-8859-1 cp1252
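The idea behind that hook can be sketched in Python. This is only an illustration of the aliasing idea, not how mutt implements it; CHARSET_ALIASES and decode_body are hypothetical names made up for this example.

```python
# Hypothetical alias table, analogous in spirit to the charset-hook line:
# when a message declares the key, decode it as the value instead.
CHARSET_ALIASES = {"iso-8859-1": "cp1252"}


def decode_body(payload: bytes, declared_charset: str) -> str:
    """Decode a message body, substituting an alias for the declared charset."""
    charset = declared_charset.strip().lower()
    actual = CHARSET_ALIASES.get(charset, charset)
    # A few bytes (such as 0x81) are unassigned in Python's cp1252 codec,
    # so replace undecodable bytes rather than raising an exception.
    return payload.decode(actual, errors="replace")


print(decode_body(b"New Mexico\x92s schools", "ISO-8859-1"))
```

The substitution is safe for genuine ISO-8859-1 text because the two charsets agree on every printable character from 0xA0 upward; they differ only in the 0x80-0x9F range.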
Voilà! Now I could read the article about New Mexico's schools.
A happy find related to this: it turns out there's a better way of looking up ISO-8859 tables, and I can ditch that bookmark to the old, erroneous page. I've known about man ascii forever, but somehow I'd never thought to try other charsets. Turns out man iso_8859-1 and related pages have built-in tables too. Nice! (man utf-8 doesn't give a table. Of course, that would be a long man page, if it did!)
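The cp1252 extras can also be listed directly. Here is a small Python sketch that prints the characters cp1252 assigns to 0x80-0x9F, the range where it differs from ISO-8859-1; cp1252_extras is a hypothetical helper name, and the result reflects whatever mapping Python's cp1252 codec ships with.

```python
def cp1252_extras():
    """Return (byte, character) pairs for the bytes cp1252 assigns in 0x80-0x9F."""
    rows = []
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # this byte is unassigned in Python's cp1252 codec
        rows.append((byte, char))
    return rows


# Columns: octal, hex, Unicode code point, character.
for byte, char in cp1252_extras():
    print(f"{byte:03o}  {byte:02X}  U+{ord(char):04X}  {char}")
```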
[ 11:06 Jun 24, 2017 ]