Mutt: Fixing Erroneous Charsets, part 632 (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 24 Jun 2017

Mutt: Fixing Erroneous Charsets, part 632

Someone forwarded me a message from the Albuquerque Journal. It was all about "New Mexico\222s schools".

Sigh. I thought I'd gotten all my Mutt charset problems fixed long ago. My system locale is set to en_US.UTF-8, and accented characters in Spanish and in people's names usually show up correctly. But I do see this every now and then.

When I see it, I usually assume it's a case of incorrect encoding: whoever sent it perhaps pasted characters from a Windows Word document or something, and their mailer didn't properly re-encode them into the charset they were using to send the message.

In this case, the message had User-Agent: SquirrelMail/1.4.13. I suspect it came from a "Share this" link on the newspaper's website.

I used vim to look at the source of the message, and it had

Content-Type: text/plain; charset=iso-8859-1
For the bad characters, in vim I saw things like
New Mexico<92>s schools

I checked an old web page I'd bookmarked years ago that had a table of the iso-8859-1 characters, and sure enough, hex 0x92 was an apostrophe. What was wrong?

I got some help on the #mutt IRC channel, and, to make a long story short, that web table I was using was wrong. ISO-8859-1 doesn't include any characters in the range 8x-9x, as you can see on the Wikipedia ISO/IEC 8859-1.

What was happening was that the page was really cp1252: that's where those extra characters, like hex 92/octal 222 for an apostrophe, or hex 96/octal 226 for a dash (nitpick: that's an en dash, but it was used in a context that called for an em dash; if someone is going to use something other than the plain old ASCII dash - you'd think they'd at least use the right one. Sheesh!)

Anyway, the fix for this is to tell mutt when it sees iso-8859-1, use cp1252 instead:

charset-hook iso-8859-1 cp1252

Voilà! Now I could read the article about New Mexico's schools.

A happy find related to this: it turns out there's a better way of looking up ISO-8859 tables, and I can ditch that bookmark to the old, erroneous page. I've known about man ascii forever, but someone I'd never thought to try other charsets. Turns out man iso_8859-1 and man iso_8859-15 have built-in tables too. Nice!

(Sadly, man utf-8 doesn't give a table. Of course, that would be a long man page, if it did!)

Tags: , ,
[ 11:06 Jun 24, 2017    More linux | permalink to this entry | comments ]
(Commenting requires Javascript from ShallowSky.com and Disqus.com, and a cookie from Disqus.com.)
blog comments powered by Disqus