Shallow Thoughts : tags : charsets

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 24 Jun 2017

Mutt: Fixing Erroneous Charsets, part 632

Someone forwarded me a message from the Albuquerque Journal. It was all about "New Mexico\222s schools".

Sigh. I thought I'd gotten all my Mutt charset problems fixed long ago. My system locale is set to en_US.UTF-8, and accented characters in Spanish and in people's names usually show up correctly. But I do see this every now and then.

When I see it, I usually assume it's a case of incorrect encoding: whoever sent it perhaps pasted characters from a Windows Word document or something, and their mailer didn't properly re-encode them into the charset they were using to send the message.

In this case, the message had User-Agent: SquirrelMail/1.4.13. I suspect it came from a "Share this" link on the newspaper's website.

I used vim to look at the source of the message, and it had

Content-Type: text/plain; charset=iso-8859-1

For the bad characters, in vim I saw things like

New Mexico<92>s schools

I checked an old web page I'd bookmarked years ago that had a table of the iso-8859-1 characters, and sure enough, hex 0x92 was an apostrophe. What was wrong?

I got some help on the #mutt IRC channel, and, to make a long story short, that web table I was using was wrong. ISO-8859-1 doesn't define any printable characters in the 0x80-0x9F range, as you can see in the Wikipedia article on ISO/IEC 8859-1.

What was happening was that the message was really cp1252: that's the charset that defines those extra characters, like hex 92/octal 222 for an apostrophe, or hex 96/octal 226 for a dash. (Nitpick: that's an en dash, but it was used in a context that called for an em dash; if someone is going to use something other than the plain old ASCII dash, you'd think they'd at least use the right one. Sheesh!)
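
If you want to see the difference for yourself, here's a quick Python check (my own illustration, nothing taken from the message itself): decode the same 0x92 byte both ways.

raw = b"New Mexico\x92s schools"

# iso-8859-1 maps 0x80-0x9F to invisible C1 control codes, so the
# "apostrophe" byte decodes to an unprintable character:
print(repr(raw.decode("iso-8859-1")))    # 'New Mexico\x92s schools'

# cp1252 maps 0x92 to a right single quotation mark (a curly apostrophe):
print(raw.decode("cp1252"))              # New Mexico's schools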

Anyway, the fix for this is to tell mutt that when it sees iso-8859-1, it should use cp1252 instead:

charset-hook iso-8859-1 cp1252

Voilà! Now I could read the article about New Mexico's schools.

A happy find related to this: it turns out there's a better way of looking up ISO-8859 tables, and I can ditch that bookmark to the old, erroneous page. I've known about man ascii forever, but somehow I'd never thought to try other charsets. Turns out man iso_8859-1 and man iso_8859-15 have built-in tables too. Nice!

(Sadly, man utf-8 doesn't give a table. Of course, that would be a long man page, if it did!)

[ 11:06 Jun 24, 2017    More linux | permalink to this entry | ]

Wed, 25 Nov 2009

Character Sets and Encodings in Linux, part 2

Continuing the discussion of those funny characters you sometimes see in email or on web pages, today's Linux Planet article discusses how to convert and handle encoding errors, using Python or the command-line tool recode:

Mastering Characters Sets in Linux (Weird Characters, part 2).
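
If you just want a taste of the Python side, here's a tiny sketch of mine (not an excerpt from the article): converting is a decode followed by an encode, and the error handler decides what to do with characters the target charset can't hold.

# Converting Latin-1 (ISO-8859-1) bytes to UTF-8 is a decode then an encode:
latin1_bytes = b"\xbfQu\xe9 tal?"            # 0xBF and 0xE9 are Latin-1's inverted question mark and e-acute
text = latin1_bytes.decode("iso-8859-1")     # bytes -> Unicode string
utf8_bytes = text.encode("utf-8")            # Unicode string -> UTF-8 bytes
print(utf8_bytes)

# Going the other way, curly quotes don't exist in Latin-1, so an error
# handler decides what happens: "replace" substitutes '?', "ignore" drops them.
fancy = "Caf\u00e9 \u201copen late\u201d"
print(fancy.encode("iso-8859-1", errors="replace"))   # b'Caf\xe9 ?open late?'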

[ 15:06 Nov 25, 2009    More writing | permalink to this entry | ]

Thu, 12 Nov 2009

Article: Character Sets and Encodings in Linux

or: Why do I See All Those Weird Characters?

Today's Linux Planet article concerns those funny characters you sometimes see in email or on web pages, like when somebody puts “random squiggles” around a phrase when they probably meant “double quotes”:

Character Sets in Linux or: Why do I See Those Weird Characters?.

Today's article covers only what users need to know. A followup article will discuss character encoding from a programmer's point of view.

[ 16:34 Nov 12, 2009    More writing | permalink to this entry | ]

Wed, 21 Oct 2009

Un-unicode: translating web pages to plain ASCII

It's not that I'm a dumb provincial American, really!

I mean, okay, I am a dumb provincial American. But not completely. I know about Unicode, I know what UTF-8 and ISO-8859-1 and -15 are, I even know how to type Spanish characters like ñ and á in email (at least in Ubuntu; I can't seem to make it work in Gentoo).

The real problem is PalmOS -- I've never found any way to create Plucker files for my Palm that display anything beyond the standard ASCII character set. (I'm not clear whether to blame that on Palm or Plucker. Doesn't matter.)

So when I use a program like Sitescooper or my new FeedMe RSS reader to read daily news on my Palm, I'm forever seeing lines like this:

the weather phenomenon known as ÅoEl Ni€oÅq is

It's tiresome to try to read stuff like that.

Strangely, I've found no libraries to do this, in any language. There are lots of ways to translate from one character encoding into another -- but no way to degrade from non-ASCII characters to the nearest ASCII equivalent. Googling finds lots of people asking for the same thing -- I'm far from the only one who wants this. There are various partial hacks, but nothing ready-to-go.
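
(An example of the kind of partial hack I mean -- my own sketch, not from any particular library: Python's unicodedata can strip accents via NFKD decomposition, but it just drops things like curly quotes and dashes instead of substituting something readable.)

import unicodedata

def strip_to_ascii(text):
    # Sketch of a "partial hack": decompose accented characters
    # (n-with-tilde becomes n + combining tilde), then throw away
    # anything that still isn't ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_to_ascii("El Ni\u00f1o \u2014 \u201cfun\u201d"))
# prints 'El Nino  fun': the accent survives as a plain n, but the dash and quotes vanish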

Oh, well, welcome to the programming world. Time to roll my own. I started from some nice tricks I picked up in the web discussions I found, and ended up with something reasonably compact. Of course, the table of fallback characters will grow.

But my ace in the hole, this time, is that my little function has a way of logging errors. When it sees a character it doesn't recognize, it can log the character code to a file, making it easy to add a translation for that character. That was always the problem with similar hacks I'd attempted to add to mutt or plucker or sitescooper in the past: figuring out each new character and what its intended meaning was, so I could add it to the translation table.
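
The shape of it is roughly this -- a simplified sketch of the approach, not the actual ununicode code, which is linked below:

# Simplified sketch of the idea -- not the real ununicode module.
# A fallback table plus error logging for characters not yet in the table.
ASCII_FALLBACKS = {
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u00f1": "n",    # n with tilde
}

def toascii_sketch(text, errfilename=None):
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                       # plain ASCII passes through
        elif ch in ASCII_FALLBACKS:
            out.append(ASCII_FALLBACKS[ch])      # known character: substitute
        else:
            out.append("?")                      # unknown: substitute and log it
            if errfilename:
                with open(errfilename, "a") as errfile:
                    errfile.write("Unknown character U+%04X\n" % ord(ch))
    return "".join(out)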

Here it is: ununicode.

Call it like this:

import ununicode

ununicode.toascii(text, errfilename="/path/to/errfile")

There's also a minimal test script provided (which will also grow with time as I accumulate good samples).

[ 20:48 Oct 21, 2009    More programming | permalink to this entry | ]