Shallow Thoughts : tags : unicode

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Wed, 25 Nov 2009

Character Sets and Encodings in Linux, part 2

Continuing the discussion of those funny characters you sometimes see in email or on web pages, today's Linux Planet article discusses how to convert and handle encoding errors, using Python or the command-line tool recode:

Mastering Characters Sets in Linux (Weird Characters, part 2).
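
For a quick taste of what "handling encoding errors" can mean in Python, here is a minimal sketch. It is not taken from the article; the sample string and variable names are made up for illustration, and it assumes Python 3.

```python
# Bytes that are valid ISO-8859-1 (Latin-1) but not valid UTF-8.
raw = "El Niño".encode("iso-8859-1")

# Strict decoding with the wrong codec raises UnicodeDecodeError...
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...while errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the codec the bytes were actually written in recovers the text.
recovered = raw.decode("iso-8859-1")

print(strict_failed, repr(replaced), recovered)
```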

[ 15:06 Nov 25, 2009 ]

Thu, 12 Nov 2009

Article: Character Sets and Encodings in Linux

or: Why do I See All Those Weird Characters?

Today's Linux Planet article concerns those funny characters you sometimes see in email or on web pages, like when somebody puts “random squiggles’ around a phrase when they probably meant “double quotes”:

Character Sets in Linux or: Why do I See Those Weird Characters?

Today's article covers only what users need to know. A followup article will discuss character encoding from a programmer's point of view.
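
If you're curious where the squiggles come from, the usual culprit is text written as UTF-8 but read back as some single-byte encoding. Here is a small sketch of that effect, assuming Python 3 and assuming Windows-1252 (cp1252) as the wrong encoding; the sample strings are just illustrations.

```python
# UTF-8 stores "ñ" and the curly quote "\u201c" as multi-byte sequences.
# Reading those bytes back as cp1252 turns each byte into its own character.
garbled_n = "El Niño".encode("utf-8").decode("cp1252")
garbled_quote = "\u201c".encode("utf-8").decode("cp1252")

# As long as nothing was lost along the way, reversing the two steps
# recovers the original text.
repaired = garbled_n.encode("cp1252").decode("utf-8")

print(garbled_n)      # the n-with-tilde shows up as two junk characters
print(garbled_quote)  # one curly quote shows up as three junk characters
print(repaired)
```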

[ 16:34 Nov 12, 2009 ]

Wed, 21 Oct 2009

Un-unicode: translating web pages to plain ASCII

It's not that I'm a dumb provincial American, really!

I mean, okay, I am a dumb provincial American. But not completely. I know about Unicode, I know what UTF-8 and ISO-8859-1 and -15 are, I even know how to type Spanish characters like ñ and á in email (at least in Ubuntu; I can't seem to make it work in Gentoo).

The real problem is PalmOS -- I've never found any way to create Plucker files for my Palm that display anything beyond the standard ASCII character set. (I'm not clear whether to blame that on Palm or Plucker. Doesn't matter.)

So when I use a program like Sitescooper or my new FeedMe RSS reader to read daily news on my Palm, I'm forever seeing lines like this:

the weather phenomenon known as ÅoEl Ni€oÅq is

It's tiresome to try to read stuff like that.

What I want is something that turns those characters into plain ASCII. Strangely, I've found no libraries to do this, in any language. There are lots of ways to translate from one character encoding into another -- but no way to degrade from non-ASCII characters to the nearest ASCII equivalent. Googling finds lots of people asking for one -- I'm far from the only one who wants this. There are various partial hacks, but nothing ready to go.

Oh, well, welcome to the programming world. Time to roll my own. I started from some nice tricks I picked up in the web discussions I found, and ended up with something reasonably compact. Of course, the table of fallback characters will grow.

But my ace in the hole, this time, is that my little function has a way of logging errors. When it sees a character it doesn't recognize, it can log the character code to a file, making it easy to add a translation for that character. That was always the problem with similar hacks I'd attempted to add to mutt or plucker or sitescooper in the past: figuring out each new character and what its intended meaning was, so I could add it to the translation table.

Here it is: ununicode.

Call it like this:

import os
import ununicode

# text is the string to convert; unrecognized characters get logged to errfilename
ununicode.toascii(text, errfilename=os.path.join("/path/to", "errfile"))

There's also a minimal test script provided (which will also grow with time as I accumulate good samples).
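
For anyone who just wants the general idea, here is a rough sketch of the approach described above: a small fallback table, a generic accent-stripping step, and a log of anything left over. This is not the actual ununicode source. The names to_ascii_sketch and FALLBACKS and the table entries are hypothetical, the accent-stripping step uses the standard unicodedata module, and the code assumes Python 3.

```python
import unicodedata

# Hypothetical fallback table; a real one would grow as new characters turn up.
FALLBACKS = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2014": "--",                 # em dash
    "\u2026": "...",                # ellipsis
}

def to_ascii_sketch(text, errfilename=None):
    """Degrade text to plain ASCII, logging characters with no known fallback."""
    out = []
    unknown = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in FALLBACKS:
            out.append(FALLBACKS[ch])
        else:
            # Decompose (n-with-tilde becomes n + combining tilde), then
            # drop whatever is still outside ASCII.
            decomposed = unicodedata.normalize("NFKD", ch)
            stripped = decomposed.encode("ascii", "ignore").decode("ascii")
            if stripped:
                out.append(stripped)
            else:
                unknown.append(ch)
                out.append("?")
    if unknown and errfilename:
        # Append the code points so translations can be added to the table later.
        with open(errfilename, "a", encoding="ascii") as errfile:
            for ch in unknown:
                errfile.write("U+%04X\n" % ord(ch))
    return "".join(out)

print(to_ascii_sketch("known as \u201cEl Ni\u00f1o\u201d is"))
```

Characters the table and the decomposition step both miss come out as "?" here, which is an arbitrary choice for the sketch; the point is that their code points end up in the log file when one is given.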

[ 20:48 Oct 21, 2009 ]