Continuing the discussion of those funny characters you sometimes
see in email or on web pages, today's Linux Planet article
covers how to convert between character encodings and handle
encoding errors, using Python or the command-line tool recode:
Mastering Characters Sets in Linux (Weird Characters, part 2).
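For a quick taste of the Python side, here is a minimal sketch of converting between encodings and handling a decoding error. It is not code from the article; it assumes Python 3, and the sample bytes and variable names are made up for illustration.

```python
# Bytes for "El Niño" encoded as ISO-8859-1 (Latin-1); 0xF1 is the ñ.
raw = b"El Ni\xf1o"

# Decoding with the right charset recovers the text.
text = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails: 0xF1 followed by "o" is not
# a valid UTF-8 sequence, so the default strict mode raises an error.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the bad byte instead of raising.
replaced = raw.decode("utf-8", errors="replace")

# Converting between encodings is a decode followed by an encode.
as_utf8 = text.encode("utf-8")
```

The same decode-then-encode round trip is roughly what a command-line converter does for you; the difference in Python is that you choose what happens to bytes that don't fit.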
Tags: writing, linux, unicode, i18n, charsets, ascii, programming, python
[ 15:06 Nov 25, 2009 ]
or: Why do I See All Those Weird Characters?
Today's Linux Planet article concerns those funny characters you sometimes
see in email or on web pages, like when somebody puts
“random squiggles’ around a phrase
when they probably meant “double quotes”:
Character Sets in Linux or: Why do I See Those Weird Characters?
Today's article covers only what users need to know.
A followup article will discuss character encoding
from a programmer's point of view.
Tags: writing, linux, unicode, i18n, charsets, ascii
[ 16:34 Nov 12, 2009 ]
It's not that I'm a dumb provincial American, really!
I mean, okay, I am a dumb provincial American. But not completely.
I know about Unicode, I know what UTF-8 and ISO-8859-1 and -15 are,
I even know how to type Spanish characters like ñ and á
in email (at least in Ubuntu; I can't seem to make it work in Gentoo).
The real problem is PalmOS --
I've never found any way to create Plucker files for
my Palm that display anything beyond the standard ASCII character set.
(I'm not clear whether to blame that on Palm or Plucker. Doesn't matter.)
So when I use a program like Sitescooper or my new FeedMe RSS reader
to read daily news on my Palm, I'm forever seeing lines like this:
the weather phenomenon known as ÅoEl Ni€oÅq is
It's tiresome to try to read stuff like that.
What I want is to convert those characters to plain ASCII, and strangely,
I've found no libraries to do this, in any language.
There are lots of ways to translate from one character encoding into
another -- but no way to degrade from non-ASCII characters to the
nearest ASCII equivalent. Googling finds lots of people asking
for such a library -- I'm far from the only one who wants this.
There are various partial hacks, but nothing ready-to-go.
Oh, well, welcome to the programming world. Time to roll my own.
I started from some nice tricks I picked up in the web discussions
I found, and ended up with something reasonably compact.
Of course, the table of fallback characters will grow.
But my ace in the hole, this time, is that my little function has a
way of logging errors. When it sees a character it doesn't recognize,
it can log the character code to a file, making it easy to add a
translation for that character. That was always the problem with
similar hacks I'd attempted to add to mutt or plucker or sitescooper
in the past: figuring out each new character and what its intended
meaning was, so I could add it to the translation table.
Here it is: ununicode.
Call it like this:
import os
import ununicode
# text is the string to convert; unrecognized characters get logged to errfilename
ununicode.toascii(text, errfilename=os.path.join("/path/to/errfile"))
There's also a minimal test script provided (which will also grow with
time as I accumulate good samples).
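To make the idea concrete, here is a rough sketch of the general approach described above: a fallback table, plus logging of anything the table doesn't cover. This is not the ununicode source. The function name `degrade_to_ascii`, the table entries, and the log format are all made up for illustration, it assumes Python 3, and the NFKD accent-stripping step is a common trick that ununicode may or may not use.

```python
import unicodedata

# Hypothetical fallback table for characters with no useful decomposition.
# These entries are illustrative only, not taken from ununicode.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
    "\u2026": "...",                 # ellipsis
}

def degrade_to_ascii(text, errfilename=None):
    """Return text with non-ASCII characters replaced by ASCII stand-ins."""
    out = []
    unknown = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in FALLBACKS:
            out.append(FALLBACKS[ch])
        else:
            # NFKD splits an accented letter into a base letter plus
            # combining marks; dropping the non-ASCII parts keeps the base.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = decomposed.encode("ascii", "ignore").decode("ascii")
            if base:
                out.append(base)
            else:
                out.append("?")
                unknown.append(ch)
    # Log unrecognized characters so they can be added to the table later.
    if unknown and errfilename:
        with open(errfilename, "a", encoding="utf-8") as errfile:
            for ch in unknown:
                name = unicodedata.name(ch, "UNKNOWN")
                errfile.write("U+%04X %s\n" % (ord(ch), name))
    return "".join(out)
```

With this sketch, a curly-quoted "El Niño" comes out as a straight-quoted "El Nino", while a character like the euro sign, which has no table entry and no ASCII base letter, becomes "?" and gets its code point written to the log file.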
Tags: unicode, i18n, charsets, ascii, palm
[ 20:48 Oct 21, 2009 ]