Someone forwarded me a message from the Albuquerque Journal.
It was all about "New Mexico\222s schools".
Sigh. I thought I'd gotten all my Mutt charset problems fixed long
ago. My system locale is set to en_US.UTF-8, and accented characters
in Spanish and in people's names usually show up correctly.
But I do see this every now and then.
When I see it, I usually assume it's a case of incorrect encoding:
whoever sent it perhaps pasted characters from a Windows Word document
or something, and their mailer didn't properly re-encode them into
the charset they were using to send the message.
In this case, the message had
User-Agent: SquirrelMail/1.4.13
I suspect it came from a "Share this" link on the newspaper's website.
I used vim to look at the source of the message, and it had
Content-Type: text/plain; charset=iso-8859-1
For the bad characters, in vim I saw things like
New Mexico<92>s schools
I checked an old web page I'd bookmarked years ago that had a table
of the iso-8859-1 characters, and sure enough, it listed hex 0x92 as an apostrophe.
What was wrong?
I got some help on the #mutt IRC channel, and, to make a long story
short, that web table I was using was wrong.
ISO-8859-1 doesn't define any printable characters in the range 0x80-0x9F,
as you can see on the Wikipedia page for ISO/IEC 8859-1.
What was happening was that the page was really cp1252: that's the
encoding those extra characters come from, like hex 92/octal 222 for
an apostrophe, or hex 96/octal 226 for a dash. (Nitpick: that's an en
dash, but it was used in a context that called for an em dash; if
someone is going to use something other than the plain old ASCII
dash - you'd think they'd at least use the right one. Sheesh!)
Anyway, the fix for this is to tell mutt that when it sees iso-8859-1,
it should use cp1252 instead:
charset-hook iso-8859-1 cp1252
Voilà! Now I could read the article about
New Mexico's schools.
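If you want to confirm this kind of diagnosis yourself, a few lines of Python (just a sketch, separate from the mutt fix) show how the same byte decodes under each charset:

```python
# Byte 0x92 maps to the invisible C1 control U+0092 in ISO-8859-1,
# but to a right single quotation mark (a curly apostrophe) in CP1252.
raw = b"New Mexico\x92s schools"

as_latin1 = raw.decode("iso-8859-1")   # keeps the control character
as_cp1252 = raw.decode("cp1252")       # gives the intended apostrophe

print(repr(as_latin1))   # 'New Mexico\x92s schools'
print(as_cp1252)         # New Mexico's schools (with a curly apostrophe)
```

The same goes for 0x96: latin-1 gives a control character, cp1252 gives an en dash.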
A happy find related to this: it turns out there's a better way of
looking up ISO-8859 tables, and I can ditch that bookmark to the old,
erroneous page. I've known about man ascii
forever, but
somehow I'd never thought to try other charsets. Turns out
man iso_8859-1
and man iso_8859-15
have built-in tables too. Nice!
(Sadly, man utf-8
doesn't give a table. Of course,
that would be a long man page, if it did!)
Tags: mutt, charsets, linux
[ 11:06 Jun 24, 2017 | More linux | permalink to this entry ]
Continuing the discussion of those funny characters you sometimes
see in email or on web pages, today's Linux Planet article
discusses how to convert and handle encoding errors, using
Python or the command-line tool recode:
Mastering
Characters Sets in Linux (Weird Characters, part 2).
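The Python side of that conversion boils down to a decode/encode pair, with an errors argument controlling what happens when a character doesn't fit. A sketch (mine, not code from the article):

```python
# Convert ISO-8859-1 bytes to UTF-8, then try to force the result
# into plain ASCII, choosing how to handle characters that don't fit.
latin1_bytes = b"na\xefve caf\xe9"          # "naïve café" in ISO-8859-1

text = latin1_bytes.decode("iso-8859-1")    # bytes -> str
utf8_bytes = text.encode("utf-8")           # str -> UTF-8 bytes

# Encoding to ASCII fails by default; errors= picks a fallback:
print(text.encode("ascii", errors="replace"))   # b'na?ve caf?'
print(text.encode("ascii", errors="ignore"))    # b'nave caf'
```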
Tags: writing, linux, unicode, i18n, charsets, ascii, programming, python
[ 15:06 Nov 25, 2009 | More writing | permalink to this entry ]
or: Why do I See All Those Weird Characters?
Today's Linux Planet article concerns those funny characters you sometimes
see in email or on web pages, like when somebody puts
“random squiggles’ around a phrase
when they probably meant “double quotes”:
Character
Sets in Linux or: Why do I See Those Weird Characters?.
Today's article covers only what users need to know.
A followup article will discuss character encoding
from a programmer's point of view.
Tags: writing, linux, unicode, i18n, charsets, ascii
[ 16:34 Nov 12, 2009 | More writing | permalink to this entry ]
It's not that I'm a dumb provincial American, really!
I mean, okay, I am a dumb provincial American. But not completely.
I know about Unicode, I know what UTF-8 and ISO-8859-1 and -15 are,
I even know how to type Spanish characters like ñ and á
in email (at least in Ubuntu; I can't seem to make it work in Gentoo).
The real problem is PalmOS --
I've never found any way to create Plucker files for
my Palm that display anything beyond the standard ASCII character set.
(I'm not clear whether to blame that on Palm or Plucker. Doesn't matter.)
So when I use a program like Sitescooper or my new
FeedMe RSS reader
to read
daily news on my Palm, I'm forever seeing lines like this:
the weather phenomenon known as ÅoEl Ni€oÅq is
It's tiresome to try to read stuff like that.
Strangely, I've found no libraries to do this, in any language.
There are lots of ways to translate from one character encoding into
another -- but no way to degrade from non-ASCII characters to the
nearest ASCII equivalent. Googling finds lots of people asking
for them -- I'm far from the only one who wants this.
There are various partial hacks, but nothing ready-to-go.
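One of the usual partial hacks, for illustration (a sketch, not any particular library's code): Unicode NFKD normalization splits accented letters into a base letter plus combining marks, so an ASCII encode can drop the marks. But anything without a decomposition just silently disappears, which is part of why it's not ready-to-go on its own.

```python
import unicodedata

def naive_toascii(s):
    # NFKD splits accented letters into base letter + combining mark,
    # then the ASCII encode throws the marks away -- along with any
    # character that has no decomposition at all.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(naive_toascii("El Niño"))    # El Nino  -- accents handled
print(naive_toascii("“quoted”"))   # quoted   -- the curly quotes just vanish
```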
Oh, well, welcome to the programming world. Time to roll my own.
I started from some nice tricks I picked up in the web discussions
I found, and ended up with something reasonably compact.
Of course, the table of fallback characters will grow.
But my ace in the hole, this time, is that my little function has a
way of logging errors. When it sees a character it doesn't recognize,
it can log the character code to a file, making it easy to add a
translation for that character. That was always the problem with
similar hacks I'd attempted to add to mutt or plucker or sitescooper
in the past: figuring out each new character and what its intended
meaning was, so I could add it to the translation table.
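The shape of that approach is easy to sketch (this is an illustration, not the actual ununicode code; the table entries and log format here are made up):

```python
# Map known non-ASCII characters to ASCII stand-ins, and log the
# codepoint of anything unrecognized so it can be added to the
# table later.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "--",   # en and em dashes
    "\u00f1": "n",                   # n with tilde
}

def toascii(s, errfile=None):
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)               # plain ASCII passes through
        elif ch in FALLBACKS:
            out.append(FALLBACKS[ch])    # known character: substitute
        else:
            if errfile:                  # unknown: log it for later
                with open(errfile, "a") as fp:
                    fp.write("Unknown character U+%04X\n" % ord(ch))
            out.append("?")
    return "".join(out)

print(toascii("El Niño \u2014 “weather”"))   # El Nino -- "weather"
```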
Here it is: ununicode.
Call it like this:
import ununicode
ununicode.toascii(str, errfilename="/path/to/errfile")
There's also a minimal test script provided (which will also grow with
time as I accumulate good samples).
Tags: unicode, i18n, charsets, ascii, palm
[ 20:48 Oct 21, 2009 | More programming | permalink to this entry ]