Shallow Thoughts : tags : i18n
Akkana's Musings on Open Source Computing and Technology, Science, and Nature.
Wed, 25 Nov 2009
Continuing the discussion of those funny characters you sometimes
see in email or on web pages, today's Linux Planet article
discusses how to convert and handle encoding errors, using
Python or the command-line tool recode:
Mastering
Characters Sets in Linux (Weird Characters, part 2).
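As a rough illustration of the kind of error handling the article is about (this sketch is mine, not taken from the article, and the sample bytes are made up), Python's decode method accepts an errors argument that controls what happens to bytes that don't fit the expected encoding:

```python
# Bytes for "El Niño" in ISO-8859-1; a lone 0xF1 is not valid UTF-8.
raw = b"El Ni\xf1o"

# Strict decoding (the default) raises on the bad byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD; errors="ignore" drops the byte.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

# Decoding with the charset the bytes were really written in recovers the text.
correct = raw.decode("iso-8859-1")

print(strict_ok, replaced, ignored, correct)
```

Which handler is right depends on whether you would rather see a marker for the damage or silently lose the character.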
Tags: writing, linux, unicode, i18n, charsets, ascii, programming, python
[
15:06 Nov 25, 2009
More writing |
permalink to this entry |
]
Thu, 12 Nov 2009
or: Why do I See All Those Weird Characters?
Today's Linux Planet article concerns those funny characters you sometimes
see in email or on web pages, like when somebody puts
“random squiggles’ around a phrase
when they probably meant “double quotes”:
Character
Sets in Linux or: Why do I See Those Weird Characters?.
Today's article covers only what users need to know.
A followup article will discuss character encoding
from a programmer's point of view.
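For the curious, the effect is easy to reproduce. This is a minimal sketch, assuming the common case where text written as UTF-8 gets read as Windows-1252; the helper name misread is made up for illustration:

```python
def misread(text, actual="utf-8", assumed="cp1252"):
    """Encode text as `actual`, then decode the bytes as if they were `assumed`."""
    # errors="replace" because a few byte values have no meaning in cp1252.
    return text.encode(actual).decode(assumed, errors="replace")

# Curly double quotes: each one is three bytes in UTF-8, so each
# turns into three unrelated characters when misread.
print(misread("\u201cdouble quotes\u201d"))

# An n-tilde is two bytes in UTF-8, so it turns into two characters.
print(misread("El Ni\u00f1o"))
```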
Tags: writing, linux, unicode, i18n, charsets, ascii
[
16:34 Nov 12, 2009
More writing |
permalink to this entry |
]
Wed, 21 Oct 2009
It's not that I'm a dumb provincial American, really!
I mean, okay, I am a dumb provincial American. But not completely.
I know about Unicode, I know what UTF-8 and ISO-8859-1 and -15 are,
I even know how to type Spanish characters like ñ and á
in email (at least in Ubuntu; I can't seem to make it work in Gentoo).
The real problem is PalmOS --
I've never found any way to create Plucker files for
my Palm that display anything beyond the standard ASCII character set.
(I'm not clear whether to blame that on Palm or Plucker. Doesn't matter.)
So when I use a program like Sitescooper or my new
FeedMe RSS reader
to read
daily news on my Palm, I'm forever seeing lines like this:
the weather phenomenon known as ÅoEl Ni€oÅq is
It's tiresome to try to read stuff like that.
Strangely, I've found no library, in any language, that cleans this up.
There are lots of ways to translate from one character encoding into
another -- but no way to degrade from non-ASCII characters to the
nearest ASCII equivalent. Googling finds lots of people asking
for one -- I'm far from the only one who wants this.
There are various partial hacks, but nothing ready-to-go.
Oh, well, welcome to the programming world. Time to roll my own.
I started from some nice tricks I picked up in the web discussions
I found, and ended up with something reasonably compact.
Of course, the table of fallback characters will grow.
But my ace in the hole, this time, is that my little function has a
way of logging errors. When it sees a character it doesn't recognize,
it can log the character code to a file, making it easy to add a
translation for that character. That was always the problem with
similar hacks I'd attempted to add to mutt or plucker or sitescooper
in the past: figuring out each new character and what its intended
meaning was, so I could add it to the translation table.
Here it is: ununicode.
Call it like this:
import os
import ununicode
# text is the string to convert; unrecognized characters are logged to errfilename
ununicode.toascii(text, errfilename=os.path.join("/path/to/errfile"))
There's also a minimal test script provided (which will also grow with
time as I accumulate good samples).
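To give a flavor of the approach without reproducing ununicode itself, here is a minimal sketch of the same idea. It is not the actual module: the function name, the fallback table entries and the log format are all assumptions, and it leans on unicodedata's NFKD decomposition to strip accents.

```python
import unicodedata

# Hand-maintained fallbacks for characters with no ASCII decomposition.
# (Illustrative entries only; a real table would grow over time.)
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "--",   # en and em dashes
    "\u2026": "...",                 # ellipsis
}

def to_ascii_sketch(text, errfilename=None):
    """Degrade text to ASCII, logging unknown characters to errfilename."""
    out = []
    unknown = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in FALLBACKS:
            out.append(FALLBACKS[ch])
        else:
            # NFKD splits an n-tilde into "n" plus a combining tilde;
            # keep only the ASCII part.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = decomposed.encode("ascii", "ignore").decode("ascii")
            if base:
                out.append(base)
            else:
                out.append("?")
                unknown.append(ch)
    if errfilename and unknown:
        # Append the code points so they can be added to FALLBACKS later.
        with open(errfilename, "a", encoding="utf-8") as errfile:
            for ch in unknown:
                errfile.write("U+%04X %s\n" % (ord(ch), ch))
    return "".join(out)

print(to_ascii_sketch("known as \u201cEl Ni\u00f1o\u201d"))
```

The error log is the part that matters: every character that falls through to "?" leaves its code point behind, so the table can be extended one entry at a time.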
Tags: unicode, i18n, charsets, ascii, palm
[
20:48 Oct 21, 2009
More programming |
permalink to this entry |
]
Fri, 02 May 2008
This has been a good week for fonts: two longstanding mysteries solved.
The first concerns the bitstream vera sans mono I've been using
as a terminal font in apps like rxvt and xterm. I'd been specifying it in
~/.Xdefaults like this:
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-*-*-iso10646-1
The mystery is that I'd noticed that in xterm, the font looked
slightly different -- slightly uglier -- than in rxvt (both apps
use the same X class name of XTerm). It was hard to put my finger on
what was different -- the shape of all the letters looked the same,
but it just seemed a little more ragged, and a little less compact,
in xterm. I figured it was just a minor difference in their drawing
code, or something.
Well, I was fiddling with fonts (trying to get the new-to-me
"Inconsolata" font working) and I noticed that iso10646 bit.
I didn't know what 10646 was, but shouldn't it be 8859-1 or 8859-15,
the codes for the Latin-1 alphabet? After finishing up my Inconsolata
experiments, when I set the font back to Vera I changed the line to
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-*-*-iso8859-15
and moved on to other things.
Until the next morning, when I booted up to a surprise: my main
terminal window no longer fit on the screen. It seems it had reverted
to the other (uglier) version of Vera Sans Mono, which is also very
slightly taller, so instead of being a couple of lines shorter than
the screen height, it was a couple of lines too tall to fit.
I checked .Xdefaults -- yes, it was still Vera. What was going on?
I finally remembered the one thing I had changed:
the language setting on the font, from 10646-1 to 8859-15. I changed
it back: sure enough, now the font was pretty again and the terminal
was short enough to fit.
I fired up xfontsel and did some experimenting. It turned out the
difference between the two almost-identical Vera sans mono bold roman
fonts is a field xfontsel calls "spc". It can be either 'c' or 'm'.
The 'c' version is the pretty, compact font; the 'm' is the uglier,
taller one. For some reason, specifying 10646-1 makes "spc" default
to 'c', while 8859-15 makes it default to 'm'. But specifying 'c'
in the font specifier gets the good version regardless of which
language is specified.
So this would work:
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-c-*-*-*
But then I read up on 10646-1 and it turns out to mean "the
whole unicode character set". That sounds like a good idea,
so I kept it in my font specifier after all:
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-c-*-iso10646-1
(For the moment I still didn't know what spc, c or m meant;
read on if you're curious.)
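It's easy to lose count of the asterisks in those specifiers, so here is a small sketch that splits one into named fields. The field names are my own labels for the 14 positions in an X font name (xfontsel abbreviates them differently), and the naive split assumes no hyphens inside the family name:

```python
# Labels for the 14 positions in an X font name, in order.
XLFD_FIELDS = [
    "foundry", "family", "weight", "slant", "setwidth", "addstyle",
    "pixelsize", "pointsize", "resx", "resy", "spacing", "avgwidth",
    "registry", "encoding",
]

def parse_xlfd(name):
    """Split an X font name into a dict keyed by field label."""
    parts = name.split("-")[1:]      # drop the empty string before the leading "-"
    if len(parts) != len(XLFD_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(XLFD_FIELDS), len(parts)))
    return dict(zip(XLFD_FIELDS, parts))

font = parse_xlfd(
    "-bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-c-*-iso10646-1"
)
print(font["spacing"], font["registry"], font["encoding"])
```

Seen this way, the 'c' sits in the eleventh position, and the iso10646-1 at the end is really two fields.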
The second insight concerned a longstanding mystery of Dave's.
He has been complaining for quite a while about the way
Ubuntu's modern pango-based apps all refuse to see bitmapped fonts.
(It bothered me too, but less so, because the terminal and editor
apps I use can see X fonts.)
Dave has an Ubuntu install on one machine that he's been upgrading
release after release, which does see his bitmapped fonts.
But any fresh Ubuntu installation fails to see the fonts.
What was the difference?
We knew about the trick of going into /etc/fonts/conf.d,
removing the symbolic link 70-no-bitmaps.conf and replacing it
with a link to /etc/fonts/conf.avail/70-yes-bitmaps.conf ...
But doing that doesn't actually change anything, and bitmap
fonts still don't show up.
The secret turned out to be that you need to run
fc-cache -fv
after changing the fonts/conf.d links. This apparently never
happens on its own -- not on a reboot, not on installing or
uninstalling font packages. Somehow it had happened once on Dave's
good install, and that's why it worked there but nowhere else.
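Put together, the whole procedure can be sketched as one function. This is only an illustration: it assumes the link being removed is 70-no-bitmaps.conf (as on stock Ubuntu installs of that era), the function name is made up, and on a real system it would need to run as root.

```python
import os
import subprocess

def enable_bitmap_fonts(conf_d="/etc/fonts/conf.d",
                        conf_avail="/etc/fonts/conf.avail",
                        rebuild_cache=True):
    """Swap the no-bitmaps link for yes-bitmaps, then rebuild the font cache."""
    no_link = os.path.join(conf_d, "70-no-bitmaps.conf")
    yes_link = os.path.join(conf_d, "70-yes-bitmaps.conf")
    yes_target = os.path.join(conf_avail, "70-yes-bitmaps.conf")

    # Remove the link that disables bitmap fonts, if it is there.
    if os.path.lexists(no_link):
        os.remove(no_link)
    # Add the link that enables them, unless it already exists.
    if not os.path.lexists(yes_link):
        os.symlink(yes_target, yes_link)

    if rebuild_cache:
        # The easy-to-miss step: nothing changes until the cache is rebuilt.
        subprocess.run(["fc-cache", "-fv"], check=True)
```

Passing rebuild_cache=False and scratch directories makes it possible to try the link handling without touching /etc/fonts.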
I'm not sure how anyone is supposed to find out about fc-cache --
there's no man fontconfig,
and the /etc/fonts/conf.avail/README offers no clue,
just misleadingly says "Fontconfig scans this directory".
man fc-cache
mentions /usr/share/doc/fontconfig/fontconfig-user.html,
which doesn't exist; it turns out on Ubuntu it's actually
/usr/share/doc/fontconfig-config/fontconfig-user.html.
But wait, that's just an html-ized manual page for fonts-conf,
so actually you could just run man fonts-conf ...
your guess is as good as mine why the fc-cache man page sends
you on a hunt for html files instead.
man fonts-conf
is good reading -- it even solves the
mystery of that spc parameter. It stands for spacing
and can be proportional, dual-width, monospace or charcell.
Aha! And there's lots more useful-looking information in that
manual page as well.
Tags: linux, fonts, i18n, mysteries
[
15:58 May 02, 2008
More linux |
permalink to this entry |
]
Sat, 01 Dec 2007
With what I learned
last week,
I've been able to type accented characters into GTK apps such as xchat,
and a few other apps such as emacs.
That's nice -- but I was still having trouble reading accented
characters in mutt, or writing them in vim to send through mutt
(darn terminal apps).
The biggest problem was the terminal. I was using urxvt,
but it turns out that urxvt won't let me type any non-ASCII characters.
It just ignores my multi-key sequences, or prints a space instead
of the character I wanted.
I have no idea why, but switching to plain ol' xterm solved that problem.
Of course, I had to make sure that I was using a font that supported
the characters I wanted (ISO 8859-1 or 8859-15 or something similar),
which leaves out my favorite terminal font (Schumacher Clean bold),
but Bitstream Vera Sans Mono bold is almost as readable.
Of course, it's important to have your locale variables set
appropriately. There are several locale variables:
- LC_CTYPE: Which encodings to use for typing and displaying characters.
- LC_MESSAGES: Which translations to use, in programs that offer them.
- LC_COLLATE: How to sort alphabetically (this one also affects whether ls
  groups capitalized filenames first).
- LC_ALL: Overrides any of the others.
- LANG: The default, in case none of the other variables is set.
There are a few others which control very specific features like
time, numbers, money, addresses and paper size: type locale
to see all of them.
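The precedence among those variables can be written down in a few lines. This is a sketch of the lookup order only (LC_ALL first, then the specific variable, then LANG), not of how any particular program reads its locale; the function name is made up:

```python
import os

def effective_locale(category, env=None):
    """Return the value that applies to one LC_* category.

    LC_ALL wins, then the specific variable, then LANG;
    with nothing set, the "C" locale applies.
    """
    if env is None:
        env = os.environ
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:               # unset and empty are treated the same
            return value
    return "C"

example_env = {"LANG": "C", "LC_CTYPE": "en_US.UTF-8"}
print(effective_locale("LC_CTYPE", example_env))
print(effective_locale("LC_COLLATE", example_env))
```

With that example environment, character handling is UTF-8 while sorting falls back to LANG.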
Once I switched to xterm, I was able to set either LANG or LC_CTYPE to
either en_US.UTF-8 or en_US.ISO-8859-1.
I set LC_COLLATE and LANG or LC_MESSAGES to C, so that I get the
default (usually US) translations for programs and so that ls groups
all the capitalized files first.
Along the way, I learned about yet another
way to type accented characters.
setxkbmap -model pc104 -layout us -variant intl
switches to an international layout, at which point typing certain
punctuation (like ' or ~) is assumed to be a prefix key. So instead
of typing [Multi] ~ n, I can just type ~ n. The catch: it makes it
harder to type quotes or tildes by themselves (you have to type a
space after the quote or tilde).
Even faster, the international layout also offers shortcuts to many
common characters with the "AltGr" key, which I'd heard about
for years but never knew how to enable. AltGr is the right alt
key, and typing, say, n while holding AltGr gives an ñ.
You can see a full map at
Wikipedia
(AltGr characters are blue, quote prefixes are red).
To get back to a US non-international layout:
setxkbmap -model pc104 -layout us
Of course, these aren't the only keyboard layouts to choose from --
there are lots, plus you can define your own. And I was going to
write a little bit about that, except it turns out they've changed
it all around again since I last did that two years ago (don't you
love the digital world?). So that will have to wait for another time.
But the place to start exploring is /usr/share/X11/xkb.
The file symbols/us contains the definitions for those US
keyboards, and I believe it's included via the files in the
rules directory, probably rules/base, base.xml and base.lst.
From there you're on your own. But the standard layouts probably
follow the ones in the Wikipedia article on
keyboard layouts.
Tags: linux, i18n, keyboard
[
16:48 Dec 01, 2007
More linux |
permalink to this entry |
]
Thu, 22 Nov 2007
Happy Thanksgiving, everyone! Today's holiday tip involves
how to type international characters.
For the online Spanish class I've been taking, so far I've been
able to manage without having to type characters like
ñ or á. Usually, if I need one I can find it in one of
the class examples, copy it, and paste it wherever I need it. But
obviously that would be tedious if I needed to type much.
I hacked up a quickie workaround:
a python
script that shows a set of buttons, one for each accented
character I'm likely to need. Clicking a button copies that character
to the clipboard, so I can now paste via mouse middleclick or ctrl-V.
(I'm sure that sounds pathetic to those of you who type accented
characters every day, but it's not something most US English speakers
need to do. And besides, now I know how to access the X clipboard
from Python-GTK -- hooray for learning new things from procrastination
projects!)
Anyway, Mikael Magnusson took pity on me and explained in simple
language how to use the X "Multi key" to type these characters the
right way (well, a right way, anyway). Since all the online
instructions I've seen have been rather complicated, here are the
simple instructions for any of my fellow US monolingists who'd
like to expand their horizons:
First, choose a key for the "Multi key" that you're not using for
anything else. A lot of people use one of the Alt or Windows keys,
but I use both of those already. What I don't use is the Menu key
(that little key down by the right Ctrl key, at least on my keyboard)
since not many Linux apps support it anyway.
Find the keycode for that key by firing up xev and
typing the key. For my Menu key, the keycode is 117.
Now type:
xmodmap -e "keycode 117 = Multi_key"
Now you're ready to type a sequence like:
[Menu] ~ n
to type an n-tilde,
[Menu] ' a
for an accented a, or
[Menu] ? ?
for the upside-down question mark,
in any app that supports those characters.
Of course, you don't want to type that xmodmap command every time you
log in, so to make it permanent, put this in your .Xmodmap (you're on
your own for figuring out whether your X environment reads .Xmodmap
automatically or whether you need to tell it to run
xmodmap .Xmodmap
when X starts up):
keycode 117 = Multi_key
I have one final useful international input tidbit to offer:
how to type Unicode characters by number.
Hold ctrl+shift+U, then release U but keep holding the
other two while you type the character's hexadecimal code. (This may only
work in gtk apps.) For instance, try this: hold down ctrl and shift, then
type: u 2 6 6 c. Cool, huh?
You can use the "gucharmap" program to find other
neat sequences (hint: View->By Unicode Block otherwise
you'll never find anything).
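The digits typed after ctrl+shift+U are a Unicode code point written in hexadecimal. A quick sketch in Python shows what a given sequence will produce before you try it:

```python
import unicodedata

code = "266c"                   # the digits typed after ctrl+shift+u
char = chr(int(code, 16))       # interpret them as a hexadecimal code point
print(char, unicodedata.name(char))
```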
Now it's time to check the turkey. Have a good day, everyone!
Tags: linux, i18n, keyboard
[
17:03 Nov 22, 2007
More linux |
permalink to this entry |
]