Shallow Thoughts : : programming

Akkana's Musings on Open Source Computing, Science, and Nature.

Wed, 11 Dec 2013

Counting syllables in Python

When I wrote recently about my Dactylic dinosaur doggerel, I glossed over a minor problem with my final poem: the rules of double-dactylic doggerel say that the sixth line (or sometimes the seventh) should be a single double-dactyl word -- something like "paleontologist" or "hexasyllabic'ly". I used "dinosaur orchestra" -- two words, which is cheating.

I don't feel too guilty about that. If you read the post, you may recall that the verse was the result of drifting grumpily through an insomniac morning where I would have preferred to be getting back to sleep. Coming up with anything that scans at all is probably good enough.

Still, it bugged me, not being able to think of a double-dactylic word that related somehow to Parasaurolophus. So I vowed that, later that day when I was up and at the computer, I would attempt to find one and rewrite the poem accordingly.

I thought that would be fairly straightforward. Not so much. I thought there would be some utility I could run that would count syllables for me, then I could run /usr/share/dict/words through it, print out all the 6-syllable words, and find one that fit. Turns out there is no such utility.

But Python has a library for everything, doesn't it?

Some searching turned up PyHyphen, which includes some syllable-counting functions. It apparently uses the hyphenation dictionaries that come with LibreOffice.

There's a Debian package for it, python-pyhyphen -- but it doesn't work. First, it depends on another package, hyphen-en-us, but doesn't have that dependency encoded in the package, even as a suggested or recommended package. But even when you install the hyphenation dictionary, it still doesn't work, because it doesn't look for the dictionary in the place where it was installed. Looks like that problem was reported almost two years ago, bug 627944: python-pyhyphen: doesn't work out-of-the-box with hyphen-* packages. There's a fix there that involves editing two files, /usr/lib/python2.7/dist-packages/hyphen/ and /usr/lib/python2.7/dist-packages/hyphen/

Or you can just give up on Debian and pip install pyhyphen, which is a lot easier.

But once you get it working, you find that it's terrible. It was wrong about almost every word I tried. I hope not too many people are relying on this hyphen-en-us dictionary for important documents. Its results seemed nearly random, and I quickly gave up on it for getting a useful list of words around six syllables.

Just for fun, since my count syllables web search turned up quite a few websites claiming that functionality, I tried entering some of my long test words manually. All of the websites I tried were wrong more than half the time, and often they were off by more than two syllables. I don't mind off-by-ones -- I can look at words claiming 5 and 7 syllables while searching for double dactyls -- but if I have to include 4-syllable words as well, I'll never find what I'm looking for.

That discouraged me from using another Python suggestion I'd seen, the nltk (natural language toolkit) package. I've been looking for an excuse to play with nltk, and some day I will, but for this project I was looking for a quick approximate solution, and the nltk examples I found mostly looked like using it would require a bigger time commitment than I was willing to devote to silly poetry. And if none of the dedicated syllable-counting websites or dictionaries got it right, would a big time investment in nltk pay off?

Anyway, by this time I'd wasted more than an hour poking around various libraries and websites for this silly unimportant problem, and I decided that with that kind of time investment, I could probably do better on my own than the official solutions were giving me. Why not basically just count vowels?

So I whipped up a little script, countsyl, that did just that. I gave it a list of vowels, with a few simple rules. Obviously, you can't just say every vowel is a new syllable -- there are too many double vowels and silent letters and such. But you can't say that any run of multiple vowels together counts as one syllable, because sometimes the vowels do count; and you can't make absolute rules like "'e' at the end of a word is always silent", because sometimes it isn't. So I kept both minimum and maximum syllable counts for each word, and printed both.
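Here's a minimal sketch of that min/max idea. It is not the actual countsyl script: the function name and the two rules it applies (vowel runs, and a possibly silent final 'e') are assumptions chosen for illustration.

```python
import re

def count_syllables(word):
    """Return a rough (minimum, maximum) syllable estimate for a word."""
    word = word.lower()
    # Each run of consecutive vowels (counting 'y' as a vowel).
    runs = re.findall(r"[aeiouy]+", word)
    if not runs:
        return (0, 0)
    # Minimum: every run of vowels is a single syllable.
    minsyl = len(runs)
    # Maximum: every vowel is its own syllable.
    maxsyl = sum(len(run) for run in runs)
    # A lone final 'e' after a consonant may be silent, so lower the minimum.
    if len(runs) > 1 and re.search(r"[^aeiouy]e$", word):
        minsyl -= 1
    return (minsyl, maxsyl)

for w in ("paleontologist", "syllable", "orchestra"):
    print(w, count_syllables(w))
```

Printing both numbers makes it easy to scan a word list for anything whose range includes six, at the cost of some false positives.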

And much to my surprise, without much tuning at all, my silly little script immediately gave much better results than the hyphenation dictionary or the dedicated websites.

Alas, although it did give me quite a few hexasyllabic words in /usr/share/dict/words, none of them were useful at all for a poem on Parasaurolophus. What I really needed was a musical term (since that's what the poem is about). What about a musical dictionary?

I found a list of musical terms on Wikipedia: Glossary of musical terminology, saved it as a local file, ran a few vim substitutes and turned it into a plain list of words. That did a little better, and gave me some possible ideas: (non?)contrapuntally? (something)harmonically? extemporaneously?

But none of them worked out, and by then I'd run out of steam. I gave up and blogged the poem as originally written, with the cheating two-word phrase "dinosaur orchestra", and vowed to write up how to count syllables in Python -- which I have now done. Quite noncontrapuntally, and definitely not extemporaneously. But at least I have a useful little script next time I want to get an approximate syllable count.

[ 16:51 Dec 11, 2013    More programming | permalink to this entry | comments ]

Wed, 28 Aug 2013

Python scripts for Android

Python on Android. Wouldn't that make so many things so much easier?

I've known for a long time about SL4A, but when I read, a year or two ago, that Google officially disclaimed support for languages other than Java and C and didn't want their employees working on projects like SL4A, I decided it wasn't a good bet.

But recently I heard from someone who had just discovered SL4A and its Python support and talked about it like a going thing. I had an Android scripting problem I really wanted to solve, and decided it was time to take another look.

It turns out SL4A and its Python interpreter are still being maintained, and indeed, I was able to solve my problem that way. But the documentation was scanty at best. So here are some shortcuts.

Getting Python running on Android

How do you install it in the first place? Took me three or four tries: it turns out it's extremely picky about the order in which you do things, and the documentation doesn't warn you about that. Follow these steps:

  1. Enable "Unknown Sources" under Application settings if you haven't already.
  2. Download both sl4a_r6.apk and PythonForAndroid_r4.apk
  3. Install sl4a from the apk. Do not install Python yet.
  4. Find SL4A in Applications and run it. It will say "no matches found" (i.e. no scripts) but that's okay: the important thing is that it creates the directory where the scripts will live, /sdcard/sl4a/scripts, without which PythonForAndroid would fail to install.
  5. Install PythonForAndroid from the apk.
  6. Find Python for Android in Applications and run it. Tap Install. This will install the sample scripts, and you'll be ready to go.

Make a shortcut on the home screen:

You've written a script and it does what you want. But to run it, you have to run SL4A, choose the Python interpreter, scroll around to find the script, tap on it, and indicate whether or not you want to see the console. Way too many steps!

Turns out you can make a shortcut on the home screen to an SL4A script (thanks to this tip):

This will give you the familiar twin-snake Python icon on your home screen. There doesn't seem to be any way to change this to a different icon.

Wait, what about UI?

Well, that still seems to be a big hole in the whole SL4A model. You can write great scripts that print to the console. You can even do a few specialized things, like popup menus, messages (what the Python Android module calls makeToast()) and notifications. The sample script is a great illustration of how to use all those features, plus a lot more.

But what if you want to show a window, put a few buttons in it, let the user control things? Nobody seems to have thought about that possibility. I mean, it's not "sorry, we haven't had time to implement this", it isn't even mentioned as something someone would want to do on an Android device. Boggle.

The only possibility I've found is that there is apparently a way to use Android's WebView class from Python. I have not tried this yet; when I do, I'll write it up separately.

WebView may not be the best way to do UI. I've spent many hours tearing my hair out over its limitations even when called from Java. But still, it's something. And one very interesting thing about it is that it provides an easy way to call up an HTML page, either local or remote, from an Android home screen icon. So that may be the best reason yet to check out SL4A.

[ 21:31 Aug 28, 2013    More programming | permalink to this entry | comments ]

Sat, 13 Apr 2013

Parsing NOAA historical weather data

We've been considering the possibility of moving out of the Bay Area to somewhere less crowded, somewhere in the desert southwest we so love to visit. But that also means moving to somewhere with much harsher weather.

How harsh? It's pretty easy to search for a specific location and get average temperatures. But what if I want to make a table to compare several different locations? I couldn't find any site that made that easy.

No problem, I say. Surely there's a Python library, I say. Well, no, as it turns out. There are Python APIs to get the current weather anywhere; but if you want historical weather data, or weather data averaged over many years, you're out of luck.

NOAA purports to have historical climate data, but the only dataset I found was spotty and hard to use. There's an FTP site containing directories by year; inside are gzipped files with names like 723710-03162-2012.op.gz. The first two numbers are station numbers, and there's a file at the top level called ish-history.txt with a list of the station codes and corresponding numbers. Not obvious!

Once you figure out the station codes, the files themselves are easy to parse, with lines like

STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB      WDSP     MXSPD   GUST    MAX     MIN   PRCP   SNDP   FRSHTT
724945 23293  20120101    49.5 24    38.8 24  1021.1 24  1019.5 24    9.9 24    1.5 24    4.1  999.9    68.0    37.0   0.00G 999.9  000000

Each line represents one day (20120101 is January 1st, 2012), and the codes are explained in another file called GSOD_DESC.txt. For instance, MAX is the daily high temperature, and SNDP is snow depth.
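As a rough illustration of how little parsing a line like that needs, here's a sketch that splits the sample line on whitespace and pulls out the date and temperatures. It is not the code from my weather script. The function name is made up, the field positions are taken from the sample above, and stripping a trailing `*` flag from MAX and MIN is an assumption about the data, so check GSOD_DESC.txt before relying on any of it.

```python
from datetime import datetime

SAMPLE = ("724945 23293  20120101    49.5 24    38.8 24  1021.1 24  1019.5 24"
          "    9.9 24    1.5 24    4.1  999.9    68.0    37.0   0.00G 999.9  000000")

def parse_gsod_line(line):
    """Pull a few fields out of one daily line (hypothetical helper)."""
    fields = line.split()
    return {
        "station": fields[0],                                   # STN---
        "date": datetime.strptime(fields[2], "%Y%m%d").date(),  # YEARMODA
        "mean": float(fields[3]),                               # TEMP
        # Assumption: MAX/MIN may carry a trailing '*' flag in some files.
        "max": float(fields[17].rstrip("*")),                   # MAX
        "min": float(fields[18].rstrip("*")),                   # MIN
    }

day = parse_gsod_line(SAMPLE)
print(day["date"], day["max"], day["min"])
```

Splitting on whitespace works here because the observation counts after TEMP, DEWP, SLP, STP, VISIB and WDSP always occupy their own token; a fixed-width parse would be the more careful choice.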

[NOAA historical temp program] So all I needed was to decode the station names, download the right files and parse them. That took about a day to write (including a lot of time wasted futzing with mysterious incantations for matplotlib).

Little accessibility refresher: I showed it to Dave -- "Neat, look at this, San Jose is the blue pair, Flagstaff is green and Page is red." His reaction: "This makes no sense. They all look the same to me. I have no idea which is which." Oops -- right. Don't use color as your only visual indicator. I knew that, supposedly! So I added markers in different shapes for each site. (I wish somebody would teach that lesson to Google Maps, which uses color as its only indicator on the traffic layer, so it's useless for red-green colorblind people.)

Back to the data -- it turns out NOAA doesn't actually have that much historical data available for download. If you search on most of these locations, you'll find sites that claim to have historical temperatures dating back 50 years or more, sometimes back to the 1800s. But NOAA typically only has files starting at about 2005 or 2006. I don't know where sites are getting this older data, or how reliable it is.

Still, averages since 2006 are interesting to compare. Here's a run of KSJC KFLG KSAF KLAM KCEZ KPGA KCNY. It's striking how moderate California weather is compared to any of these inland sites. No surprise there. What did surprise me was that Los Alamos, despite its high elevation, has more moderate weather than most of the others -- lower highs, higher lows. I was a bit disappointed at how sparse the site list was -- no site in Moab? Really? So I used Canyonlands Field instead.

Anyway, it's fun for a data junkie to play around with, and it prints data on other weather factors, like precipitation and snowpack, although it doesn't plot them yet. The code is on my GitHub scripts page, under Weather.

Anyone found a better source for historical weather information? I'd love to have something that went back far enough to do some climate research, see what sites are getting warmer, colder, or seeing greater or lesser spreads between their extreme temperatures. The NOAA dataset obviously can't do that, so there must be something else that weather researchers use. Data on other countries would be interesting, too. Is there anything that's available to the public?

[ 21:57 Apr 13, 2013    More programming | permalink to this entry | comments ]

Tue, 19 Mar 2013

Letters not used in Python keywords

One of the closing lightning talks at PyCon this year concerned the answers to a list of Python programming puzzles given at some other point during the conference. I hadn't seen the questions (I'm still not sure where they are), but some of the problems looked fun.

One of them was: "What are the letters not used in Python keywords?" I hadn't known about Python's keyword module, which could come in handy some day:

>>> import keyword
>>> keyword.kwlist
['and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'exec', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'not', 'or', 'pass', 'print', 'raise', 'return', 'try', 'while', 'with', 'yield']

So, given the list of keywords, what's the best way to find the list of unique letters?

Any time you want a list of unique anything, you want a set. For instance,

>>> set([1, 2, 3, 2, 2, 4, 5, 1, 5])
set([1, 2, 3, 4, 5])

But first you need a list of letters so you can make a set out of it.

Split the list of words into a list of letters

My first idea was to use list comprehensions. You can split a single word into letters like this:

>>> [ x for x in 'hello' ]
['h', 'e', 'l', 'l', 'o']

It took a bit of fiddling to get the right syntax to apply that to every word in the list:

>>> [[c for c in w] for w in keyword.kwlist]
[['a', 'n', 'd'], ['a', 's'], ['a', 's', 's', 'e', 'r', 't'], ... ]

Update: Dave Foster points out that [list(w) for w in keyword.kwlist] is another way to do it, simpler and cleaner than the double list comprehension.

That's a list of lists, so it needs to be flattened into a single list of letters before we can turn it into a set.

Flatten the list of lists

There are lots of ways to flatten a list of lists. Here are four of them:

[item for sublist in [[c for c in w] for w in keyword.kwlist] for item in sublist]

reduce(lambda x,y: x+y, [[c for c in w] for w in keyword.kwlist])

import itertools
list(itertools.chain.from_iterable([[c for c in w] for w in keyword.kwlist]))

sum([[c for c in w] for w in keyword.kwlist], [])

That last one, using sum(), makes use of the fact that Python uses + for list concatenation -- in other words, that [1, 2, 3] + [4, 5, 6] is [1, 2, 3, 4, 5, 6]. But the first method (item for sublist in) is faster: see Making a flat list out of list of lists in Python on StackOverflow. And another StackOverflow thread has a nice script for plotting speed vs. list size of various flatteners.
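The snippets above are written for Python 2. For anyone following along in Python 3, here's a sketch of the same four flatteners; the only real difference is that reduce has to be imported from functools. The variable names are just for illustration.

```python
import functools
import itertools
import keyword

# A list of lists of letters, one inner list per keyword.
nested = [list(w) for w in keyword.kwlist]

# 1. Nested list comprehension.
flat_comprehension = [item for sublist in nested for item in sublist]

# 2. reduce() with list concatenation (reduce lives in functools in Python 3).
flat_reduce = functools.reduce(lambda x, y: x + y, nested)

# 3. itertools.chain.from_iterable().
flat_chain = list(itertools.chain.from_iterable(nested))

# 4. sum() with an empty list as the starting value.
flat_sum = sum(nested, [])

# All four approaches should produce the same flat list of letters.
print(flat_comprehension == flat_reduce == flat_chain == flat_sum)
```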

A simpler way of making the set

But it turns out none of this list comprehension stuff is needed anyway. set('word') splits words into letters already:

>>> set('bubble')
set(['e', 'b', 'u', 'l'])

Ignore the order -- elements of a set often end up displaying in some strange order. The important thing is that it has all the letters and no repeats.

Now we have an easy way of making a set containing the letters in one word. But how do we apply that to a list of words?

Again I initially tried using list comprehensions, then realized there's an easier way. Given a list of strings, it's trivial to join them into a single string using ''.join(). And that gives us our set of letters within keywords:

>>> set(''.join(keyword.kwlist))
set(['a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'm', 'l', 'o', 'n', 'p', 's', 'r', 'u', 't', 'w', 'y', 'x'])

What letters are not in the set?

Almost done! But the original problem was to find the letters not in keywords. We can do that by subtracting this set from the set of all letters from a to z. How do we get that? The string module will give us the letters as a single string:

>>> import string
>>> string.lowercase
'abcdefghijklmnopqrstuvwxyz'

You could also use a list comprehension and ord and chr (alas, range won't give you a range of letters directly):

>>> [chr(i) for i in range(ord('a'), ord('z')+1)]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

It's a bit longer, but doesn't require an import.

Now that you have your a-z set, just subtract the two sets:

>>> set(string.lowercase[:]) - set(''.join(keyword.kwlist))
set(['q', 'j', 'z', 'v'])

So the only letters not used in Python keywords are q, j, z and v.
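Everything above was run under Python 2.7. A rough Python 3 equivalent might look like the sketch below. string.lowercase is gone in Python 3, so it uses string.ascii_lowercase, and it lowercases the keywords because Python 3 adds capitalized ones like True and None. The helper name is made up for illustration, and the keyword list itself varies between Python versions.

```python
import keyword
import string

def unused_letters(words):
    """Return the sorted lowercase ASCII letters that appear in none of the words."""
    used = set("".join(words).lower())
    return sorted(set(string.ascii_lowercase) - used)

print(unused_letters(keyword.kwlist))
```

Since it takes any list of words, the same helper works on /usr/share/dict/words or any other word list.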

Just a useless little ditty, really ... but I thought it was a fun exercise, so maybe you will too.

[ 12:36 Mar 19, 2013    More programming | permalink to this entry | comments ]

Thu, 21 Feb 2013

New project: Metapho image tagger

I'm excited about my new project: MetaPho, an image tagger.

It arose out of a discussion on the LinuxChix Techtalk list: photo collection management software. John Sturdy was looking for an efficient way of viewing and tagging large collections of photos. Like me, he likes fast, lightweight, keyboard-driven programs. And like me, he didn't want a database-driven system that ties you forever to one image cataloging program. I put my image tags in plaintext files, named Keywords, so that I can easily write scripts to search or modify them, or use grep, and I can even make quick changes with a text editor.

I shared some tips on how I use my Pho image viewer for tagging images, and it sounded close to what he was looking for. But as we discussed ideas about image tagging, we realized that there were things he wanted to do that pho doesn't do well, things not offered by any other image tagger we've been able to find. While discussing how we might add new tagging functionality to pho, I increasingly had the feeling that I was trying to fit off-road tires onto a Miata -- or insert your own favorite metaphor for "making something do something it wasn't designed to do."

Pho is a great image viewer, but the more I patched it to handle tagging, the uglier and more complicated the code got, and it also got more complex to use.

[metapho screenshot] And really, everything we needed for tagging could be easily done in a Python-GTK application. (Pho is written in C because it does a lot of complicated focus management to deal with how window managers handle window moving and resizing. A tagger wouldn't need any of that.)

I whipped up a demo image viewer in a few hours and showed it to John. We continued the discussion, I made a GitHub repo, and over the next week or so the code grew into an efficient and already surprisingly usable image tagger.

We have big plans for it, like tags organized into categories so we can have lots of tags without cluttering the interface too much. But really, even as it is, it's better than anything I've used before. I've been scanning in lots of photos from old family albums (like this one of my mother and grandmother, and me at 9 months) and it's been great to be able to add and review tags easily.

If you want to check out MetaPho, or contribute to it (either code or user interface design), it lives in my MetaPho repository on GitHub. And I wrote up a quick man page in markdown format:

Feedback and contributors welcome!

[ 18:31 Feb 21, 2013    More programming | permalink to this entry | comments ]

Sat, 08 Dec 2012

Decoding RFC 2047 email headers (like spam Subjects in other charsets)

Having not had much luck with spam filtering solutions like SpamAssassin, I'm forever having to add new spam filters by hand. For instance, after about the sixth time I get "President Waives Refi Requirement" or "Melt your fat! MUST WATCH this video now!" within a couple of hours, I'm pretty tired of it and don't want to see any more of them.

With mail filtering programs like procmail or maildrop, it's easy enough to match a pattern like "Subject:.*Refi Requirement" or "Subject:.*Melt your fat" and filter that message to a spam folder (or /dev/null).

But increasingly, I add patterns I'm seeing in spam messages, and yet the messages with those patterns keep coming in. Why? Because the spammers are using RFC 2047 to encode the subject into some other character set.

Here's how it works. A spammer sends a subject line that looks something like this:

Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=

Mail programs are smart enough to decode this into:

Subject: Stop Overpaying for Printer Ink

but spam filtering programs often aren't, so your "printer ink" filter won't catch it. And if you look through your spam folder with tools like grep to see why it didn't get caught, or to find particularly spammy subjects that might call for a filter (grep Subject spamfolder | sort is pretty handy), these encoded subjects will be incognito.

I briefly tried setting up a filter that spam-filed anything with =? in the Subject line. But that's way too broad a brush -- there are legitimate reasons for using other charsets even in English-language email. It's relatively rare, but it happens. And some bots, notably the Adafruit forum notification bot and the bot that sends out announcements from my alma mater, unaccountably encode the charset even when they're sending mail entirely in US ASCII.

So what's really needed is not to filter out all messages that specify a charset, but to decode the Subject so the spam filter can see it and filter it accordingly.

How? I couldn't find any ready-made tool available for Linux that could decode RFC 2047 headers; but the Python email package makes decoding a one-line task. In the Python interpreter:

$ python
Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> email.Header.decode_header("Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=")
[('Subject:', None), ('Stop Overpaying for Printer Ink', 'utf-8')]

So it's easy to write a script that can pull headers out of email messages (files) and decode them. Just look for the line starting with the header you want to match -- e.g. "Subject:" -- and pass that line to email.Header.decode_header().
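The session above is Python 2.7. In Python 3 the module is spelled email.header (lowercase), and decode_header() hands back bytes plus a charset name, so the sketch below uses make_header() to turn the pieces into one string. The helper name is an assumption for illustration, not something from my script.

```python
from email.header import decode_header, make_header

def decode_header_value(value):
    """Decode an RFC 2047 encoded header value into a plain string."""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() reassembles them into a Header that str() can render.
    return str(make_header(decode_header(value)))

encoded = "=?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?="
print(decode_header_value(encoded))
```

Here the header name is stripped off before decoding, so only the encoded value goes through make_header().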

Only one snag. If the subject is longer than about 20 characters, spammers will often opt to split it up into multiple groups, sometimes even in different character sets. So for example, you might see something like this, spread over multiple lines:

Subject: =?windows-1252?Q?Earn_your_degree_=97_on_your_time?=

The script has to handle that too. If it's reading a header, it has to check the next line, and if that line begins with whitespace, treat it as more of the header.
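Here's a sketch of that continuation-line logic, again in Python 3 and not copied from the real script. The function names are hypothetical. It collects the header line plus any following lines that start with whitespace, joins them, and then decodes the result.

```python
from email.header import decode_header, make_header

def read_header(lines, name="Subject"):
    """Return the raw value of a header, including continuation lines."""
    prefix = name.lower() + ":"
    collected = None
    for line in lines:
        line = line.rstrip("\r\n")
        if collected is not None:
            # A line starting with whitespace continues the header.
            if line[:1] in (" ", "\t"):
                collected.append(line.strip())
                continue
            break
        if line.lower().startswith(prefix):
            collected = [line[len(prefix):].strip()]
        elif line == "":
            break  # A blank line ends the headers: nothing found.
    return " ".join(collected) if collected else None

def decode_folded(lines, name="Subject"):
    """Read a possibly multi-line header and decode any RFC 2047 parts."""
    raw = read_header(lines, name)
    return None if raw is None else str(make_header(decode_header(raw)))

message = [
    "From: someone@example.com\n",
    "Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5n?=\n",
    " =?utf-8?B?IGZvciBQcmludGVyIEluaw==?=\n",
    "To: me@example.com\n",
]
print(decode_folded(message))
```

Passing an open file object works too, since it only needs something that yields lines.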

The resulting script, (on github), seems pretty handy and should be able to be plugged in to a mail filtering program.

[ 20:45 Dec 08, 2012    More programming | permalink to this entry | comments ]

Wed, 17 Oct 2012

Asynchronous sound playing in Python

A little while back I wrote about my Python xchat script to play sound alerts.

But one thing that's been annoying me about it -- it was a problem with the old perl alert script too -- is the repeated sounds. If lots of twitter updates come in on the Bitlbee channel, or if someone pastes numerous lines into a channel, I hear POPPOPPOPPOPPOPPOP or repetitions of whatever the alert sound is for that type of message. It's annoying to me, but even more so to anyone else in the same room.

It would be so much nicer if I could have it play just one repetition of any given alert, even if there are eight lines all coming in at the same time. So I decided to write a Python class to handle that.

My existing code used subprocesses to call the basic ALSA sound player, /usr/bin/aplay -q. Initially I used

if not os.fork() : os.execl(APLAY, APLAY, "-q", alertfile)

but I later switched to the cleaner[APLAY, '-q', alertfile])

But of course, it would be better to do it all from Python without requiring an external process like aplay. So I looked into that first.

Sadly, it turns out Python audio support is a mess. The built-in libraries are fairly limited in functionality and formats, and the external libraries that handle sound are mostly unmaintained, unless you want to pull in a larger library like pygame. After a little web searching I decided that maybe an aplay subprocess wasn't so bad after all.

Okay, so how should I handle the subprocesses? I decided the best way was to keep track of what sound was currently playing. If another alert fires for the same sound while that sound is already playing, just ignore it. If an alert comes in for a different sound, then wait() for the current sound to finish, then start the new sound.

That's all quite easy with Python's subprocess module. subprocess.Popen() returns a Popen object that tracks a process ID and can check whether that process has finished or not. If self.curpath is the path to the sound currently playing and self.current is the Popen object for whatever aplay process is currently running, then:

    if self.current :
        if self.current.poll() is None :
            # Current process hasn't finished yet. Is this the same sound?
            if path == self.curpath :
                # A repeat of the currently playing sound.
                # Don't play it more than once.
                return
            else :
                # Trying to play a different sound.
                # Wait on the current sound then play the new one.
                self.current.wait()

    self.curpath = path
    self.current = subprocess.Popen([ "/usr/bin/aplay", '-q', path ] )

Finally, it's a good idea when exiting the program to check whether any aplay process is running, and wait() for it. Otherwise, you might end up with a zombie aplay process.

    def __del__(self) :
        if self.current : self.current.wait()

I don't know if xchat actually closes down Python objects gracefully, so I don't know whether the __del__ destructor will actually be called. But at least I tried. It's possible that a context manager might be more reliable.
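For what it's worth, here's a sketch of how the pieces might fit together with the context manager idea. This is an illustration in Python 3, not the SoundPlayer class from the repository: the command parameter, the True/False return value from play() and the wait() helper are all assumptions added to make the example self-contained.

```python
import subprocess

class SoundPlayer:
    """Play sounds through an external player, skipping repeats (sketch)."""

    def __init__(self, command=("/usr/bin/aplay", "-q")):
        # Assumption: the player command is configurable; default is aplay -q.
        self.command = list(command)
        self.curpath = None
        self.current = None

    def play(self, path):
        """Start playing path. Return False if it is already playing."""
        if self.current and self.current.poll() is None:
            if path == self.curpath:
                # A repeat of the currently playing sound: ignore it.
                return False
            # A different sound: let the current one finish first.
            self.current.wait()
        self.curpath = path
        self.current = subprocess.Popen(self.command + [path])
        return True

    def wait(self):
        """Wait for any running player process, so no zombie is left behind."""
        if self.current:
            self.current.wait()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.wait()
        return False

# Hypothetical usage (needs aplay and a real sound file):
# with SoundPlayer() as player:
#     player.play("alert.wav")
```

The advantage over __del__ is that leaving the with block always waits for the last process, whether or not the interpreter cleans up objects on exit.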

The full scripts are on github at for the basic SoundPlayer class, and for the xchat script that includes SoundPlayer.

[ 12:07 Oct 17, 2012    More programming | permalink to this entry | comments ]

Wed, 26 Sep 2012

Writing xchat scripts in Python (to play sound alerts)

I use xchat as my IRC client. Mostly I like it, but its sound alerts aren't quite as configurable as I'd like. I have a few channels, like my Bitlbee Twitter feed, where I want a much more subtle alert, or no alert at all. And I want an easy way of turning sounds on and off, in case I get busy with something and need to minimize distractions.

Years ago I grabbed a perl xchat plug-in called "Smet's NickSound" that did something close to what I wanted. I've hacked a few things into it. But every time I try to customize it any further, I'm hit with the pain of write-only Perl. I've written Perl scripts, honest. But I always have a really hard time reading anyone else's Perl code and figuring out what it's doing. When I dove in again recently to try to figure out why I was getting so many alerts when first starting up xchat, I finally decided: learning how to write a Python xchat script couldn't be any harder than reverse engineering a Perl one.

First, of course, I looked for an existing nick sound Python script ... and totally struck out. In fact, mostly I struck out on finding any xchat Python scripts at all. I know there are Python bindings for xchat, because there's documentation for them. But sample plug-ins? Nope. For some reason, nobody's writing xchat plug-ins in Python.

I eventually found two minimal examples: this very simple example and the more elaborate utf8decoder. I was able to put them together and cobble up a working nick sound plug-in. It's easy once you have an example to work from to help you figure out the event hook arguments.

So here's my own little example, which may help the next person trying to learn xchat Python scripting: on github.

[ 21:13 Sep 26, 2012    More programming | permalink to this entry | comments ]
