Shallow Thoughts : tags : language

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Wed, 11 Dec 2013

Counting syllables in Python

When I wrote recently about my Dactylic dinosaur doggerel, I glossed over a minor problem with my final poem: the rules of double-dactylic doggerel say that the sixth line (or sometimes the seventh) should be a single double-dactyl word -- something like "paleontologist" or "hexasyllabic'ly". I used "dinosaur orchestra" -- two words, which is cheating.

I don't feel too guilty about that. If you read the post, you may recall that the verse was the result of drifting grumpily through an insomniac morning where I would have preferred to be getting back to sleep. Coming up with anything that scans at all is probably good enough.

Still, it bugged me, not being able to think of a double-dactylic word that related somehow to Parasaurolophus. So I vowed that, later that day when I was up and at the computer, I would attempt to find one and rewrite the poem accordingly.

I thought that would be fairly straightforward. Not so much. I thought there would be some utility I could run that would count syllables for me, then I could run /usr/share/dict/words through it, print out all the 6-syllable words, and find one that fit. Turns out there is no such utility.

But Python has a library for everything, doesn't it?

Some searching turned up PyHyphen, which includes some syllable-counting functions. It apparently uses the hyphenation dictionaries that come with LibreOffice.

There's a Debian package for it, python-pyhyphen -- but it doesn't work. First, it depends on another package, hyphen-en-us, but doesn't declare that dependency, even as a suggested or recommended package. And even when you install the hyphenation dictionary, it still doesn't work, because it doesn't look for the dictionary in the place where it was installed. Looks like that problem was reported almost two years ago, bug 627944: python-pyhyphen: doesn't work out-of-the-box with hyphen-* packages. There's a fix there that involves editing two files, /usr/lib/python2.7/dist-packages/hyphen/ and /usr/lib/python2.7/dist-packages/hyphen/

Or you can just give up on Debian and pip install pyhyphen, which is a lot easier.

But once you get it working, you find that it's terrible. It was wrong about almost every word I tried. I hope not too many people are relying on this hyphen-en-us dictionary for important documents. Its results seemed nearly random, and I quickly gave up on it for getting a useful list of words around six syllables.

Just for fun, since my web search for "count syllables" turned up quite a few websites claiming that functionality, I tried entering some of my long test words manually. All of the websites I tried were wrong more than half the time, and often they were off by more than two syllables. I don't mind off-by-ones -- I can look at words claiming 5 and 7 syllables while searching for double dactyls -- but if I have to include 4-syllable words as well, I'll never find what I'm looking for.

That discouraged me from using another Python suggestion I'd seen, the nltk (natural language toolkit) package. I've been looking for an excuse to play with nltk, and some day I will, but for this project I was looking for a quick approximate solution, and the nltk examples I found mostly looked like using it would require a bigger time commitment than I was willing to devote to silly poetry. And if none of the dedicated syllable-counting websites or dictionaries got it right, would a big time investment in nltk pay off?

Anyway, by this time I'd wasted more than an hour poking around various libraries and websites for this silly unimportant problem, and I decided that with that kind of time investment, I could probably do better on my own than the official solutions were giving me. Why not basically just count vowels?

So I whipped up a little script, countsyl, that did just that. I gave it a list of vowels, with a few simple rules. Obviously, you can't just say every vowel is a new syllable -- there are too many double vowels and silent letters and such. But you can't say that any run of multiple vowels together counts as one syllable, because sometimes the vowels do count; and you can't make absolute rules like "'e' at the end of a word is always silent", because sometimes it isn't. So I kept both minimum and maximum syllable counts for each word, and printed both.
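Here's a minimal sketch of that idea. It is not the actual countsyl script, just an illustration under my own assumptions: the function names, the choice to treat "y" as a vowel, and the particular silent-e rule are all hypothetical. The minimum treats each run of adjacent vowels as one syllable (and lets a trailing "e" be silent); the maximum counts every vowel separately.

```python
import re

VOWELS = "aeiouy"   # assumption: treat 'y' as a vowel


def count_syllables(word):
    """Return a (minimum, maximum) estimate of the syllables in word."""
    word = word.lower()
    runs = re.findall("[" + VOWELS + "]+", word)
    if not runs:
        return (0, 0)

    # Minimum: every run of adjacent vowels collapses into one syllable.
    min_syl = len(runs)
    # Maximum: every single vowel is pronounced separately.
    max_syl = sum(len(run) for run in runs)

    # A final 'e' after a consonant is often (not always) silent,
    # so it lowers only the minimum, never the maximum.
    if (len(word) > 2 and word.endswith("e")
            and word[-2] not in VOWELS and min_syl > 1):
        min_syl -= 1

    return (min_syl, max_syl)


def could_have(word, target):
    """True if target falls inside the word's estimated syllable range."""
    low, high = count_syllables(word)
    return low <= target <= high


if __name__ == "__main__":
    for w in ["paleontologist", "dinosaur", "cheese", "extemporaneously"]:
        print(w, count_syllables(w))
```

To hunt for double-dactyl candidates, you'd loop over the lines of a word list such as /usr/share/dict/words and keep the words for which could_have(word, 6) is true.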

And much to my surprise, without much tuning at all, my silly little script immediately gave much better results than the hyphenation dictionary or the dedicated websites.

Alas, although it did give me quite a few hexasyllabic words in /usr/share/dict/words, none of them were useful at all for a poem on Parasaurolophus. What I really needed was a musical term (since that's what the poem is about). What about a musical dictionary?

I found a list of musical terms on Wikipedia: Glossary of musical terminology, saved it as a local file, ran a few vim substitutes and turned it into a plain list of words. That did a little better, and gave me some possible ideas: (non?)contrapuntally? (something)harmonically? extemporaneously?
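I did that cleanup with vim substitutions, but the same step could be sketched in Python. This is only an illustration, not what I actually ran: the function name plain_words and the sample lines are made up, and it assumes the glossary has already been saved as plain text.

```python
import re


def plain_words(lines):
    """Collect the unique alphabetic words found in an iterable of text lines."""
    words = set()
    for line in lines:
        # [^\W\d_]+ matches runs of letters only (including accented ones),
        # so punctuation, digits and leftover markup are dropped.
        for token in re.findall(r"[^\W\d_]+", line):
            words.add(token.lower())
    return sorted(words)


if __name__ == "__main__":
    # Hypothetical sample lines standing in for the saved glossary file.
    sample = [
        "A cappella: without instrumental accompaniment",
        "Contrapuntal (adj.) - see counterpoint",
    ]
    for word in plain_words(sample):
        print(word)
```

With a real file, you'd pass the open file object instead of the sample list, then feed the resulting words to whatever syllable counter you're using.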

But none of them worked out, and by then I'd run out of steam. I gave up and blogged the poem as originally written, with the cheating two-word phrase "dinosaur orchestra", and vowed to write up how to count syllables in Python -- which I have now done. Quite noncontrapuntally, and definitely not extemporaneously. But at least I have a useful little script next time I want to get an approximate syllable count.

[ 17:51 Dec 11, 2013 ]

Wed, 18 Aug 2004

Dict's web pronunciations

I went to dict this afternoon to find out whether "cerebral" was best pronounced ser EE brul or SER e bral.

The Collaborative International Dictionary of English v.0.44 [gcide] gives two definitions, both with the same pronunciation: /Cer"e*bral/

Great! What does that mean? It's not the standard phonetic markings like dictionaries use (lucky for me, since if it used accent marks and such, I wouldn't be able to display it in my terminal font).

Jutta helped me out with the investigation, and with some combined googling and README-reading, she eventually found gcide's pronunc.web file.

That holds the key to the stresses: the double quote (") is a heavy stress (a light stress, which doesn't appear in this word, would be a backquote). The asterisk (*) is simply a hyphen to separate syllables (why they don't just use a dash, or even a space, I'm not sure).
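To make that concrete, here's a small sketch that splits a marked-up headword into syllables and stresses. It rests on assumptions of mine rather than anything I've confirmed in pronunc.web: that a stress mark comes right after the syllable it applies to, and that it also ends that syllable. The function name is made up, and it doesn't try to handle the special-character codes.

```python
import re

# Marker meanings as described above; '*' is a plain syllable break.
STRESS = {'"': "heavy", "`": "light"}


def parse_pronunciation(marked):
    """Split a string like /Cer"e*bral/ into (syllable, stress) pairs.

    Assumption: a stress mark follows the syllable it applies to
    and also acts as the syllable separator.
    """
    # Drop one surrounding pair of slashes, if present.
    if len(marked) >= 2 and marked.startswith("/") and marked.endswith("/"):
        marked = marked[1:-1]

    # The capturing group makes re.split keep the separators:
    # [text, separator, text, separator, ..., text]
    parts = re.split(r'(["`*])', marked)

    syllables = []
    for i in range(0, len(parts), 2):
        text = parts[i]
        if not text:
            continue
        separator = parts[i + 1] if i + 1 < len(parts) else ""
        syllables.append((text, STRESS.get(separator, "none")))
    return syllables


if __name__ == "__main__":
    for syllable, stress in parse_pronunciation('/Cer"e*bral/'):
        print(syllable, stress)
```

If the assumption about mark placement holds, "cerebral" comes out as three syllables with the heavy stress on the first, which matches SER e bral.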

That's progress, but what do those vowels mean? And the C? (Okay, I know it's pronounced as an ess. I even knew by now that both pronunciations are acceptable, since I'd looked it up in a dead-tree dictionary and so had about four other people on the channel while I was trying to track down a dict pronunciation guide). pronunc.web talks about a long list of special characters that are supposed to correspond to web fonts (haha). But dict doesn't actually use those: try dict free, which gives the pronunciation /free/, while pronunc.web says it should show up as /fr<emac// (which, you have to admit, would be pretty confusing what with the close slash for the special character followed by the close slash for the pronunciation; even aside from the question of who can read strings like /fr<emac//, ick).

Some other googling mentioned web dictionaries, including gcide, using the pronunciation guide from the Jargon file. Ironically, this was very hard to read since it uses smartquote characters all over the place which not only don't appear in the font I was using in mozilla, but also don't get substituted properly in mozilla (moz is usually pretty good about that) so I just see boxes. It's possible that the font claims to have the characters, then shows boxes instead.

Jutta wondered why PRONUNC.JPG and PRONUNC.WEB weren't in the Debian package, since they're mentioned in /usr/share/doc/dict-gcide/README.dictionary.gz. I have mixed feelings: I think it's a bug that there's no file that describes the pronunciation system being used, but since neither of those files does describe it, not including them is probably not a bug.

At Jutta's suggestion, I filed a bug on dict-gcide (bug 266773).

[ 21:19 Aug 18, 2004 ]