Shallow Thoughts : tags : poetry

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Wed, 11 Dec 2013

Counting syllables in Python

When I wrote recently about my Dactylic dinosaur doggerel, I glossed over a minor problem with my final poem: the rules of double-dactylic doggerel say that the sixth line (or sometimes the seventh) should be a single double-dactyl word -- something like "paleontologist" or "hexasyllabic'ly". I used "dinosaur orchestra" -- two words, which is cheating.

I don't feel too guilty about that. If you read the post, you may recall that the verse was the result of drifting grumpily through an insomniac morning where I would have preferred to be getting back to sleep. Coming up with anything that scans at all is probably good enough.

Still, it bugged me, not being able to think of a double-dactylic word that related somehow to Parasaurolophus. So I vowed that, later that day when I was up and at the computer, I would attempt to find one and rewrite the poem accordingly.

I thought that would be fairly straightforward. Not so much. I thought there would be some utility I could run that would count syllables for me, then I could run /usr/share/dict/words through it, print out all the 6-syllable words, and find one that fit. Turns out there is no such utility.

But Python has a library for everything, doesn't it?

Some searching turned up PyHyphen, which includes some syllable-counting functions. It apparently uses the hyphenation dictionaries that come with LibreOffice.

There's a Debian package for it, python-pyhyphen -- but it doesn't work. First, it depends on another package, hyphen-en-us, but that dependency isn't encoded in the package, even as a suggested or recommended package. And even when you install the hyphenation dictionary, it still doesn't work, because it doesn't look for the dictionary in the place where it was installed. Looks like that problem was reported almost two years ago, bug 627944: python-pyhyphen: doesn't work out-of-the-box with hyphen-* packages. There's a fix there that involves editing two files, /usr/lib/python2.7/dist-packages/hyphen/config.py and /usr/lib/python2.7/dist-packages/hyphen/__init__.py.

Or you can just give up on Debian and pip install pyhyphen, which is a lot easier.

But once you get it working, you find that it's terrible. It was wrong about almost every word I tried. I hope not too many people are relying on this hyphen-en-us dictionary for important documents. Its results seemed nearly random, and I quickly gave up on it for getting a useful list of words around six syllables.

Just for fun, since my web search for "count syllables" turned up quite a few websites claiming that functionality, I tried entering some of my long test words manually. All of the websites I tried were wrong more than half the time, and often they were off by more than two syllables. I don't mind off-by-ones -- I can look at words claiming 5 and 7 syllables while searching for double dactyls -- but if I have to include 4-syllable words as well, I'll never find what I'm looking for.

That discouraged me from using another Python suggestion I'd seen, the nltk (natural language toolkit) package. I've been looking for an excuse to play with nltk, and some day I will, but for this project I was looking for a quick approximate solution, and the nltk examples I found mostly looked like using it would require a bigger time commitment than I was willing to devote to silly poetry. And if none of the dedicated syllable-counting websites or dictionaries got it right, would a big time investment in nltk pay off?

Anyway, by this time I'd wasted more than an hour poking around various libraries and websites for this silly unimportant problem, and I decided that with that kind of time investment, I could probably do better on my own than the official solutions were giving me. Why not basically just count vowels?

So I whipped up a little script, countsyl, that did just that. I gave it a list of vowels, with a few simple rules. Obviously, you can't just say every vowel is a new syllable -- there are too many double vowels and silent letters and such. But you can't say that any run of multiple vowels together counts as one syllable, because sometimes the vowels do count; and you can't make absolute rules like "'e' at the end of a word is always silent", because sometimes it isn't. So I kept both minimum and maximum syllable counts for each word, and printed both.
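Here's a minimal sketch of that min/max idea. It is not the actual countsyl script: the function names, the choice to treat 'y' as a vowel, and the single trailing-'e' rule are all assumptions made for illustration.

```python
import re

def count_syllables(word):
    """Return a (minimum, maximum) syllable estimate for one word.

    Illustrative sketch only; the rules here are assumptions,
    not the rules used by the countsyl script.
    """
    word = word.lower()
    # Runs of adjacent vowels; 'y' is treated as a vowel throughout.
    groups = re.findall(r"[aeiouy]+", word)
    if not groups:
        return (0, 0)

    # Lower bound: each run of vowels is one syllable ("au" in dinosaur).
    minimum = len(groups)
    # Upper bound: every vowel is sounded ("eo" in paleontologist).
    maximum = sum(len(g) for g in groups)

    # A lone trailing 'e' is often silent, but not always ("recipe"),
    # so it lowers only the minimum and leaves the maximum alone.
    if word.endswith("e") and groups[-1] == "e" and minimum > 1:
        minimum -= 1
    return (minimum, maximum)


def could_have(word, target):
    """True if target falls within the word's estimated syllable range."""
    low, high = count_syllables(word)
    return low <= target <= high


def candidates(words, target=6):
    """Keep only the words that might have `target` syllables."""
    return [w for w in words if could_have(w, target)]


if __name__ == "__main__":
    for w in ["paleontologist", "dinosaur", "orchestrate", "recipe"]:
        print(w, count_syllables(w))
```

To scan a word list, pass the stripped lines of a file such as /usr/share/dict/words to candidates(). Keeping a range rather than a single number means a word like "recipe" still shows up when searching for three syllables, even though the silent-e guess alone would say two.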

And much to my surprise, without much tuning at all, my silly little script immediately gave much better results than the hyphenation dictionary or the dedicated websites.

Alas, although it did give me quite a few hexasyllabic words in /usr/share/dict/words, none of them were useful at all for a poem on Parasaurolophus. What I really needed was a musical term (since that's what the poem is about). What about a musical dictionary?

I found a list of musical terms on Wikipedia: Glossary of musical terminology, saved it as a local file, ran a few vim substitutes and turned it into a plain list of words. That did a little better, and gave me some possible ideas: (non?)contrapuntally? (something)harmonically? extemporaneously?
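The vim substitutions aren't recorded here, but the same cleanup can be sketched in Python. This is a swapped-in technique, not what was actually run: it assumes the saved glossary is already plain text (no HTML markup), and extract_words is a hypothetical helper name.

```python
import re

def extract_words(text):
    """Return the sorted, de-duplicated, lowercased words found in text.

    Assumes `text` is plain text; any markup should be stripped first.
    """
    # [^\W\d_]+ matches runs of letters, including accented ones
    # (many musical terms are Italian or French), but not digits
    # or underscores.
    words = re.findall(r"[^\W\d_]+", text)
    return sorted({w.lower() for w in words})


if __name__ == "__main__":
    sample = "Allegro: fast. Allegro, ma non troppo"
    for word in extract_words(sample):
        print(word)
```

Reading the saved file into a string and passing it to extract_words() gives a one-word-per-entry list that can then be fed to any syllable counter.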

But none of them worked out, and by then I'd run out of steam. I gave up and blogged the poem as originally written, with the cheating two-word phrase "dinosaur orchestra", and vowed to write up how to count syllables in Python -- which I have now done. Quite noncontrapuntally, and definitely not extemporaneously. But at least I have a useful little script next time I want to get an approximate syllable count.

[ 17:51 Dec 11, 2013    More programming | permalink to this entry | ]

Thu, 21 Nov 2013

Dinosaur Doggerel

I woke up thinking about dinosaurs.

Specifically, Pachycephalosaurus, the bone-headed dinosaur, and her long-crested cousin Parasaurolophus (pictured at right).

The previous night, I had been reading The Know-It-All, A. J. Jacobs's entertaining account of his adventures reading the whole Encyclopedia Britannica. I'd left off in the Ps, which included a very short entry on Pachycephalosaurus (A.J. is not particularly into dinosaurs).

Drifting along in a typical insomniac "I wish I could get back to sleep" haze, I couldn't help noticing that Parasaurolophus was six syllables -- in fact, it was a double dactyl.

And that meant it was a prime candidate for my favorite verse form, double-dactylic doggerel, a form with fairly strict rules which require, among other things, that the second line be a double-dactylic proper name. And as double-dactylic junkies know, once you've noticed a double-dactylic name, you can't rest until it's turned into a poem.

So now I couldn't sleep because I was thinking about Parasaurolophus. Now, even aside from its mellifluous name, Parasaurolophus and the whole hadrosaur family are pretty interesting. The biggest puzzle is why they had those elaborate bony crests. Decoration for mating purposes? Fighting, like horns and antlers on modern hoofed mammals? But in the late 1990s, CT scans of hadrosaur fossils revealed long air passages inside the crests of many hadrosaurs, including Parasaurolophus ... and those air passages were connected to the nasal passages. That led to suggestions that the crests might have been tuned for sound production -- a built-in wind instrument.

[computer model of Parasaurolophus crest] In Scientists Use Digital Paleontology to Produce Voice of Parasaurolophus Dinosaur, a team at Sandia made computer models of the air passages, and you can even listen to sound files of what Parasaurolophus might have sounded like. The sound is wonderful, like a trombone. Sandia's pages use an <embed> tag that didn't work for me in Firefox, so if you have trouble with their links, I've separated out the wav file URLs: songLQ.wav (588k) and a higher quality version, song2.wav (2.7M).

Anyway, I never did get back to sleep, but I did end up with some insomniacal doggerel:

Dinosaur, schminosaur
Parasaurolophus
How do you use that
Magnificent crest?
"I play trombone in the
Dinosaur orchestra
All hadrosaurs play, but
I am the best."

[ 16:49 Nov 21, 2013    More writing | permalink to this entry | ]