Shallow Thoughts : tags : language

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Thu, 26 Mar 2020

C is for Cabezon (and the Census too)

... You thought C would be coronavirus or COVID-19, I bet!

Well, I won't pretend I'm not as obsessed with it as everybody else. Of course I am. But, house-bound as we all are now, let's try to think about other things at least now and then. It's healthier.

[Cabezon Peak] One of the distinctive peaks here in northern New Mexico is a butte called Cabezón, west of the Jemez near Cuba.

It's a volcanic neck: the core of an old volcano, part of the Mt Taylor volcanic field. Once a basalt volcano stops erupting, the lava sitting inside it slowly cools and solidifies. Then, over time, the outside of the volcano erodes away, leaving the hard basalt that used to be lava in the throat of the volcano. It's the same process that made Tunyo or Black Mesa, the butte between Los Alamos and Española that's been featured in so many movies, and the same process that made the spectacular Shiprock.

Dave and I have driven past Cabezón peak several times, but haven't yet actually explored it. Supposedly there's a trail and you can climb to the top (reports vary on how difficult the climb is). One of these days.

Last week, Dave was poking around in a Spanish dictionary and discovered that -ón in Spanish is a suffix that denotes something larger. So, since cabeza means head, cabezón means big head. (Looking for confirmation on that, I found this useful page on 18 Spanish Suffixes You’ll Never Want to Let Go Of.) Apparently it can also mean stubborn, ditzy, or just having big hair.

But hearing that cabezón meant big head took me back to my childhood, and another meaning of cabezón.

When I was maybe ten, my father decided to take up fishing. He bought a rod and reel, and brought me along as we headed out to the docks (I don't remember where, but we were in Los Angeles, so it was probably somewhere around Santa Monica or San Pedro).

This didn't last long as a hobby; I don't think dad was cut out for fishing. And mostly he didn't catch anything. But on one of our last fishing trips, he caught a fish. An amazing fish. It wasn't especially big, maybe fourteen inches or so. It had a big head and a triangular body, with a flat belly as the base of the triangle. It had weird fins. It was dark olive green on two sides of the triangle, with a dull yellow belly. It looked prehistoric, and sent me running off to the books when we got home to make sure we hadn't caught a coelacanth.

After some research at the library (this was way pre internet), my father concluded that he'd caught something called, you guessed it, a cabezón.

Searching for photos now, I'm not so sure that's right. None of the photos I've found look that much like the fish I remember. But I can't find any more likely candidates, either (though I'm wondering about the Pacific staghorn sculpin as a possibility). I guess fish identification even now in the age of Google isn't all that much easier than it was in the seventies.

I don't think we ever ate the fish. It sat in his freezer for quite a while as he tried to identify it, and I'm not sure what happened after that.

So maybe I've seen a cabezón fish, and maybe I haven't. But it was fun to learn about the -ón suffix in Spanish, to find out the meaning of the name for that distinctive butte out near Cuba. One of these days Dave and I will go hike it. And if we make it to the top, we'll try not to get big heads about it.

Tags: nature, language
[ 19:17 Mar 26, 2020 More nature | permalink to this entry | ]

Wed, 11 Dec 2013

Counting syllables in Python

When I wrote recently about my Dactylic dinosaur doggerel, I glossed over a minor problem with my final poem: the rules of double-dactylic doggerel say that the sixth line (or sometimes the seventh) should be a single double-dactyl word -- something like "paleontologist" or "hexasyllabic'ly". I used "dinosaur orchestra" -- two words, which is cheating.

I don't feel too guilty about that. If you read the post, you may recall that the verse was the result of drifting grumpily through an insomniac morning where I would have preferred to be getting back to sleep. Coming up with anything that scans at all is probably good enough.

Still, it bugged me, not being able to think of a double-dactylic word that related somehow to Parasaurolophus. So I vowed that, later that day when I was up and at the computer, I would attempt to find one and rewrite the poem accordingly.

I thought that would be fairly straightforward. Not so much. I thought there would be some utility I could run that would count syllables for me, then I could run /usr/share/dict/words through it, print out all the 6-syllable words, and find one that fit. Turns out there is no such utility.

But Python has a library for everything, doesn't it?

Some searching turned up PyHyphen, which includes some syllable-counting functions. It apparently uses the hyphenation dictionaries that come with LibreOffice.

There's a Debian package for it, python-pyhyphen -- but it doesn't work. First, it depends on another package, hyphen-en-us, but doesn't have that dependency encoded in the package, even as a suggested or recommended package. But even when you install the hyphenation dictionary, it still doesn't work because it doesn't point to the dictionary in the place it was installed. Looks like that problem was reported almost two years ago, bug 627944: python-pyhyphen: doesn't work out-of-the-box with hyphen-* packages. There's a fix there that involves editing two files, /usr/lib/python2.7/dist-packages/hyphen/config.py and /usr/lib/python2.7/dist-packages/hyphen/__init__.py.

Or you can just give up on Debian and pip install pyhyphen, which is a lot easier.

But once you get it working, you find that it's terrible. It was wrong about almost every word I tried. I hope not too many people are relying on this hyphen-en-us dictionary for important documents. Its results seemed nearly random, and I quickly gave up on it for getting a useful list of words around six syllables.

Just for fun, since my count syllables web search turned up quite a few websites claiming that functionality, I tried entering some of my long test words manually. All of the websites I tried were wrong more than half the time, and often they were off by more than two syllables. I don't mind off-by-ones -- I can look at words claiming 5 and 7 syllables while searching for double dactyls -- but if I have to include 4-syllable words as well, I'll never find what I'm looking for.

That discouraged me from using another Python suggestion I'd seen, the nltk (natural language toolkit) package. I've been looking for an excuse to play with nltk, and some day I will, but for this project I was looking for a quick approximate solution, and the nltk examples I found mostly looked like using it would require a bigger time commitment than I was willing to devote to silly poetry. And if none of the dedicated syllable-counting websites or dictionaries got it right, would a big time investment in nltk pay off?

Anyway, by this time I'd wasted more than an hour poking around various libraries and websites for this silly unimportant problem, and I decided that with that kind of time investment, I could probably do better on my own than the official solutions were giving me. Why not basically just count vowels?

So I whipped up a little script, countsyl, that did just that. I gave it a list of vowels, with a few simple rules. Obviously, you can't just say every vowel is a new syllable -- there are too many double vowels and silent letters and such. But you can't say that any run of multiple vowels together counts as one syllable, because sometimes the vowels do count; and you can't make absolute rules like "'e' at the end of a word is always silent", because sometimes it isn't. So I kept both minimum and maximum syllable counts for each word, and printed both.
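Here's a minimal sketch of that min/max idea. It is not the actual countsyl script: the specific rules (treating "y" as a vowel, letting a multi-vowel run count as one or two syllables, treating a final "e" after a consonant as possibly silent) and the names vowel_runs and count_syllables are assumptions chosen for illustration.

```python
VOWELS = set("aeiouy")  # assumption: treat 'y' as a vowel


def vowel_runs(word):
    """Split a word into its runs of consecutive vowels."""
    runs, current = [], ""
    for ch in word.lower():
        if ch in VOWELS:
            current += ch
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


def count_syllables(word):
    """Return a (minimum, maximum) pair of plausible syllable counts."""
    word = word.lower()
    runs = vowel_runs(word)
    if not runs:
        return (0, 0)

    # Minimum: every run of vowels is a single syllable ("ea" in "beat").
    minsyl = len(runs)

    # Maximum: a run of two or more vowels might be two syllables ("eo" in "paleo").
    maxsyl = sum(2 if len(run) > 1 else 1 for run in runs)

    # A final lone 'e' after a consonant is often silent, so the minimum
    # drops by one, but the maximum still counts it in case it is voiced.
    if (len(word) > 2 and word.endswith("e")
            and word[-2] not in VOWELS and minsyl > 1):
        minsyl -= 1

    return (minsyl, maxsyl)


if __name__ == "__main__":
    for w in ("cabezon", "paleontologist", "create"):
        print(w, count_syllables(w))
```

Reporting a range instead of a single number means the ambiguous cases stay visible rather than getting silently guessed wrong.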

And much to my surprise, without much tuning at all my silly little script immediately gave much better results than the hyphenation dictionary or the dedicated websites.

Alas, although it did give me quite a few hexasyllabic words in /usr/share/dict/words, none of them were useful at all for a poem on Parasaurolophus. What I really needed was a musical term (since that's what the poem is about). What about a musical dictionary?

I found a list of musical terms on Wikipedia: Glossary of musical terminology, saved it as a local file, ran a few vim substitutes and turned it into a plain list of words. That did a little better, and gave me some possible ideas: (non?)contrapuntally? (something)harmonically? extemporaneously?
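Either word list can be filtered the same way: keep any word whose minimum-to-maximum range includes six. The sketch below assumes a one-word-per-line file. Its syllable_range function is a deliberately crude stand-in (vowel runs as the minimum, total vowel letters as the maximum), not the real script, and the names candidates and candidates_from_file are made up for this example.

```python
import re


def syllable_range(word):
    """Crude stand-in counter: (number of vowel runs, number of vowel letters)."""
    runs = re.findall(r"[aeiouy]+", word.lower())
    return (len(runs), sum(len(run) for run in runs))


def candidates(words, target=6, counter=syllable_range):
    """Yield words whose (min, max) syllable range includes the target."""
    for word in words:
        word = word.strip()
        if not word:
            continue
        low, high = counter(word)
        if low <= target <= high:
            yield word


def candidates_from_file(path="/usr/share/dict/words", target=6):
    """Read a one-word-per-line file and return the matching words."""
    with open(path) as fp:
        return list(candidates(fp, target))


if __name__ == "__main__":
    sample = ["cat", "paleontologist", "extemporaneously", "orchestra"]
    print(list(candidates(sample)))
```

Passing the counter in as a parameter makes it easy to swap in a better syllable counter later without touching the filtering loop.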

But none of them worked out, and by then I'd run out of steam. I gave up and blogged the poem as originally written, with the cheating two-word phrase "dinosaur orchestra", and vowed to write up how to count syllables in Python -- which I have now done. Quite noncontrapuntally, and definitely not extemporaneously. But at least I have a useful little script next time I want to get an approximate syllable count.

Tags: dinosaur, poetry, writing, programming, python, language
[ 17:51 Dec 11, 2013 More programming | permalink to this entry | ]

Wed, 18 Aug 2004

Dict's web pronunciations

I went to dict this afternoon to find out whether "cerebral" was best pronounced ser EE brul or SER e bral.

The Collaborative International Dictionary of English v.0.44 [gcide] gives two definitions, both with the same pronunciation: /Cer"e*bral/

Great! What does that mean? It's not the standard phonetic markings like dictionaries use (lucky for me, since if it used accent marks and such, I wouldn't be able to display it in my terminal font).

Jutta helped me out with the investigation, and with some combined googling and README-reading, she eventually found gcide's pronunc.web file.

That holds the key to the stresses: the double quote (") is a heavy stress (a light stress, not indicated here, would be a backquote). The asterisk (*) is simply a hyphen to separate syllables (why they don't just use a dash, or even a space, I'm not sure).
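Those three marks are enough to pull a string like /Cer"e*bral/ apart mechanically. This is only a sketch: it assumes each stress mark comes right after the syllable it applies to and also ends that syllable, and the function names parse_pronunciation and readable are invented for the example, not part of dict or gcide.

```python
import re

# Marks described in pronunc.web: heavy stress, light stress, plain separator.
STRESS = {'"': "heavy", "`": "light", "*": "none"}


def parse_pronunciation(marked):
    """Split a marked word into a list of (syllable, stress) pairs.

    Assumption: a stress mark follows the syllable it applies to.
    """
    marked = marked.strip("/")
    # The capturing group makes re.split keep the delimiters:
    # 'Cer"e*bral' -> ['Cer', '"', 'e', '*', 'bral']
    parts = re.split(r'(["`*])', marked)
    result = []
    for i in range(0, len(parts), 2):
        syllable = parts[i]
        mark = parts[i + 1] if i + 1 < len(parts) else "*"
        if syllable:
            result.append((syllable, STRESS[mark]))
    return result


def readable(marked):
    """Render heavy-stress syllables in capitals, e.g. 'CER e bral'."""
    return " ".join(
        syl.upper() if stress == "heavy" else syl.lower()
        for syl, stress in parse_pronunciation(marked)
    )


if __name__ == "__main__":
    print(parse_pronunciation('/Cer"e*bral/'))
    print(readable('/Cer"e*bral/'))
```

This only decodes the stress and syllable marks; it says nothing about how the vowels themselves are pronounced.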

That's progress, but what do those vowels mean? And the C? (Okay, I know it's pronounced as an ess. I even knew by now that both pronunciations are acceptable, since I'd looked it up in a dead-tree dictionary and so had about four other people on the channel while I was trying to track down a dict pronunciation guide). pronunc.web talks about a long list of special characters that are supposed to correspond to web fonts (haha). But dict doesn't actually use those: try dict free, which gives the pronunciation /free/, while pronunc.web says it should show up as /fr<emac// (which, you have to admit, would be pretty confusing what with the close slash for the special character followed by the close slash for the pronunciation; even aside from the question of who can read strings like /fr<emac//, ick).

Some other googling mentioned web dictionaries, including gcide, using the pronunciation guide from the Jargon file. Ironically, this was very hard to read since it uses smartquote characters all over the place which not only don't appear in the font I was using in mozilla, but also don't get substituted properly in mozilla (moz is usually pretty good about that) so I just see boxes. It's possible that the font claims to have the characters, then shows boxes instead.

Jutta wondered why PRONUNC.JPG and PRONUNC.WEB weren't in the Debian package, since they're mentioned in /usr/share/doc/dict-gcide/README.dictionary.gz. I have mixed feelings: I think it's a bug that there's no file that describes the pronunciation system being used, but since neither of those files does describe it, not including them is probably not a bug.

At Jutta's suggestion, I filed a bug on dict-gcide (bug 266773).

Tags: language
[ 21:19 Aug 18, 2004 More misc | permalink to this entry | ]
