Shallow Thoughts : tags : language
Akkana's Musings on Open Source Computing and Technology, Science, and Nature.
Thu, 26 Mar 2020
... You thought C would be coronavirus or COVID-19, I bet!
Well, I won't pretend I'm not as obsessed with it as everybody else.
Of course I am. But, house-bound as we all are now, let's try to
think about other things at least now and then. It's healthier.
One of the distinctive peaks here in northern New Mexico is a butte
called Cabezón, west of the Jemez near Cuba.
It's a volcanic neck: the core of an old volcano, part of the Mt Taylor
volcanic field. Once a basalt volcano stops erupting, the lava sitting
inside it slowly cools and solidifies. Then, over time, the outside
of the volcano erodes away, leaving the hard basalt that used to
be lava in the throat of the volcano. It's the same process that
made Tunyo or Black Mesa, the butte between Los Alamos and Española
that's been featured in so many movies, and the same process that
made the spectacular Shiprock.
Dave and I have driven past Cabezón peak several times, but haven't
yet actually explored it. Supposedly there's a trail and you can climb
to the top (reports vary on how difficult the climb is). One of these
days.
Last week, Dave was poking around in a Spanish dictionary and discovered
that -ón in Spanish is a suffix that denotes something larger.
So, since cabeza means head, cabezón means
big head.
(Looking for confirmation on that, I found this useful page on
18 Spanish Suffixes You’ll Never Want to Let Go Of.)
Apparently it can also mean stubborn, ditzy, or just having big hair.
But hearing that cabezón meant big head took me
back to my childhood, and another meaning of cabezón.
When I was maybe ten, my father decided to take up fishing.
He bought a rod and reel, and brought me along as we headed out
to the docks (I don't remember where, but we were in Los Angeles,
so it was probably somewhere around Santa Monica or San Pedro).
This didn't last long as a hobby; I don't think dad was cut out for
fishing. And mostly he didn't catch anything. But on one of our last
fishing trips, he caught a fish. An amazing fish. It wasn't especially
big, maybe fourteen inches or so. It had a big head and a triangular
body, with a flat belly as the base of the triangle. It had weird fins.
It was dark olive green on two sides of the triangle, with a dull yellow
belly. It looked prehistoric, and sent me running off to the books when
we got home to make sure we hadn't caught a coelacanth.
After some research at the library (this was way pre internet),
my father concluded that he'd caught something called, you guessed it,
a cabezón.
Searching for photos now, I'm not so sure that's right. None of the
photos I've found look that much like the fish I remember. But I can't
find any more likely candidates, either (though I'm wondering
about the Pacific staghorn sculpin as a possibility). I guess fish
identification even now in the age of Google isn't all that much
easier than it was in the seventies.
I don't think we ever ate the fish. It sat in his freezer for quite
a while while he tried to identify it, and I'm not sure what happened
after that.
So maybe I've seen a cabezón fish, and maybe I haven't.
But it was fun to learn about the -ón suffix in Spanish,
to find out the meaning of the name for that distinctive butte
out near Cuba. One of these days Dave and I will go hike it.
And if we make it to the top, we'll try not to get big heads about it.
Tags: nature, language
[
19:17 Mar 26, 2020
]
Wed, 11 Dec 2013
When I wrote recently about my Dactylic dinosaur doggerel, I glossed
over a minor problem with my final poem: the rules of double-dactylic
doggerel say that the sixth line (or sometimes the seventh) should
be a single double-dactyl word -- something like "paleontologist"
or "hexasyllabic'ly". I used "dinosaur orchestra" -- two words,
which is cheating.
I don't feel too guilty about that.
If you read the post, you may recall that the verse was the result of
drifting grumpily through an insomniac morning where I would have
preferred to be getting back to sleep. Coming up with anything that
scans at all is probably good enough.
Still, it bugged me, not being able to think of a double-dactylic word
that related somehow to Parasaurolophus. So I vowed that, later that
day when I was up and at the computer, I would attempt to find one and
rewrite the poem accordingly.
I thought that would be fairly straightforward. Not so much. I thought
there would be some utility I could run that would count syllables for
me, then I could run /usr/share/dict/words through it, print
out all the 6-syllable words, and find one that fit. Turns out there
is no such utility.
But Python has a library for everything, doesn't it?
Some searching turned up PyHyphen, which includes some
syllable-counting functions. It apparently uses the hyphenation
dictionaries that come with LibreOffice.
There's a Debian package for it, python-pyhyphen -- but it doesn't work.
First, it depends on another package, hyphen-en-us, but doesn't
have that dependency encoded in the package, even as a suggested or
recommended package. But even when you install the hyphenation dictionary,
it still doesn't work because it doesn't point to the dictionary in
the place it was installed.
Looks like that problem was reported almost two years ago, as
bug 627944: python-pyhyphen: doesn't work out-of-the-box with hyphen-* packages.
There's a fix there that involves editing two files,
/usr/lib/python2.7/dist-packages/hyphen/config.py and
/usr/lib/python2.7/dist-packages/hyphen/__init__.py.
Or you can just give up on Debian and pip install pyhyphen,
which is a lot easier.
But once you get it working, you find that it's terrible.
It was wrong about almost every word I tried.
I hope not too many people are relying on this hyphen-en-us dictionary
for important documents. Its results seemed nearly random, and I
quickly gave up on it for getting a useful list of words around
six syllables.
Just for fun, since my "count syllables" web search turned
up quite a few websites claiming that functionality, I tried entering
some of my long test words manually. All of the websites I tried were
wrong more than half the time, and often they were off by more than
two syllables. I don't mind off-by-ones -- I can look at words
claiming 5 and 7 syllables while searching for double dactyls --
but if I have to include 4-syllable words as well, I'll never find
what I'm looking for.
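That off-by-one tolerance is easy to express in code. Here's a minimal sketch of the filtering step, assuming you already have some function that estimates a word's syllable count. The names naive_count, candidates and read_words are made up for illustration, and naive_count is only a crude placeholder (one syllable per run of vowels), not any of the libraries or websites mentioned above.

```python
import re

def naive_count(word):
    """Placeholder estimate (an assumption): one syllable per run of vowels."""
    return len(re.findall(r"[aeiouy]+", word.lower()))

def candidates(words, count=naive_count, target=6, slack=1):
    """Return the words whose estimated count is within `slack` of `target`."""
    return [w for w in words if abs(count(w) - target) <= slack]

def read_words(path="/usr/share/dict/words"):
    """Read one word per line, skipping entries with apostrophes, digits, etc."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.strip() for line in f if line.strip().isalpha()]
```

With slack=1 this keeps anything estimated at 5, 6 or 7 syllables, so something like candidates(read_words()) gives a list to skim by eye. Any better counter can be swapped in through the count argument.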
That discouraged me from using another Python suggestion I'd seen, the
nltk (natural language toolkit) package. I've been looking for an
excuse to play with nltk, and some day I will, but for this project
I was looking for a quick approximate solution, and the nltk examples
I found mostly looked like using it would require a bigger time
commitment than I was willing to devote to silly poetry. And if
none of the dedicated syllable-counting websites or dictionaries
got it right, would a big time investment in nltk pay off?
Anyway, by this time I'd wasted more than an hour poking around
various libraries and websites for this silly unimportant problem,
and I decided that with that kind of time investment, I could probably
do better on my own than the official solutions were giving me.
Why not basically just count vowels?
So I whipped up a little script, countsyl,
that did just that. I gave it a list of vowels, with a few simple rules.
Obviously, you can't just say every vowel is a new syllable -- there
are too many double vowels and silent letters and such. But you can't
say that any run of multiple vowels together counts as one syllable,
because sometimes the vowels do count; and you can't make absolute
rules like "'e' at the end of a word is always silent", because
sometimes it isn't. So I kept both minimum and maximum syllable counts
for each word, and printed both.
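To make the min/max idea concrete, here's a rough sketch of that kind of counter. It is not the actual countsyl script, just an illustration under simple assumptions: every run of vowels is at least one syllable and at most one syllable per vowel letter, and a lone final 'e' might or might not be silent. The function name count_syllables is made up for this example.

```python
import re

def count_syllables(word):
    """Return (minimum, maximum) approximate syllable counts for a word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    # Minimum: every run of adjacent vowels collapses into one syllable.
    minsyl = len(groups)
    # Maximum: every vowel letter is pronounced as its own syllable.
    maxsyl = sum(len(g) for g in groups)
    # A lone final 'e' is often silent, but not always, so it only
    # lowers the minimum and leaves the maximum alone.
    if minsyl > 1 and word.lower().endswith("e") and groups[-1] == "e":
        minsyl -= 1
    return minsyl, maxsyl

for w in ("paleontologist", "made", "orchestra"):
    print(w, count_syllables(w))
```

For "paleontologist" the "eo" is the only ambiguous spot, so the sketch reports a range of 5 to 6, and the true count of 6 falls inside it.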
And much to my surprise, without much tuning at all my silly little
script immediately gave much better results than the hyphenation
dictionary or the dedicated websites.
Alas, although it did give me quite a few hexasyllabic words in
/usr/share/dict/words, none of them were useful at all for a poem
on Parasaurolophus. What I really needed was a musical term (since
that's what the poem is about). What about a musical dictionary?
I found a list of musical terms on Wikipedia: Glossary of musical
terminology, saved it as a local file,
ran a few vim substitutes and turned it into a plain list of words.
That did a little better, and gave me some possible ideas:
(non?)contrapuntally?
(something)harmonically?
extemporaneously?
But none of them worked out, and by then I'd run out of steam.
I gave up and blogged the poem as originally written, with the
cheating two-word phrase "dinosaur orchestra", and vowed to write
up how to count syllables in Python -- which I have now done.
Quite noncontrapuntally, and definitely not extemporaneously.
But at least I have a useful little script next time I want to
get an approximate syllable count.
Tags: dinosaur, poetry, writing, programming, python, language
[
17:51 Dec 11, 2013
]
Wed, 18 Aug 2004
I went to dict this afternoon to find out whether "cerebral" was
best pronounced ser EE brul or SER e bral.
The Collaborative International Dictionary of English v.0.44 [gcide]
gives two definitions, both with the same pronunciation:
/Cer"e*bral/
Great! What does that mean? It's not the standard phonetic
markings like dictionaries use (lucky for me, since if it used
accent marks and such, I wouldn't be able to display it in my
terminal font).
Jutta helped me out with the investigation, and with some combined
googling and README-reading, she eventually found
gcide's pronunc.web file.
That holds the key to the stresses: the double quote (") is a heavy stress
(a light stress, not indicated here, would be a backquote).
The asterisk (*) is simply a hyphen to separate syllables (why
they don't just use a dash, or even a space, I'm not sure).
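Those three markers are enough to pull a pronunciation apart mechanically. Here's a small sketch, assuming each stress mark follows the syllable it applies to (which is how it looks in Cer"e*bral) and that a syllable ending in an asterisk, or in nothing at all, is unstressed. The function name parse_gcide_pron is made up, and this only handles the stress and syllable markers, not the vowel symbols.

```python
import re

def parse_gcide_pron(pron):
    """Split a gcide-style pronunciation into (syllable, stress) pairs.

    stress is "heavy" after a double quote, "light" after a backquote,
    and None otherwise. Assumes the mark follows the syllable it stresses.
    """
    pron = pron.strip("/")
    marks = {'"': "heavy", "`": "light"}
    pairs = []
    # Each syllable is a run of characters ended by ", ` or *, or by the
    # end of the string.
    for syllable, mark in re.findall(r'([^"`*]+)(["`*]?)', pron):
        pairs.append((syllable, marks.get(mark)))
    return pairs

print(parse_gcide_pron('/Cer"e*bral/'))
```

For /Cer"e*bral/ this gives three syllables with the heavy stress on the first one, which matches the SER e bral reading.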
That's progress, but what do those vowels mean? And the C? (Okay,
I know it's pronounced as an ess. I even knew by now that both
pronunciations are acceptable, since I'd looked it up in a dead-tree
dictionary and so had about four other people on the channel while
I was trying to track down a dict pronunciation guide).
pronunc.web talks about a long list of special characters
that are supposed to correspond to web fonts (haha). But dict
doesn't actually use those: try dict free, which gives the
pronunciation /free/, while pronunc.web says it should show up
as /fr<emac// (which, you have to admit, would be pretty
confusing what with the close slash for the special character
followed by the close slash for the pronunciation; even aside from
the question of who can read strings like /fr<emac//,
ick).
Some other googling mentioned web dictionaries, including gcide,
using the pronunciation guide from the Jargon file. Ironically, this
was very hard to
read since it uses smartquote characters all over the place which
not only don't appear in the font I was using in mozilla, but also
don't get substituted properly in mozilla (moz is usually pretty
good about that) so I just see boxes. It's possible that the font
claims to have the characters, then shows boxes instead.
Jutta wondered why PRONUNC.JPG and PRONUNC.WEB
weren't in the Debian package, since they're mentioned in
/usr/share/doc/dict-gcide/README.dictionary.gz. I have mixed
feelings: I think it's a bug that there's no file that describes
the pronunciation system being used, but since neither of those
files does describe it, not including them is probably not a bug.
At Jutta's suggestion, I filed a bug on dict-gcide (bug 266773).
Tags: language
[
21:19 Aug 18, 2004
]