Shallow Thoughts : : programming
Akkana's Musings on Open Source Computing, Science, and Nature.
Sat, 13 Apr 2013
We've been considering the possibility of moving out of the Bay Area
to somewhere less crowded, somewhere in the desert southwest we so
love to visit. But that also means moving to somewhere
with much harsher weather.
How harsh? It's pretty easy to search for a specific location and get
average temperatures. But what if I want to make a table to compare
several different locations? I couldn't find any site that made
that easy.
No problem, I say. Surely there's a Python library, I say.
Well, no, as it turns out. There are Python APIs to get the current
weather anywhere; but if you want historical weather data, or weather
data averaged over many years, you're out of luck.
NOAA purports to have historical climate data, but the only dataset I
found was spotty and hard to use. There's an
FTP site containing
directories by year; inside are gzipped files with names like
723710-03162-2012.op.gz. The first two numbers are station numbers,
and there's a file at the top level called ish-history.txt
with a list of the station codes and corresponding numbers.
Not obvious!
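As a small illustration of that naming scheme, here is a sketch that builds and splits such a filename. The function names are made up for this example, and the zero-padded widths (six digits for the first number, five for the second) are an assumption based on the single filename shown above.

```python
def gsod_filename(stn, wban, year):
    """Build a name like 723710-03162-2012.op.gz from two station numbers and a year.

    The padding widths are guessed from the one example filename.
    """
    return "%06d-%05d-%d.op.gz" % (int(stn), int(wban), int(year))


def parse_gsod_filename(name):
    """Split a filename back into (first number, second number, year)."""
    suffix = ".op.gz"
    stem = name[:-len(suffix)] if name.endswith(suffix) else name
    stn, wban, year = stem.split("-")
    # Keep the station numbers as strings so leading zeros survive.
    return stn, wban, int(year)


print(gsod_filename("723710", "03162", 2012))
print(parse_gsod_filename("723710-03162-2012.op.gz"))
```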
Once you figure out the station codes, the files themselves are easy to
parse, with lines like
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
724945 23293 20120101 49.5 24 38.8 24 1021.1 24 1019.5 24 9.9 24 1.5 24 4.1 999.9 68.0 37.0 0.00G 999.9 000000
Each line represents one day (20120101 is January 1st, 2012),
and the codes are explained in another file called
GSOD_DESC.txt.
For instance, MAX is the daily high temperature, and SNDP is snow depth.
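To make the parsing step concrete, here is a minimal sketch (written for Python 3) that splits the sample line above on whitespace. It is not the code from noaatemps.py. The field positions are read off that one sample line, which has more columns than the header because several values are followed by a count, and treating 999.9 as "missing" is an assumption based on the gust and snow depth columns in the sample; GSOD_DESC.txt is the authority on both.

```python
import string
from datetime import datetime

SAMPLE = ("724945 23293 20120101 49.5 24 38.8 24 1021.1 24 1019.5 24 "
          "9.9 24 1.5 24 4.1 999.9 68.0 37.0 0.00G 999.9 000000")


def to_float(field, missing=999.9):
    """Strip trailing flag characters (like the G on PRCP) and convert.

    Returns None when the value equals the assumed missing-value marker.
    """
    value = float(field.rstrip("*" + string.ascii_uppercase))
    return None if value == missing else value


def parse_gsod_line(line):
    """Pick a few fields out of one whitespace-separated daily record."""
    f = line.split()
    return {
        "station": f[0],
        "date": datetime.strptime(f[2], "%Y%m%d").date(),
        "mean": to_float(f[3]),
        "max": to_float(f[17]),         # MAX: daily high
        "min": to_float(f[18]),         # MIN: daily low
        "precip": to_float(f[19]),      # PRCP, with its flag letter removed
        "snow_depth": to_float(f[20]),  # SNDP
    }


record = parse_gsod_line(SAMPLE)
print(record["date"], record["max"], record["min"])
```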
So all I needed was to decode the station names, download the right files
and parse them. That took about a day to write (including a lot of
time wasted futzing with mysterious incantations for matplotlib).
Little accessibility refresher: I showed it to Dave -- "Neat, look at
this, San Jose is the blue pair, Flagstaff is green and Page is red."
His reaction:
"This makes no sense. They all look the same to me. I have no idea
which is which."
Oops -- right. Don't use color as your only visual indicator. I knew that,
supposedly! So I added markers in different shapes for each site.
(I wish somebody would teach that lesson to Google Maps, which uses
color as its only indicator on the traffic layer, so it's useless
for red-green colorblind people.)
Back to the data --
it turns out NOAA doesn't actually have that much historical data
available for download. If you search on most of these locations,
you'll find sites that claim to have historical temperatures dating
back 50 years or more, sometimes back to the 1800s. But NOAA typically
only has files starting at about 2005 or 2006. I don't know where
sites are getting this older data, or how reliable it is.
Still, averages since 2006 are interesting to compare.
Here's a run of noaatemps.py KSJC KFLG KSAF KLAM KCEZ KPGA KCNY.
It's striking how moderate California weather is compared
to any of these inland sites. No surprise there. What did surprise me
was that Los Alamos, despite its high elevation, has more moderate weather
than most of the others -- lower highs, higher lows. I was a bit
disappointed at how sparse the site list was -- no site in Moab?
Really? So I used Canyonlands Field instead.
Anyway, it's fun for a data junkie to play around with, and it prints
data on other weather factors, like precipitation and snowpack, although
it doesn't plot them yet.
The code is on my
GitHub
scripts page, under Weather.
Anyone found a better source for historical weather information?
I'd love to have something that went back far enough to do some
climate research, see what sites are getting warmer, colder, or
seeing greater or lesser spreads between their extreme temperatures.
The NOAA dataset obviously can't do that, so there must be something
else that weather researchers use. Data on other countries would be
interesting, too. Is there anything that's available to the public?
Tags: python, programming, weather, data
[
21:57 Apr 13, 2013
More programming |
permalink to this entry |
comments
]
Tue, 19 Mar 2013
One of the closing lightning talks at PyCon this year concerned the answers
to a list of Python programming puzzles given at some other point during
the conference. I hadn't seen the questions (I'm still not sure
where they are), but some of the problems looked fun.
One of them was: "What are the letters not used in Python keywords?"
I hadn't known about Python's keyword module, which could
come in handy some day:
>>> import keyword
>>> keyword.kwlist
['and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'exec', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'not', 'or', 'pass', 'print', 'raise', 'return', 'try', 'while', 'with', 'yield']
So, given the list of keywords, what's the best way to find the list
of unique letters?
Any time you want a list of unique anything, you want a set.
For instance,
>>> set([1, 2, 3, 2, 2, 4, 5, 1, 5])
set([1, 2, 3, 4, 5])
But first you need a list of letters so you can make a set out of it.
Split the list of words into a list of letters
My first idea was to use list comprehensions. You can split a single
word into letters like this:
>>> [ x for x in 'hello' ]
['h', 'e', 'l', 'l', 'o']
It took a bit of fiddling to get the right syntax to apply that to
every word in the list:
>>> [[c for c in w] for w in keyword.kwlist]
[['a', 'n', 'd'], ['a', 's'], ['a', 's', 's', 'e', 'r', 't'], ... ]
Update: Dave Foster points out that
[list(w) for w in keyword.kwlist] is another way,
simpler and cleaner than the double list comprehension.
That's a list of lists, so it needs to be flattened into a single
list of letters before we can turn it into a set.
Flatten the list of lists
There are lots of ways to flatten a list of lists.
Here are four of them:
[item for sublist in [[c for c in w] for w in keyword.kwlist] for item in sublist]
reduce(lambda x,y: x+y, [[c for c in w] for w in keyword.kwlist])
import itertools
list(itertools.chain.from_iterable([[c for c in w] for w in keyword.kwlist]))
sum([[c for c in w] for w in keyword.kwlist], [])
That last one, using sum(), makes use of the fact that
Python uses + for list concatenation -- in other words, that
[1, 2, 3] + [4, 5, 6] is [1, 2, 3, 4, 5, 6].
But the first method (item for sublist in) is faster: see
Making a flat list out of list of lists in Python
on StackOverflow.
And another StackOverflow thread has a
nice script
for plotting speed vs. list size of various flatteners.
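If you'd rather check the four flatteners yourself, here is a small sketch for Python 3, where reduce has moved into functools. The function names are invented for this example. It confirms that all four produce the same list and, when run as a script, prints rough timings; no particular numbers are claimed, since they depend on your machine and on the list size.

```python
import itertools
import keyword
import timeit
from functools import reduce  # reduce is not a builtin in Python 3

nested = [list(w) for w in keyword.kwlist]


def flatten_comprehension(lists):
    return [item for sublist in lists for item in sublist]


def flatten_reduce(lists):
    return reduce(lambda x, y: x + y, lists, [])


def flatten_chain(lists):
    return list(itertools.chain.from_iterable(lists))


def flatten_sum(lists):
    return sum(lists, [])


FLATTENERS = [flatten_comprehension, flatten_reduce,
              flatten_chain, flatten_sum]

# Every flattener should give the letters of all keywords, in order.
expected = list("".join(keyword.kwlist))
for func in FLATTENERS:
    assert func(nested) == expected

if __name__ == "__main__":
    for func in FLATTENERS:
        seconds = timeit.timeit(lambda: func(nested), number=1000)
        print("%-24s %.4f sec per 1000 runs" % (func.__name__, seconds))
```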
A simpler way of making the set
But it turns out none of this list comprehension stuff is needed anyway.
set('word') splits words into letters already:
>>> set('bubble')
set(['e', 'b', 'u', 'l'])
Ignore the order -- elements of a set often end up displaying in some
strange order. The important thing is that it has all the letters
and no repeats.
Now we have an easy way of making a set containing the letters in
one word. But how do we apply that to a list of words?
Again I initially tried using list comprehensions, then realized
there's an easier way. Given a list of strings, it's trivial to
join them into a single string using ''.join(). And that gives us
our set of letters within keywords:
>>> set(''.join(keyword.kwlist))
set(['a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'm', 'l', 'o', 'n', 'p', 's', 'r', 'u', 't', 'w', 'y', 'x'])
What letters are not in the set?
Almost done! But the original problem was to find the letters not in
keywords. We can do that by subtracting this set from the set of all
letters from a to z. How do we get that? The string
module will give us a string containing them:
>>> import string
>>> string.lowercase
'abcdefghijklmnopqrstuvwxyz'
You could also use a list comprehension and ord and
chr (alas, range won't give you a range of
letters directly):
>>> [chr(i) for i in range(ord('a'), ord('z')+1)]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
It's a bit longer, but doesn't require an import.
Now that you have your a-z set, just subtract the two sets:
>>> set(string.lowercase[:]) - set(''.join(keyword.kwlist))
set(['q', 'j', 'z', 'v'])
So the only letters not used in Python keywords are q, j, z and v.
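The transcripts above are from Python 2. For anyone following along in Python 3, here is a sketch of the same subtraction. It uses string.ascii_lowercase, since string.lowercase no longer exists there, and it hard-codes the Python 2.7 keyword list from above so the answer doesn't depend on which interpreter you run. The function name is invented for this example.

```python
import keyword
import string

# The keyword list exactly as printed by Python 2.7 above.
PY27_KEYWORDS = [
    'and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del',
    'elif', 'else', 'except', 'exec', 'finally', 'for', 'from', 'global',
    'if', 'import', 'in', 'is', 'lambda', 'not', 'or', 'pass', 'print',
    'raise', 'return', 'try', 'while', 'with', 'yield',
]


def unused_letters(words, alphabet=string.ascii_lowercase):
    """Return the set of letters in alphabet that appear in none of the words."""
    return set(alphabet) - set("".join(words))


print(sorted(unused_letters(PY27_KEYWORDS)))   # ['j', 'q', 'v', 'z']

# The same question for whatever interpreter is running this:
print(sorted(unused_letters(keyword.kwlist)))
```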
Just a useless little ditty, really ... but I thought it was a fun exercise,
so maybe you will too.
Tags: programming, python
[
12:36 Mar 19, 2013
]
Thu, 21 Feb 2013
I'm excited about my new project: MetaPho, an image tagger.
It arose out of a discussion on the LinuxChix Techtalk list:
photo collection management software.
John Sturdy was looking for an efficient way of viewing and tagging
large collections of photos. Like me, he likes fast, lightweight,
keyboard-driven programs. And like me, he didn't want a database-driven
system that ties you forever to one image cataloging program.
I put my image tags in plaintext files, named Keywords, so that
I can easily write scripts to search or modify them, or use grep,
and I can even make quick changes with a text editor.
I shared some tips on how I use my
Pho image viewer
for tagging images, and it sounded close to what he was looking for.
But as we discussed ideas about image tagging, we realized that
there were things he wanted to do that pho doesn't do well, things
not offered by any other image tagger we've been able to find.
While discussing how we might add new tagging functionality to pho,
I increasingly had the feeling that I was trying to fit off-road
tires onto a Miata -- or insert your own favorite metaphor for "making
something do something it wasn't designed to do."
Pho is a great image viewer, but the more I patched it to handle tagging,
the uglier and more complicated the code got, and it also got more
complex to use.
And really, everything we needed for tagging could be easily done in
a Python-GTK application. (Pho is written in C because it does a lot
of complicated focus management to deal with how window managers
handle window moving and resizing. A tagger wouldn't need any of that.)
I whipped up a demo image viewer in a few hours and showed it to John.
We continued the discussion, I made a GitHub repo, and over the next
week or so the code grew into an efficient and already surprisingly usable
image tagger.
We have big plans for it, like tags organized into categories so we
can have lots of tags without cluttering the interface too much.
But really, even as it is, it's better than anything I've used before.
I've been scanning in lots of photos from old family albums
(like this one of my mother and grandmother, and me at 9 months)
and it's been great to be able to add and review tags easily.
If you want to check out MetaPho, or contribute to it (either code or
user interface design), it lives in my
MetaPho
repository on GitHub.
And I wrote up a quick man page in markdown format:
metapho.1.md.
Feedback and contributors welcome!
Tags: programming, pho, image viewer, python, tagging, metapho
[
18:31 Feb 21, 2013
]
Sat, 08 Dec 2012
Having not had much luck with spam filtering solutions like SpamAssassin,
I'm forever having to add new spam filters by hand. For instance, after
about the sixth time I get "President Waives Refi Requirement"
or "Melt your fat! MUST WATCH this video now!" within a couple of
hours, I'm pretty tired of it and don't want to see any more of them.
With mail filtering programs like procmail or maildrop, it's easy
enough to match a pattern like "Subject:.*Refi Requirement" or
"Subject:.*Melt your fat" and filter that message to a spam folder
(or /dev/null).
But increasingly, I add patterns I'm seeing in spam messages, and yet
the messages with those patterns keep coming in. Why? Because the
spammers are using RFC 2047
to encode the subject into some other character set.
Here's how it works. A spammer sends a subject line that looks
something like this:
Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=
Mail programs are smart enough to decode this into:
Subject: Stop Overpaying for Printer Ink
but spam filtering programs often aren't, so your "printer ink" filter
won't catch it. And if you look through your spam folder with tools like
grep to see why it didn't get caught, or to find particularly spammy
subjects that might call for a filter
(grep Subject spamfolder | sort is pretty handy),
these encoded subjects will be incognito.
I briefly tried setting up a filter that spam-filed anything with =? in the
Subject line. But that's way too broad a brush --
there are legitimate reasons for using other charsets even in English
language email. It's relatively rare, but it happens. And some bots,
notably the Adafruit forum notification bot
and the bot that sends out announcements from my alma mater,
unaccountably encode the charset even when they're sending mail
entirely in US ASCII.
So what's really needed is not to filter out all messages that specify
a charset, but to decode the Subject so the spam filter can see it and
filter it accordingly.
How? I couldn't find any ready-made tool
available for Linux that could decode RFC 2047 headers; but the Python
email package makes decoding a one-line task.
In the Python interpreter:
$ python
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> email.Header.decode_header("Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=")
[('Subject:', None), ('Stop Overpaying for Printer Ink', 'utf-8')]
>>>
So it's easy to write a script that can pull headers out of email
messages (files) and decode them. Just look for the line starting with
the header you want to match -- e.g. "Subject:" -- and pass that line
to email.Header.decode_header().
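The session above is Python 2. As a minimal sketch of the same idea in Python 3, where the module is spelled email.header and the decoded pieces come back as bytes, something like this turns an encoded subject value into plain text. The helper name decode_value is made up for this example, and it is not taken from decodemail.py.

```python
from email.header import decode_header


def decode_value(raw):
    """Decode an RFC 2047 encoded header value into a plain string."""
    parts = []
    for text, charset in decode_header(raw):
        # Encoded words come back as bytes plus a charset name;
        # unencoded text may come back as str or as bytes with charset None.
        if isinstance(text, bytes):
            text = text.decode(charset or "ascii", errors="replace")
        parts.append(text)
    return "".join(parts)


subject = "=?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?="
print(decode_value(subject))
```

A charset name that Python doesn't recognize would raise LookupError here; a real filter would probably want to catch that and fall back to the raw text.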
Only one snag. If the subject is longer than about 20 characters,
spammers will often opt to split it up into multiple groups, sometimes
even in different character sets. So for example, you might see
something like this, spread over multiple lines:
Subject: =?windows-1252?Q?Earn_your_degree_=97_on_your_time?=
=?windows-1252?Q?_and_terms?=
The script has to handle that too. If it's reading a header, it has to
check the next line, and if that line begins with whitespace, treat it
as more of the header.
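Here is a rough sketch of that continuation-line handling, again in Python 3 and again not the code from decodemail.py. The function names and the sample message lines are invented for illustration; only the two-line Subject is from the example above.

```python
from email.header import decode_header


def unfold_header(lines, name="Subject"):
    """Return the named header's value with continuation lines joined in.

    Returns None if the header is not found.
    """
    prefix = name.lower() + ":"
    pieces = None
    for line in lines:
        line = line.rstrip("\r\n")
        if pieces is None:
            if line.lower().startswith(prefix):
                pieces = [line[len(prefix):].strip()]
        elif line[:1] in (" ", "\t"):
            # Starts with whitespace: still part of the same header.
            pieces.append(line.strip())
        else:
            break
    return " ".join(pieces) if pieces is not None else None


def decode_value(raw):
    """Decode RFC 2047 encoded words in a header value."""
    parts = []
    for text, charset in decode_header(raw):
        if isinstance(text, bytes):
            text = text.decode(charset or "ascii", errors="replace")
        parts.append(text)
    return "".join(parts)


message = [
    "From: someone@example.com",
    "Subject: =?windows-1252?Q?Earn_your_degree_=97_on_your_time?=",
    " =?windows-1252?Q?_and_terms?=",
    "To: me@example.com",
]
print(decode_value(unfold_header(message)))
```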
The resulting script, decodemail.py
(on github), seems pretty handy and should be able to be plugged in
to a mail filtering program.
Tags: email, spam
[
20:45 Dec 08, 2012
]
Wed, 17 Oct 2012
A little while back I wrote about my
Python
xchat script to play sound alerts.
But one thing that's been annoying me about it -- it was a problem
with the old perl alert script too -- is the repeated sounds.
If lots of twitter updates come in on the Bitlbee channel, or if
someone pastes numerous lines into a channel, I hear POPPOPPOPPOPPOPPOP
or repetitions of whatever the alert sound is for that type of message.
It's annoying to me, but even more so to anyone else in the same room.
It would be so much nicer if I could have it play just one repetition
of any given alert, even if there are eight lines all coming in at the
same time. So I decided to write a Python class to handle that.
My existing code used subprocesses to call the basic ALSA sound player,
/usr/bin/aplay -q.
Initially I used
if not os.fork() : os.execl(APLAY, APLAY, "-q", alertfile)
but I later switched to the cleaner
subprocess.call([APLAY, '-q', alertfile])
But of course, it would be better to do it all from Python without
requiring an external process like aplay. So I looked into that first.
Sadly, it turns out Python audio support is a mess. The built-in libraries
are fairly limited in functionality and formats, and the external
libraries that handle sound are mostly unmaintained, unless you want
to pull in a larger library like pygame. After a little web searching
I decided that maybe an aplay subprocess wasn't so bad after all.
Okay, so how should I handle the subprocesses? I decided the best way was
to keep track of what sound was currently playing. If another alert fires
for the same sound while that sound is already playing, just ignore it.
If an alert comes in for a different sound, then wait() for the
current sound to finish, then start the new sound.
That's all quite easy with Python's subprocess module.
subprocess.Popen() returns a Popen object that tracks
a process ID and can check whether that process has finished or not.
If self.curpath is the path to the sound currently playing
and self.current is the Popen object for whatever aplay process
is currently running, then:
if self.current :
    if self.current.poll() is None :
        # Current process hasn't finished yet. Is this the same sound?
        if path == self.curpath :
            # A repeat of the currently playing sound.
            # Don't play it more than once.
            return
        else :
            # Trying to play a different sound.
            # Wait on the current sound then play the new one.
            self.wait()
self.curpath = path
self.current = subprocess.Popen([ "/usr/bin/aplay", '-q', path ] )
Finally, it's a good idea when exiting the program to check whether
any aplay process is running, and wait() for it. Otherwise, you might
end up with a zombie aplay process.
def __del__(self) :
    self.wait()
I don't know if xchat actually closes down Python objects gracefully,
so I don't know whether the __del__ destructor will actually be called.
But at least I tried. It's possible that a
context
manager might be more reliable.
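For what it's worth, here is a sketch of what a context manager version might look like. It is not the class in pyplay.py: the player_cmd parameter is an invention for this example (it makes the class easy to try with some other command on a machine without aplay), and whether a with block fits into how xchat loads and unloads scripts is an open question.

```python
import subprocess


class SoundPlayer:
    """Play sounds through an external player, skipping back-to-back repeats."""

    def __init__(self, player_cmd=("/usr/bin/aplay", "-q")):
        # player_cmd is a hypothetical knob; the default mirrors the aplay call above.
        self.player_cmd = list(player_cmd)
        self.curpath = None
        self.current = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Reap any player process still running when the with block ends.
        self.wait()
        return False

    def play(self, path):
        if self.current is not None and self.current.poll() is None:
            if path == self.curpath:
                # Same sound is already playing: don't play it again.
                return
            # A different sound: let the current one finish first.
            self.wait()
        self.curpath = path
        self.current = subprocess.Popen(self.player_cmd + [path])

    def wait(self):
        if self.current is not None:
            self.current.wait()
            self.current = None
            self.curpath = None
```

Usage would look like `with SoundPlayer() as player: player.play("alert.wav")`, with the wait happening automatically on the way out of the block.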
The full scripts are on github at
pyplay.py
for the basic SoundPlayer class, and
chatsounds.py
for the xchat script that includes SoundPlayer.
Tags: programming, python, audio
[
12:07 Oct 17, 2012
]
Wed, 26 Sep 2012
I use xchat as my IRC client. Mostly I like it, but its sound alerts
aren't quite as configurable as I'd like. I have a few channels, like
my Bitlbee Twitter feed, where I want a much more subtle alert, or no
alert at all. And I want an easy way of turning sounds on and off,
in case I get busy with something and need to minimize distractions.
Years ago I grabbed a perl xchat plug-in called "Smet's NickSound"
that did something close to what I wanted. I've hacked a few things
into it. But every time I try to customize it any further, I'm hit
with the pain of write-only Perl. I've written Perl scripts, honest.
But I always have a really hard time reading anyone else's Perl code
and figuring out what it's doing. When I dove in again recently to
try to figure out why I was getting so many alerts when first starting
up xchat, I finally decided: learning how to write a Python xchat
script couldn't be any harder than reverse engineering a Perl one.
First, of course, I looked for an existing nick sound Python script ...
and totally struck out. In fact, mostly I struck out on finding any
xchat Python scripts at all. I know there are
Python bindings for
xchat, because there's documentation for them. But sample plug-ins?
Nope. For some reason, nobody's writing xchat plug-ins in Python.
I eventually found two minimal examples:
this very
simple example and the more elaborate
utf8decoder.
I was able to put them together and cobble up a working nick sound plug-in.
It's easy once you have an example to work from to help you figure out
the event hook arguments.
So here's my own little example, which may help the next person trying
to learn xchat Python scripting:
chatsounds.py
on github.
Tags: programming, python, xchat, irc
[
21:13 Sep 26, 2012
]
Wed, 19 Sep 2012
When I'm using my RSS reader
FeedMe,
I normally check every feed every day. But that can be wasteful: some
feeds, like World Wide Words,
only update once a week.
A few feeds update even less often, like serialized books that come
out once a month or whenever the author has time to add something new.
So I decided it would be nice to add some "when" logic to FeedMe,
so I could add when = Sat in the config section for World
Wide Words and have it only update once a week.
That sounded trivial -- a little python parsing logic to tell days from
numbers, a few calls to time.localtime() and I was done.
Except of course I wasn't. Because sometimes, like when I'm on vacation,
I don't always update every day. If I missed a Saturday, then I'd
never see that week's edition of World Wide Words. And that would
be terrible!
So what I really needed was a way to ask, "Has a Saturday occurred
(including today) since the last time I ran feedme?"
The last time I ran feedme is easy to determine: it's in the last
modified date of the cache file. Or, in more Pythonic terms, it's
statbuf = os.stat(cachefile).st_mtime. And of course
I can get the current time with time.localtime().
But how do I figure out whether a given week or month day falls
between those two dates?
I'm sure this particular wheel has been invented many times. There's
probably even a nifty Python library somewhere to do it. But how
do you google for that? I tried to think of keywords and found nothing.
So I went for a nice walk in the redwoods and thought about it for a bit,
and came up with a solution.
Turns out for the week day case, you can just use modular arithmetic:
if (weekday_2 - target_weekday) % 7 < (weekday_2 - weekday_1)
then the day does indeed fall between the two dates.
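Here is a minimal sketch of that test using datetime.date objects. It is not the code from falls_between.py, and it makes one assumption worth flagging: the right-hand side of the comparison is taken to be the number of days elapsed between the two dates, not a difference of weekday numbers. The function name is invented, and weekdays are numbered the way date.weekday() does it (Monday is 0, Saturday is 5).

```python
from datetime import date


def weekday_falls_between(target_weekday, d1, d2):
    """True if target_weekday occurs after d1, up to and including d2."""
    elapsed = (d2 - d1).days
    # How many days back from d2 was the most recent target weekday?
    days_back = (d2.weekday() - target_weekday) % 7
    return days_back < elapsed


SATURDAY = 5
# Last run on Thursday Sept 13, 2012; today is Wednesday Sept 19, 2012.
print(weekday_falls_between(SATURDAY, date(2012, 9, 13), date(2012, 9, 19)))
# Last run on Sunday Sept 16, 2012: no Saturday has gone by since then.
print(weekday_falls_between(SATURDAY, date(2012, 9, 16), date(2012, 9, 19)))
```

Because d1 itself is excluded and d2 is included, a feed that was already fetched on a Saturday isn't fetched again on Sunday, but a run on Saturday does pick it up.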
Things are a little more complicated for the day of the month, though,
because you don't know whether you need mod 30 or 31 or 29 or 28,
so you either have to make your own table, or import the calendar module
just so you can call calendar.monthrange().
I decided it was easier to use logic:
if the difference between the two dates is
greater than 31, then it definitely includes any month day. Otherwise,
check whether they're in the same month or not, and do greater than/less
than comparisons on the three dates.
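The comparison logic gets fiddly around month boundaries (a run on January 31 followed by one on March 2 spans all of February in only 30 days), so the sketch below deliberately swaps in a simpler brute-force check: step through each day after the first date, up to and including the second, and see whether any of them has the target day of the month. This is a different technique from the one described above and is not the code in falls_between.py; the function name is invented.

```python
from datetime import date, timedelta


def monthday_falls_between(target_day, d1, d2):
    """True if some date after d1, up to and including d2, has day == target_day."""
    elapsed = (d2 - d1).days
    # any() stops at the first match, so long gaps rarely cost much.
    return any((d1 + timedelta(days=n)).day == target_day
               for n in range(1, elapsed + 1))


# Last run Sept 5, 2012; today is Sept 19, 2012.
print(monthday_falls_between(19, date(2012, 9, 5), date(2012, 9, 19)))
print(monthday_falls_between(20, date(2012, 9, 5), date(2012, 9, 19)))
# A gap that swallows all of February.
print(monthday_falls_between(15, date(2013, 1, 31), date(2013, 3, 2)))
```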
Throw in a bunch of conversion to make it easy to call, and a bunch of
unit tests to make sure everything works and my later tweaks don't
break anything, and I had a nice function I could call from Feedme.
falls_between.py
on github
Tags: programming, python
[
21:07 Sep 19, 2012
]
Wed, 05 Sep 2012
I decided to give myself a birthday present and release version 0.9.8 of
Pho, my image viewer,
at long last.
I've been using it essentially unchanged for many months now,
occasionally tweaking things or fixing minor bugs ... but I haven't
run into any bugs in quite a while, and think I've fixed all the
pending ones. Been meaning to make a release for a long time, but
somehow I keep getting sidetracked and forgetting about it.
This should rationalize the version number again ... the official
releases have been 0.9.7-preN forever, but there was an unofficial
0.9.7 and even a 0.9.8 that snuck in along with some patches I got
from David Gardner. It's been confusing. So now it's officially
0.9.8, and any future versions will start with 0.9.9, and we might
even see a 1.0 one of these days. (I suppose it's time -- Pho is ten
years old!)
So here it is: Pho 0.9.8.
I think it's working well. If you're already a Pho user, or if
you want a lightweight image viewer that's also good at triaging and
annotating large batches of images, you might want to take a look.
Tags: programming, pho, image viewer
[
12:36 Sep 05, 2012
]