Shallow Thoughts : tags : matplotlib
Akkana's Musings on Open Source Computing and Technology, Science, and Nature.
Tue, 14 Jan 2020
A recent article on Pharyngula blog,
You ain’t no fortunate one,
discussed US wars, specifically the question: depending on when you were born,
for how much of your life has the US been at war?
It was an interesting bunch of plots, constantly increasing until
for people born after 2001, the percentage hit 100%.
Really? That didn't seem right.
Wasn't the US in a lot of wars in the past?
When I was growing up, it seemed like we were always getting into wars,
poking our nose into other countries' business.
Can it really be true that we're so much more warlike now than we used to be?
It made me want to see a plot of when the wars were, beyond Pharyngula's
percentage-of-life pie charts. So I went looking for data.
The best source of war dates I could find was
American Involvement in Wars from Colonial Times to the Present.
I pasted that data into a table and reformatted it to turn it into Python
data, and used matplotlib to plot it as a Gantt chart.
(Script here:
us-wars.py.)
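For anyone who wants to try the same kind of chart, here is a minimal sketch of a Gantt-style plot using matplotlib's barh. It is not the us-wars.py script: the function name plot_wars is invented for this example, and the three date ranges are just well-known years of US involvement used for illustration, not the Thoughtco table.

```python
import matplotlib.pyplot as plt

# Illustrative subset only: (name, first year, last year) of US involvement.
WARS = [
    ("Civil War", 1861, 1865),
    ("World War I", 1917, 1918),
    ("World War II", 1941, 1945),
]

def plot_wars(wars):
    """Draw one horizontal bar per war, spanning its start and end years."""
    fig, ax = plt.subplots(figsize=(8, 2.5))
    for row, (name, start, end) in enumerate(wars):
        # left= sets where the bar begins; the bar width is the duration.
        ax.barh(row, end - start, left=start, height=0.6)
    ax.set_yticks(range(len(wars)))
    ax.set_yticklabels([name for name, _, _ in wars])
    ax.set_xlabel("Year")
    fig.tight_layout()
    return fig, ax

fig, ax = plot_wars(WARS)
# Call plt.show() or fig.savefig("wars.png") to look at the result.
```

Each war gets its own row, so overlapping wars show up as bars stacked above one another over the same years.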
Sure enough. If that Thoughtco page with the war dates is even close to
accurate -- it could be biased toward listing recent conflicts,
but I didn't find a more authoritative source for war dates --
the prevalence of war took a major jump in 2001.
We used to have big gaps between wars, and except for Vietnam,
the wars we were involved with were short, mostly less than a year each.
But starting in 2001, we've been involved in a never-ending series of
overlapping wars unprecedented in US history.
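The percentage-of-life question can be sketched in a few lines. Because wars can overlap, the intervals have to be merged first so that no year is counted twice. This is only an illustration under simplifying assumptions: whole-year granularity, intervals treated as half-open (start, end), and function names that are made up for this example.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def fraction_at_war(birth, now, intervals):
    """Fraction of the span birth..now covered by at least one interval."""
    covered = 0
    for start, end in merge_intervals(intervals):
        lo = max(start, birth)
        hi = min(end, now)
        if hi > lo:
            covered += hi - lo
    return covered / (now - birth)

# Abstract numbers, not real war dates: two overlapping intervals
# cover years 1 through 6 of a 10-year span.
print(fraction_at_war(0, 10, [(1, 4), (3, 6)]))
```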
The Thoughtco page had wars going back to 1675, so I also made a plot
showing all of them (click for the full-sized version).
It's no different: short wars, not overlapping, all the way back
to before the revolution. We've seen nothing in the past like the
current warmongering.
Depressing. Climate change isn't the only phenomenon showing
a modern "hockey stick" curve, it seems.
Tags: politics, programming, python, matplotlib
[
12:25 Jan 14, 2020
More politics |
permalink to this entry |
]
Thu, 30 May 2019
A friend and I were talking about temperature curves: specifically,
the way the temperature sinks in the evening but then frequently rises
again before it really starts cooling off.
I thought it would be fun to plot the curve of temperature as a
function of time over successive days, as a 3-D plot. I knew matplotlib
had a way to do 3-D plots, but I'd never actually generated one.
Well, it turns out there are lots of examples, but they all start by
generating mysterious data blobs, and none of them explain the
structure of the data they're using, and the documentation has
mysterious parameters like "zs" that aren't explained anywhere. So
getting something that worked was a fiddly process. Creating a color
version, to distinguish the graphs better, was even more fiddly.
So I wrote an example that I hope will make it a little clearer for
anyone trying to use this library. It can plot using just lines:
... or it can plot in color, cycling colors manually because by default
matplotlib makes adjacent colors similar, exactly the opposite of what
you'd want:
Here's the demo:
multiplot3d.py
on GitHub.
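As a rough illustration of the data structure involved (this is a separate sketch, not the multiplot3d.py demo): each day is an ordinary 2-D curve of hour versus temperature, and the zs and zdir arguments tell Axes3D where to place that curve in the third dimension. The data below is synthetic, and plot_daily_curves and the color list are names chosen for this example.

```python
import math
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)

def plot_daily_curves(days, colors=("red", "blue", "green", "orange")):
    """days is a list of lists: one list of hourly values per day."""
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    for daynum, temps in enumerate(days):
        hours = list(range(len(temps)))
        # The first two arguments are the 2-D curve (hour, temperature).
        # zdir="y" names the 3-D axis the curves are stacked along, and
        # zs is this curve's position on that axis (here, the day number).
        ax.plot(hours, temps, zs=daynum, zdir="y",
                color=colors[daynum % len(colors)])  # cycle colors by hand
    ax.set_xlabel("Hour")
    ax.set_ylabel("Day")
    ax.set_zlabel("Temperature")
    return fig, ax

# Synthetic data: one sine wave per day, shifted a little each day.
days = [[60 + 15 * math.sin((h - 9 + d) * math.pi / 12) for h in range(24)]
        for d in range(5)]
fig, ax = plot_daily_curves(days)
# Call plt.show() to rotate the plot interactively.
```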
... Except there's a Bug
All is not perfect. Axes3D gets a bit confused sometimes about which
layer is supposed to be in front of which other layer. You can see that
on the two plots: in both cases, the fourth and fifth layers from the
front are reversed, so the fifth layer is drawn in front of the fourth
layer. I haven't yet found anyone in the matplotlib organization who
seems to know much about Axes3D; eventually I'll file a bug but I want
to write a shorter, clearer test case to illustrate the problem.
Still, even with the bugs it's a useful technique to know.
Tags: python, programming, data, matplotlib
[
09:57 May 30, 2019
More programming |
permalink to this entry |
]
Thu, 19 Jan 2017
In my article on
Plotting election (and other county-level) data with Python Basemap,
I used ESRI shapefiles for both states and counties.
But one of the election data files I found, OpenDataSoft's
USA 2016 Presidential Election by county
had embedded county shapes,
available either as CSV or as GeoJSON. (I used the CSV version, but
inside the CSV the geo data are encoded as JSON so you'll need JSON
decoding either way. But that's no problem.)
Just about all the documentation
I found on coloring shapes in Basemap assumed that the shapes were
defined as ESRI shapefiles. How do you draw shapes if you have
latitude/longitude data in a more open format?
As it turns out, it's quite easy, but it took a fair amount of poking
around inside Basemap to figure out how it worked.
In the loop over counties in the US in the previous article,
the end goal was to create a matplotlib Polygon
and use that to add a Basemap patch.
But matplotlib's Polygon wants map coordinates, not latitude/longitude.
If m is your basemap (i.e. you created the map with
m = Basemap( ... )), you can translate coordinates like this:
(mapx, mapy) = m(longitude, latitude)
So once you have a region as a list of (longitude, latitude) coordinate
pairs, you can create a colored, shaped patch like this:
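from matplotlib.patches import Polygon

# Convert each [longitude, latitude] pair to map coordinates, in place.
# (Each pair must be a mutable list, not a tuple.)
for coord_pair in region:
    coord_pair[0], coord_pair[1] = m(coord_pair[0], coord_pair[1])
poly = Polygon(region, facecolor=color, edgecolor=color)
ax.add_patch(poly)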
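If the coordinate pairs are tuples, or you would rather not overwrite the original longitude/latitude list, a non-mutating variant is easy to write. This is only a sketch: project_region is a name made up here, and fake_projection is a stand-in so the example runs without Basemap installed. It does not compute what Basemap computes, and the sample points are arbitrary.

```python
def project_region(project, region):
    """Return a new list of (x, y) map coordinates for (lon, lat) pairs.

    project is any callable taking (lon, lat) and returning (x, y).
    A Basemap instance m can be passed directly, since m(lon, lat)
    does exactly that.
    """
    return [tuple(project(lon, lat)) for lon, lat in region]

def fake_projection(lon, lat):
    # Stand-in for demonstration only; NOT a real map projection.
    return (lon * 1000.0, lat * 1000.0)

region = [(-106.3, 35.9), (-106.2, 35.9), (-106.2, 35.8)]  # arbitrary points
mapped = project_region(fake_projection, region)
print(mapped)
```

The resulting list can be handed to Polygon the same way as the converted region list.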
Working with the OpenDataSoft data file was actually a little harder than
that, because the list of coordinates was JSON-encoded inside the CSV file,
so I had to decode it with json.loads(county["Geo Shape"]).
Once decoded, it had some counties as a Polygon list of
lists (allowing for discontiguous outlines), and others as
a MultiPolygon list of list of lists (I'm not sure why,
since the Polygon format already allows for discontiguous boundaries).
And a few counties were missing, so there were blanks on the map,
which show up as white patches in this screenshot.
The counties missing data either have inconsistent formatting in
their coordinate lists, or they have only one coordinate pair, and
they include Washington, Virginia; Roane, Tennessee; Schley, Georgia;
Terrell, Georgia; Marshall, Alabama; Williamsburg, Virginia; and Pike,
Georgia; plus Oglala Lakota (which is clearly meant to be Oglala,
South Dakota), and all of Alaska.
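Here is one way the Polygon and MultiPolygon cases might be flattened into a single list of outlines. It is a sketch that assumes the decoded "Geo Shape" value follows the standard GeoJSON layout, a dict with "type" and "coordinates" keys; I have not checked every row of the file against that assumption. The function name is made up, and outlines with fewer than three points are skipped instead of being drawn.

```python
import json

def rings_from_geo_shape(geo_shape_json):
    """Return a flat list of rings; each ring is a list of [lon, lat] pairs."""
    shape = json.loads(geo_shape_json)
    if shape["type"] == "Polygon":
        # A Polygon is a list of rings: wrap it so both cases look alike.
        polygons = [shape["coordinates"]]
    elif shape["type"] == "MultiPolygon":
        # A MultiPolygon is a list of Polygons, i.e. a list of lists of rings.
        polygons = shape["coordinates"]
    else:
        return []
    rings = []
    for polygon in polygons:
        for ring in polygon:
            if len(ring) >= 3:      # too few points to be a drawable shape
                rings.append(ring)
    return rings
```

With the CSV reader from the article, the call would be rings_from_geo_shape(county["Geo Shape"]), and each returned ring can be converted to map coordinates and turned into its own Polygon patch.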
One thing about crunching data files
from the internet is that there are always a few special cases you
have to code around. And I could have gotten those coordinates from
the census shapefiles; but as long as I needed the census shapefile
anyway, why use the CSV shapes at all? In this particular case, it
makes more sense to use the shapefiles from the Census.
Still, I'm glad to have learned how to use arbitrary coordinates as shapes,
freeing me from the proprietary and annoying ESRI shapefile format.
The code:
Blue-red
map using CSV with embedded county shapes
Tags: elections, politics, visualization, programming, data, open data, matplotlib
[
09:36 Jan 19, 2017
More programming |
permalink to this entry |
]
Sat, 14 Jan 2017
After my
arduous
search for open 2016 election data by county, as a first test I
wanted one of those red-blue-purple charts of how Democratic or
Republican each county's vote was.
I used the Basemap package for plotting.
It used to be part of matplotlib, but it's been split off into its
own toolkit, grouped under mpl_toolkits: on Debian, it's
available as python-mpltoolkits.basemap, or you can find
Basemap on GitHub.
It's easiest to start with the
fillstates.py
example that shows
how to draw a US map with different states colored differently.
You'll need the three shapefiles (because of ESRI's silly shapefile format):
st99_d00.dbf, st99_d00.shp and st99_d00.shx, available
in the same examples directory.
Of course, to plot counties, you need county shapefiles as well.
The US Census has
county
shapefiles at several different resolutions (I used the 500k version).
Then you can plot state and county outlines like this:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

def draw_us_map():
    # Set the lower left and upper right limits of the bounding box:
    lllon = -119
    urlon = -64
    lllat = 22.0
    urlat = 50.5
    # and calculate a centerpoint, needed for the projection:
    centerlon = float(lllon + urlon) / 2.0
    centerlat = float(lllat + urlat) / 2.0

    m = Basemap(resolution='i',  # crude, low, intermediate, high, full
                llcrnrlon = lllon, urcrnrlon = urlon,
                lon_0 = centerlon,
                llcrnrlat = lllat, urcrnrlat = urlat,
                lat_0 = centerlat,
                projection='tmerc')

    # Read state boundaries.
    shp_info = m.readshapefile('st99_d00', 'states',
                               drawbounds=True, color='lightgrey')

    # Read county boundaries
    shp_info = m.readshapefile('cb_2015_us_county_500k',
                               'counties',
                               drawbounds=True)

if __name__ == "__main__":
    draw_us_map()
    plt.title('US Counties')
    # Get rid of some of the extraneous whitespace matplotlib loves to use.
    plt.tight_layout(pad=0, w_pad=0, h_pad=0)
    plt.show()
Accessing the state and county data after reading shapefiles
Great. Now that we've plotted all the states and counties, how do we
get a list of them, so that when I read out "Santa Clara, CA" from
the data I'm trying to plot, I know which map object to color?
After calling readshapefile('st99_d00', 'states'), m has two new
members, both lists: m.states and m.states_info.
m.states_info[] is a list of dicts mirroring what was in the shapefile.
For the Census state list, the useful keys are NAME, AREA, and PERIMETER.
There's also STATE, which is an integer (not restricted to 1 through 50)
but I'll get to that.
If you want the shape for, say, California,
iterate through m.states_info[] looking for the one where
m.states_info[i]["NAME"] == "California".
Note i; the shape coordinates will be in m.states[i]
(in basemap map coordinates, not latitude/longitude).
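That search can be written as a small helper. The sketch below uses a hand-made stand-in list in place of m.states_info so it runs without Basemap, and shape_indices is a name invented for the example. It returns every matching index, not just the first, because a state with islands can appear more than once.

```python
def shape_indices(info_list, name):
    """Return all indices i where info_list[i]["NAME"] equals name."""
    return [i for i, info in enumerate(info_list) if info["NAME"] == name]

# Stand-in for m.states_info: only the NAME key matters here.
states_info = [
    {"NAME": "California"},
    {"NAME": "Nevada"},
    {"NAME": "California"},   # e.g. a separate island shape
]

for i in shape_indices(states_info, "California"):
    # With a real map, m.states[i] would hold this shape's coordinates.
    print("California shape at index", i)
```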
Correlating states and counties in Census shapefiles
County data is similar, with county names in
m.counties_info[i]["NAME"].
Remember that STATE integer? Each county has a STATEFP,
m.counties_info[i]["STATEFP"], that matches some state's
m.states_info[i]["STATE"].
But doing that search every time would be slow. So right after calling
readshapefile for the states, I make a table of states. Empirically,
STATE in the state list goes up to 72. Why 72? Shrug.
MAXSTATEFP = 73
states = [None] * MAXSTATEFP
for state in m.states_info:
    statefp = int(state["STATE"])
    # Many states have multiple entries in m.states (because of islands).
    # Only add it once.
    if not states[statefp]:
        states[statefp] = state["NAME"]
That'll make it easy to look up a county's state name quickly when
we're looping through all the counties.
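An alternative that avoids the magic number is to key a dict on the state code, so it doesn't matter how high the codes go. This is a sketch of that variation, not what the final script does; build_state_table is a made-up name, and the stand-in records just mimic the STATE and NAME keys described above.

```python
def build_state_table(states_info):
    """Map integer state code -> state name, keeping the first name seen."""
    table = {}
    for state in states_info:
        # setdefault only stores the name the first time a code appears,
        # so repeated entries (islands) are harmless.
        table.setdefault(int(state["STATE"]), state["NAME"])
    return table

# Stand-in for m.states_info.
states_info = [
    {"STATE": "06", "NAME": "California"},
    {"STATE": "06", "NAME": "California"},
    {"STATE": "72", "NAME": "Puerto Rico"},
]
state_table = build_state_table(states_info)

# dict.get returns None for a code with no matching state,
# where indexing a fixed-size list would raise IndexError.
print(state_table.get(int("06")), state_table.get(78))
```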
Calculating colors for each county
Time to figure out the colors from the Deleetdk election results CSV file.
Reading lines from the CSV file into a dictionary is superficially easy enough:
import csv

fp = open("tidy_data.csv")
reader = csv.DictReader(fp)
# Make a dictionary of all "county, state" and their colors.
county_colors = {}
for county in reader:
    # What color is this county?
    pop = float(county["votes"])
    blue = float(county["results.clintonh"])/pop
    # Red is the Republican share of the votes cast, not the population.
    # (Column name assumed to parallel results.clintonh.)
    red = float(county["results.trumpd"])/pop
    county_colors["%s, %s" % (county["name"], county["State"])] \
        = (red, 0, blue)
But in practice, that wasn't good enough, because the county names
in the Deleetdk data didn't always match the official Census county names.
Fuzzy matches
For instance, the CSV file had no results for Alaska or Puerto Rico,
so I had to skip those. Non-ASCII characters were a problem:
"Doña Ana" county in the census data was "Dona Ana" in the CSV.
I had to strip off " County", " Borough" and similar terms:
"St Louis" in the census data was "St. Louis County" in the CSV.
Some names were capitalized differently, like PLYMOUTH vs. Plymouth,
or Lac Qui Parle vs. Lac qui Parle.
And some names were just different, like "Jeff Davis" vs. "Jefferson Davis".
To get around that I used SequenceMatcher to look for fuzzy matches
when I couldn't find an exact match:
from difflib import SequenceMatcher

def fuzzy_find(s, slist):
    '''Try to find a fuzzy match for s in slist.
    '''
    best_ratio = -1
    best_match = None
    ls = s.lower()
    for ss in slist:
        r = SequenceMatcher(None, ls, ss.lower()).ratio()
        if r > best_ratio:
            best_ratio = r
            best_match = ss
    if best_ratio > .75:
        return best_match
    return None
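Several of the mismatches listed above (accents, capitalization, a trailing " County" or " Borough", the period in "St.") are mechanical enough that the names could be normalized before comparing, leaving fuzzy matching for the really different ones like "Jeff Davis". Here's a sketch of that idea using only the standard library. The function name and suffix list are assumptions for illustration, not part of the final script.

```python
import unicodedata

# Suffixes to strip; extend this with whatever "similar terms" the data has.
SUFFIXES = (" county", " borough")

def normalize_county_name(name):
    """Lowercase, drop accents and periods, and strip a trailing suffix."""
    # NFKD splits an accented letter into base letter + combining mark,
    # and then the combining marks are dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = plain.lower().replace(".", "").strip()
    for suffix in SUFFIXES:
        if cleaned.endswith(suffix):
            cleaned = cleaned[:-len(suffix)]
            break
    return cleaned

print(normalize_county_name("St. Louis County") == normalize_county_name("St Louis"))
```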
Correlate the county names from the two datasets
It's finally time to loop through the counties in the map to color and
plot them.
Remember STATE vs. STATEFP? It turns out there are a few counties in
the census county shapefile with a STATEFP that doesn't match any
STATE in the state shapefile. Mostly they're in the Virgin Islands
and I don't have election data for them anyway, so I skipped them for now.
I also skipped Puerto Rico and Alaska (no results in the election data)
and counties that had no corresponding state: I'll omit that code here,
but you can see it in the final script, linked at the end.
for i, county in enumerate(m.counties_info):
    countyname = county["NAME"]
    try:
        statename = states[int(county["STATEFP"])]
    except IndexError:
        print(countyname, "has out-of-index statefp of", county["STATEFP"])
        continue
    countystate = "%s, %s" % (countyname, statename)
    try:
        ccolor = county_colors[countystate]
    except KeyError:
        # No exact match; try for a fuzzy match
        fuzzyname = fuzzy_find(countystate, county_colors.keys())
        if fuzzyname:
            ccolor = county_colors[fuzzyname]
            county_colors[countystate] = ccolor
        else:
            print("No match for", countystate)
            continue
    countyseg = m.counties[i]
    poly = Polygon(countyseg, facecolor=ccolor)  # edgecolor="white"
    ax.add_patch(poly)
Moving Hawaii
Finally, although the CSV didn't have results for Alaska, it did have
Hawaii. To display it, you can move it when creating the patches:
countyseg = m.counties[i]
if statename == 'Hawaii':
    # Shift every point of the outline by a fixed offset.
    countyseg = [(x + 5750000, y - 1400000) for (x, y) in countyseg]
poly = Polygon(countyseg, facecolor=ccolor)
ax.add_patch(poly)
The offsets are in map coordinates and are empirical; I fiddled with
them until Hawaii showed up at a reasonable place.
Well, that was a much longer article than I intended. Turns out
it takes a fair amount of code to correlate several datasets and
turn them into a map. But a lot of the work will be applicable
to other datasets.
Full script on GitHub:
Blue-red
map using Census county shapefile
Tags: elections, politics, visualization, programming, python, mapping, data, open data, matplotlib
[
15:10 Jan 14, 2017
More programming |
permalink to this entry |
]
Sun, 11 May 2014
I went to a terrific workshop last week on identifying bird songs.
We listened to recordings of songs from some of the trickier local species,
and discussed the differences and how to remember them. I'm not a serious
birder -- I don't do lists or Big Days or anything like that, and I
dislike getting up at 6am just because the birds do -- but I do try to
identify birds (as well as mammals, reptiles, rocks, geographic
features, and pretty much anything else I see while hiking or just
sitting in the yard) and I've always had trouble remembering their songs.
One of the tools birders use to study bird songs is the sonogram.
It's a plot of frequency (on the vertical axis) and intensity (represented
by color, red being louder) versus time. Looking at a sonogram
you can identify not just how fast a bird trills and whether it calls in
groups of three or five, but whether it's buzzy/rattly (a vertical
line, lots of frequencies at once) or a purer whistle, and whether
each note is ascending or descending.
The class last week included sonograms for the species we studied.
But what about other species? The class didn't cover even all the local
species I'd like to be able to recognize.
I have several collections of bird calls on CD
(which I bought to use in combination with my "tweet" script
-- yes, the name messes up google searches, but my tweet predates Twitter --
a tweet
Python script and
tweet
in HTML for Android).
It would be great to be able to make sonograms from some of those
recordings too.
But a search for Linux sonogram
turned up nothing useful.
Audacity has a spectrogram visualization mode with lots of options, but
none of them seem to result in a usable sonogram, and most discussions
I found on the net agreed that it couldn't do it. There's another
sound editor program called snd which can do sonograms, but it's
fiddly to use and none of the many color schemes produce a sonogram
that I found very readable.
Okay, what about python scripts? Surely that's been done?
I had better luck there. Matplotlib's pylab package has a
specgram()
call that does more or less what I wanted,
and here's
an
example of how to use pylab.specgram().
(That post also has another example using a library called timeside,
but timeside's PyPI package doesn't have any dependency information,
and after playing the old RPM-chase game installing another dependency,
trying it, then installing the next dependency, I gave up.)
The only problem with pylab.specgram()
was that it shows
the full range of the sound, both in time and frequency.
The recordings I was examining can
last a minute or more and go up to 20,000 Hz -- and when pylab tries
to fit that all on the screen, you end up with a plot where the details
are too small to show you anything useful.
You'd think there would be a way for pylab.specgram() to show
only part of the spectrum, but there doesn't seem to be.
I finally found a Stack Overflow discussion where "edited"
gives an excellent
rewritten
version of pylab.specgram which allows setting minimum and maximum
frequency cutoffs. Worked great!
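To give a rough idea of what frequency cutoffs involve, here is a sketch that is not the Stack Overflow rewrite. It computes the spectrogram with matplotlib.mlab.specgram, which returns the spectrum along with the frequency and time bins, then keeps only the rows whose frequency falls inside the requested band. The function name is made up, and the input is a synthetic 3000 Hz tone, not a bird recording.

```python
import numpy as np
from matplotlib import mlab

def band_limited_specgram(samples, rate, fmin, fmax, nfft=256):
    """Return (spectrum, freqs, times) restricted to fmin <= freq <= fmax."""
    spectrum, freqs, times = mlab.specgram(samples, NFFT=nfft, Fs=rate)
    keep = (freqs >= fmin) & (freqs <= fmax)   # boolean mask over frequency rows
    return spectrum[keep, :], freqs[keep], times

# Synthetic test signal: one second of a pure 3000 Hz tone.
rate = 44100
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 3000 * t)

spec, freqs, times = band_limited_specgram(tone, rate, 2000, 5000)
print(spec.shape, freqs.min(), freqs.max())
```

The trimmed arrays can then be drawn with something like plt.pcolormesh(times, freqs, 10 * np.log10(spec)), so the plot's vertical axis covers only the band of interest.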
Then I did some fiddling to allow for analyzing only part of the
recording -- Python's wave package has no way to read in just the first
six seconds of a .wav file, so I had to read in the
whole file, read the data into a numpy array, then take a slice
representing the seconds of the recording I actually wanted.
But now I can plot nice sonograms of any bird song I want to see,
print them out or stick them on my Android device so I can carry them
with me.
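The read-everything-then-slice step described above might look roughly like this. It is a sketch, not the code from my script: it assumes 16-bit PCM samples (it raises an error otherwise), and read_wav_seconds is a name chosen for the example.

```python
import wave
import numpy as np

def read_wav_seconds(path, start_sec, end_sec):
    """Return (samples, framerate) for part of a 16-bit PCM .wav file.

    samples has shape (nframes, nchannels).
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        rate = wav.getframerate()
        nchannels = wav.getnchannels()
        # Read the whole file, as described in the text.
        raw = wav.readframes(wav.getnframes())
    # Interpret the bytes as little-endian 16-bit ints, one row per frame.
    data = np.frombuffer(raw, dtype="<i2").reshape(-1, nchannels)
    # Slice out just the frames between start_sec and end_sec.
    return data[int(start_sec * rate):int(end_sec * rate)], rate
```

For a mono recording, samples[:, 0] is the one-dimensional array to hand to the spectrogram function.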
Update: Oops! I forgot to include a link to the script. Here it is:
Sonograms
in Python.
Tags: programming, python, nature, birds, matplotlib
[
09:17 May 11, 2014
More programming |
permalink to this entry |
]