Shallow Thoughts : tags : data

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 02 Nov 2024

Decoding Specialized FIT files (30 Day Map Challenge Day 2, Lines)

[Screenshot from Specialized bike Android app] I have a new eBike! I'll write about it more before long, but for now, what's relevant to the 30 Day Map Challenge is that it's from Specialized, and if you use the Specialized phone app, it can record all sorts of fun statistics for rides, including GPS coordinates.

Read more ...

Tags: , , , , ,
[ 13:54 Nov 02, 2024    More mapping | permalink to this entry | ]

Mon, 27 Nov 2023

A Smaller, Lighter Dataset: Clip Layers in QGIS

[QGIS screenshot showing a manufactured polygon to clip a river layer]

The dataset I used for mapping fire perimeters is huge: not surprising if it's all historic fires for the US. Classifying it in QGIS gave a warning, and operations were very slow. Here's how to clip a big dataset in QGIS to restrict it to a smaller geographic area.

Read more ...

Tags: , , ,
[ 11:43 Nov 27, 2023    More mapping | permalink to this entry | ]

Fri, 13 May 2022

Mapping Fire Perimeters

[Fire map from mapping support.com] I've been using the Wildland Fires map from MappingSupport.com to keep an eye on the Cerro Pelado fire and the larger (though more distant from me) Hermit's Peak/Calf Canyon fires raging in the Pecos.

It's an excellent map, but it's a little sporadic in whether it shows the fire perimeter. In any case, as a data junkie, I wanted to know how to get the data and make my own display, maybe for a quick viewer that I can pop up when I sign on in the morning.

Also, Los Alamos County, on its Cerro Pelado Information page, has a map showing the "Go" lines (if the fire crosses these lines, we have to evacuate) for Los Alamos and White Rock and I'd like to be able to view those lines on the same map with the fire perimeter and hot spots.

Read more ...

Tags: , , , , ,
[ 10:46 May 13, 2022    More mapping | permalink to this entry | ]

Mon, 06 Apr 2020

F is for Food Waste

I keep seeing people claim that 40% of consumer food in the US is thrown away uneaten, or hear statistics like 20 pounds of wasted food per person per month.

I simply don't believe it.

There's no question that some food is wasted. It's hard to avoid having that big watermelon go bad before you have a chance to finish it all, especially when you're a one- or two-person household and the market won't sell you a quarter pound of cherries or half a pound of ground beef. And then there's all the stuff you don't want to eat: the bones, the fat, the banana peels and apple cores, the artichoke leaves and corn cobs.

But even if you count all that ... 40 percent? 2/3 of a pound per day per person? And that's supposed to be an average -- so if Dave and I are throwing out a few ounces, somebody else would have to be throwing out multiple pounds a day. It just doesn't seem possible. Who would do that?

When you see people quoting a surprising number -- especially when you see the same big number quoted by lots of people -- you should always ask yourself the source of the number.

Read more ...

Tags: ,
[ 20:08 Apr 06, 2020    More science | permalink to this entry | ]

Mon, 30 Mar 2020

D is for Devilish Place Names

It was surprisingly hard to come up with a "D" to write about, without descending into Data geekery (always a temptation). Though you may decide I've done that anyway with today's topic.

Out for a scenic drive to shake off some of the house-bound cobwebs, I got to thinking about how so many places are named after the Devil. California was full of them -- the Devil's Punchbowl, the Devil's Postpile, and so forth -- and nearly every western National Park has at least one devilish feature.

How many are there really? Happily, there's an easy way to answer questions like this: the Geographic Names page on the USGS website, which hosts the Geographic Names Information System (GNIS). You can download entire place name files for a state, or you can search for place name matches at: GNIS Feature Search.

When I searched there for "devil", I got 1883 hits -- but many of them don't actually include the word "Devil". What, are they taking lessons from Google about searching for things that don't actually match the search terms?

I decided I wanted to download the results so I could count them more easily. The page offers View & Print all or Save as pipe "|" delimited file. I chose to save the file.

Read more ...

Tags: , , , ,
[ 16:30 Mar 30, 2020    More linux/cmdline | permalink to this entry | ]

Thu, 30 May 2019

Plotting a Sequence of Graphs in Matplotlib 3D

A friend and I were talking about temperature curves: specifically, the way the temperature sinks in the evening but then frequently rises again before it really starts cooling off.

I thought it would be fun to plot the curve of temperature as a function of time over successive days, as a 3-D plot. I knew matplotlib had a way to do 3D plots, but I've never actually generated one.

Well, it turns out there are lots of examples, but they all start by generating mysterious data blobs, and none of them explain the structure of the data they're using, and the documentation has mysterious parameters like "zs" that aren't explained anywhere. So getting something that worked was a fiddly process. Creating a color version, to distinguish the graphs better, was even more fiddly.

[Plotting a series of graphs using matplotlib 3d] So I wrote an example that I hope will make it a little clearer for anyone trying to use this library. It can plot using just lines:

[Plotting a series of graphs using matplotlib 3d, color option] ... or it can plot in color, cycling colors manually because by default matplotlib makes adjacent colors similar, exactly the opposite of what you'd want:

Here's the demo: multiplot3d.py on GitHub.

... Except there's a Bug

All is not perfect. Axes3D gets a bit confused sometimes about which layer is supposed to be in front of which other layer. You can see that on the two plots: in both cases, the fourth and fifth layers from the front are reversed, so the fifth layer is drawn in front of the fourth layer. I haven't yet found anyone in the matplotlib organization who seems to know much about Axes3D; eventually I'll file a bug but I want to write a shorter, clearer test case to illustrate the problem. Still, even with the bugs it's a useful technique to know.

Tags: , , ,
[ 09:57 May 30, 2019    More programming | permalink to this entry | ]

Mon, 14 May 2018

Plotting the Jet Stream, or Other Winds, with ECMWF Data

I've been trying to learn more about weather from a friend who used to work in the field -- in particular, New Mexico's notoriously windy spring. One of the reasons behind our spring winds relates to the location of the jet stream. But I couldn't find many good references showing how the jet stream moves throughout the year. So I decided to try to plot it myself -- if I could find the data. Getting weather data can surprisingly hard.

In my search, I stumbled across Geert Barentsen's excellent Annual variations in the jet stream (video). It wasn't quite what I wanted -- it shows the position of the jet stream in December in successive years -- but the important thing is that he provides a Python script on GitHub that shows how he produced his beautiful animation.

[Sample jet steam image]

Well -- mostly. It turns out his data sources are no longer available, and he didn't go into a lot of detail on where he got his data, only saying that it was from the ECMWF ERA re-analysis model (with a link that's now 404). That led me on a merry chase through the ECMWF website trying to figure out which part of which database I needed. ECMWF has lots of publically available databases (and even more) and they even have Python libraries to access them; and they even have a lot of documentation, but somehow none of the documentation addresses questions like which database includes which variables and how to find and fetch the data you're after, and a lot of the sample code doesn't actually work. I ended up using the "ERA Interim, Daily" dataset and requesting data for only specific times and only the variables and pressure levels I was interested in. It's a great source of data once you figure out how to request it.

Sign up for an ECMWF API Key

Access ECMWF Public Datasets (there's also Access MARS and I'm not sure what the difference is), which has links you can click on to register for an API key.

Once you get the email with your initial password, log in using the URL in the email, and change the password. That gave me a "next" button that, when I clicked it, took me to a page warning me that the page was obsolete and I should update whatever bookmark I had used to get there. That page also doesn't offer a link to the new page where you can get your key details, so go here: Your API key. The API Key page gives you some lines you can paste into ~/.ecmwfapirc.

You'll also have to accept the license terms for the databases you want to use.

Install the Python API

That sets you up to use the ECMWF api. They have a Web API and a Python library, plus some other Python packages, but after struggling with a bunch of Magics tutorial examples that mostly crashed or couldn't find data, I decided I was better off sticking to the basic Python downloader API and plotting the results with Matplotlib.

The Python data-fetching API works well. To install it, activate your preferred Python virtualenv or whatever you use for pip packages, then run the pip command shown at Web API Downloads (under "Click here to see the installation/update instructions..."). As always with pip packages, you'll have to decide on a Python version (they support both 2 and 3) and whether to use a virtualenv, the much-disrecommended sudo pip, pip3, etc. I used pip3 in a virtualenv and it worked fine.

Specify a dataset and parameters

That's great, but how do you know which dataset you want to load?

There doesn't seem to be anything that just lists which datasets have which variables. The only way I found is to go to the Web API page for a particular dataset to see the form where you can request different variables. For instance, I ended up using the "interim-full-daily" database, where you can choose date ranges and lists of parameters. There are more choices in the sidebar: for instance, clicking on "Pressure levels" lets you choose from a list of barometric pressures ranging from 1000 all the way down to 1. No units are specified, but they're millibars, also known as hectoPascals (hPa): 1000 is more or less the pressure at ground level, 250 is roughly where the jet stream is, and Los Alamos is roughly at 775 hPa (you can find charts of pressure vs. altitude on the web).

When you go to any of the Web API pages, it will show you a dialog suggesting you read about Data retrieval efficiency, which you should definitely do if you're expecting to request a lot of data, then click on the details for the database you're using to find out how data is grouped in "tape files". For instance, in the ERA-interim database, tapes are grouped by date, so if you're requesting multiple parameters for multiple months, request all the parameters for a given month together, rather than making one request for level 250, another request for level 1000, etc.

Once you've checked the boxes for the data you want, you can fetch the data via the web interface, or click on "View the MARS request" to get parameters you can plug into a Python script.

If you choose the Python script option as I did, you can start with the basic data retrieval example. Use the second example, the one that uses 'format' : "netcdf", which will (eventually) give you a file ending in .nc.

Requesting a specific area

You can request only a limited area,

"area": "75/-20/10/60",
but they're not very forthcoming on the syntax of that, and it's particularly confusing since "75/-20/10/60" supposedly means "Europe". It's hard to figure how those numbers as longitudes and latitudes correspond to Europe, which doesn't go down to 10 degrees latitude, let alone -20 degrees. The Post-processing keywords page gives more information: it's North/West/South/East, which still makes no sense for Europe, until you expand the Area examples tab on that page and find out that by "Europe" they mean Europe plus Saudi Arabia and most of North Africa.

Using the data: What's in it?

Once you have the data file, assuming you requested data in netcdf format, you can parse the .nc file with the netCDF4 Python module -- available as Debian package "python3-netcdf4", or via pip -- to read that file:

import netCDF4

data = netCDF4.Dataset('filename.nc')

But what's in that Dataset? Try running the preceding two lines in the interactive Python shell, then:

>>> for key in data.variables:
...   print(key)
... 
longitude
latitude
level
time
w
vo
u
v

You can find out more about a parameter, like its units, type, and shape (array dimensions). Let's look at "level":

>>> data['level']
<class 'netCDF4._netCDF4.Variable'>
int32 level(level)
    units: millibars
    long_name: pressure_level
unlimited dimensions: 
current shape = (3,)
filling on, default _FillValue of -2147483647 used

>>> data['level'][:]
array([ 250,  775, 1000], dtype=int32)

>>> type(data['level'][:])
<class 'numpy.ndarray'>

Levels has shape (3,): it's a one-dimensional array with three elements: 250, 775 and 1000. Those are the three levels I requested from the web API and in my Python script). The units are millibars.

More complicated variables

How about something more complicated? u and v are the two components of wind speed.

>>> data['u']
<class 'netCDF4._netCDF4.Variable'>
int16 u(time, level, latitude, longitude)
    scale_factor: 0.002161405503194121
    add_offset: 30.095301438361684
    _FillValue: -32767
    missing_value: -32767
    units: m s**-1
    long_name: U component of wind
    standard_name: eastward_wind
unlimited dimensions: time
current shape = (30, 3, 241, 480)
filling on
u (v is the same) has a shape of (30, 3, 241, 480): it's a 4-dimensional array. Why? Looking at the numbers in the shape gives a clue. The second dimension has 3 rows: they correspond to the three levels, because there's a wind speed at every level. The first dimension has 30 rows: it corresponds to the dates I requested (the month of April 2015). I can verify that:
>>> data['time'].shape
(30,)

Sure enough, there are 30 times, so that's what the first dimension of u and v correspond to. The other dimensions, presumably, are latitude and longitude. Let's check that:

>>> data['longitude'].shape
(480,)
>>> data['latitude'].shape
(241,)

Sure enough! So, although it would be nice if it actually told you which dimension corresponded with which parameter, you can probably figure it out. If you're not sure, print the shapes of all the variables and work out which dimensions correspond to what:

>>> for key in data.variables:
...   print(key, data[key].shape)

Iterating over times

data['time'] has all the times for which you have data (30 data points for my initial test of the days in April 2015). The easiest way to plot anything is to iterate over those values:

    timeunits = JSdata.data['time'].units
    cal = JSdata.data['time'].calendar
    for i, t in enumerate(JSdata.data['time']):
        thedate = netCDF4.num2date(t, units=timeunits, calendar=cal)

Then you can use thedate like a datetime, calling thedate.strftime or whatever you need.

So that's how to access your data. All that's left is to plot it -- and in this case I had Geert Barentsen's script to start with, so I just modified it a little to work with slightly changed data format, and then added some argument parsing and runtime options.

Converting to Video

I already wrote about how to take the still images the program produces and turn them into a video: Making Videos (that work in Firefox) from a Series of Images.

However, it turns out ffmpeg can't handle files that are named with timestamps, like jetstream-2017-06-14-250.png. It can only handle one sequential integer. So I thought, what if I removed the dashes from the name, and used names like jetstream-20170614-250.png with %8d? No dice: ffmpeg also has the limitation that the integer can have at most four digits.

So I had to rename my images. A shell command works: I ran this in zsh but I think it should work in bash too.

cd outdir
mkdir moviedir

i=1
for fil in *.png; do
  newname=$(printf "%04d.png" $i)
  ln -s ../$fil moviedir/$newname
  i=$((i+1))
done

ffmpeg -i moviedir/%4d.png -filter:v "setpts=2.5*PTS" -pix_fmt yuv420p jetstream.mp4
The -filter:v "setpts=2.5*PTS" controls the delay between frames -- I'm not clear on the units, but larger numbers have more delay, and I think it's a multiplier, so this is 2.5 times slower than the default.

When I uploaded the video to YouTube, I got a warning, "Your videos will process faster if you encode into a streamable file format." I then spent half a day trying to find a combination of ffmpeg arguments that avoided that warning, and eventually gave up. As far as I can tell, the warning only affects the 20 seconds or so of processing that happens after the 5-10 minutes it takes to upload the video, so I'm not sure it's terribly important.

Results

Here's a video of the jet stream from 2012 to early 2018, and an earlier effort with a much longer 6.0x delay.

And here's the script, updated from the original Barentsen script and with a bunch of command-line options to let you plot different collections of data: jetstream.py on GitHub.

Tags: , , ,
[ 14:18 May 14, 2018    More programming | permalink to this entry | ]

Thu, 19 Jan 2017

Plotting Shapes with Python Basemap wwithout Shapefiles

In my article on Plotting election (and other county-level) data with Python Basemap, I used ESRI shapefiles for both states and counties.

But one of the election data files I found, OpenDataSoft's USA 2016 Presidential Election by county had embedded county shapes, available either as CSV or as GeoJSON. (I used the CSV version, but inside the CSV the geo data are encoded as JSON so you'll need JSON decoding either way. But that's no problem.)

Just about all the documentation I found on coloring shapes in Basemap assumed that the shapes were defined as ESRI shapefiles. How do you draw shapes if you have latitude/longitude data in a more open format?

As it turns out, it's quite easy, but it took a fair amount of poking around inside Basemap to figure out how it worked.

In the loop over counties in the US in the previous article, the end goal was to create a matplotlib Polygon and use that to add a Basemap patch. But matplotlib's Polygon wants map coordinates, not latitude/longitude.

If m is your basemap (i.e. you created the map with m = Basemap( ... ), you can translate coordinates like this:

    (mapx, mapy) = m(longitude, latitude)

So once you have a region as a list of (longitude, latitude) coordinate pairs, you can create a colored, shaped patch like this:

    for coord_pair in region:
        coord_pair[0], coord_pair[1] = m(coord_pair[0], coord_pair[1])
    poly = Polygon(region, facecolor=color, edgecolor=color)
    ax.add_patch(poly)

Working with the OpenDataSoft data file was actually a little harder than that, because the list of coordinates was JSON-encoded inside the CSV file, so I had to decode it with json.loads(county["Geo Shape"]). Once decoded, it had some counties as a Polygonlist of lists (allowing for discontiguous outlines), and others as a MultiPolygonlist of list of lists (I'm not sure why, since the Polygon format already allows for discontiguous boundaries)

[Blue-red-purple 2016 election map]

And a few counties were missing, so there were blanks on the map, which show up as white patches in this screenshot. The counties missing data either have inconsistent formatting in their coordinate lists, or they have only one coordinate pair, and they include Washington, Virginia; Roane, Tennessee; Schley, Georgia; Terrell, Georgia; Marshall, Alabama; Williamsburg, Virginia; and Pike Georgia; plus Oglala Lakota (which is clearly meant to be Oglala, South Dakota), and all of Alaska.

One thing about crunching data files from the internet is that there are always a few special cases you have to code around. And I could have gotten those coordinates from the census shapefiles; but as long as I needed the census shapefile anyway, why use the CSV shapes at all? In this particular case, it makes more sense to use the shapefiles from the Census.

Still, I'm glad to have learned how to use arbitrary coordinates as shapes, freeing me from the proprietary and annoying ESRI shapefile format.

The code: Blue-red map using CSV with embedded county shapes

Tags: , , , , , , , , ,
[ 09:36 Jan 19, 2017    More programming | permalink to this entry | ]

Sat, 14 Jan 2017

Plotting election (and other county-level) data with Python Basemap

After my arduous search for open 2016 election data by county, as a first test I wanted one of those red-blue-purple charts of how Democratic or Republican each county's vote was.

I used the Basemap package for plotting. It used to be part of matplotlib, but it's been split off into its own toolkit, grouped under mpl_toolkits: on Debian, it's available as python-mpltoolkits.basemap, or you can find Basemap on GitHub.

It's easiest to start with the fillstates.py example that shows how to draw a US map with different states colored differently. You'll need the three shapefiles (because of ESRI's silly shapefile format): st99_d00.dbf, st99_d00.shp and st99_d00.shx, available in the same examples directory.

Of course, to plot counties, you need county shapefiles as well. The US Census has county shapefiles at several different resolutions (I used the 500k version). Then you can plot state and counties outlines like this:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

def draw_us_map():
    # Set the lower left and upper right limits of the bounding box:
    lllon = -119
    urlon = -64
    lllat = 22.0
    urlat = 50.5
    # and calculate a centerpoint, needed for the projection:
    centerlon = float(lllon + urlon) / 2.0
    centerlat = float(lllat + urlat) / 2.0

    m = Basemap(resolution='i',  # crude, low, intermediate, high, full
                llcrnrlon = lllon, urcrnrlon = urlon,
                lon_0 = centerlon,
                llcrnrlat = lllat, urcrnrlat = urlat,
                lat_0 = centerlat,
                projection='tmerc')

    # Read state boundaries.
    shp_info = m.readshapefile('st99_d00', 'states',
                               drawbounds=True, color='lightgrey')

    # Read county boundaries
    shp_info = m.readshapefile('cb_2015_us_county_500k',
                               'counties',
                               drawbounds=True)

if __name__ == "__main__":
    draw_us_map()
    plt.title('US Counties')
    # Get rid of some of the extraneous whitespace matplotlib loves to use.
    plt.tight_layout(pad=0, w_pad=0, h_pad=0)
    plt.show()
[Simple map of US county borders]

Accessing the state and county data after reading shapefiles

Great. Now that we've plotted all the states and counties, how do we get a list of them, so that when I read out "Santa Clara, CA" from the data I'm trying to plot, I know which map object to color?

After calling readshapefile('st99_d00', 'states'), m has two new members, both lists: m.states and m.states_info.

m.states_info[] is a list of dicts mirroring what was in the shapefile. For the Census state list, the useful keys are NAME, AREA, and PERIMETER. There's also STATE, which is an integer (not restricted to 1 through 50) but I'll get to that.

If you want the shape for, say, California, iterate through m.states_info[] looking for the one where m.states_info[i]["NAME"] == "California". Note i; the shape coordinates will be in m.states[i]n (in basemap map coordinates, not latitude/longitude).

Correlating states and counties in Census shapefiles

County data is similar, with county names in m.counties_info[i]["NAME"]. Remember that STATE integer? Each county has a STATEFP, m.counties_info[i]["STATEFP"] that matches some state's m.states_info[i]["STATE"].

But doing that search every time would be slow. So right after calling readshapefile for the states, I make a table of states. Empirically, STATE in the state list goes up to 72. Why 72? Shrug.

    MAXSTATEFP = 73
    states = [None] * MAXSTATEFP
    for state in m.states_info:
        statefp = int(state["STATE"])
        # Many states have multiple entries in m.states (because of islands).
        # Only add it once.
        if not states[statefp]:
            states[statefp] = state["NAME"]

That'll make it easy to look up a county's state name quickly when we're looping through all the counties.

Calculating colors for each county

Time to figure out the colors from the Deleetdk election results CSV file. Reading lines from the CSV file into a dictionary is superficially easy enough:

    fp = open("tidy_data.csv")
    reader = csv.DictReader(fp)

    # Make a dictionary of all "county, state" and their colors.
    county_colors = {}
    for county in reader:
        # What color is this county?
        pop = float(county["votes"])
        blue = float(county["results.clintonh"])/pop
        red = float(county["Total.Population"])/pop
        county_colors["%s, %s" % (county["name"], county["State"])] \
            = (red, 0, blue)

But in practice, that wasn't good enough, because the county names in the Deleetdk names didn't always match the official Census county names.

Fuzzy matches

For instance, the CSV file had no results for Alaska or Puerto Rico, so I had to skip those. Non-ASCII characters were a problem: "Doña Ana" county in the census data was "Dona Ana" in the CSV. I had to strip off " County", " Borough" and similar terms: "St Louis" in the census data was "St. Louis County" in the CSV. Some names were capitalized differently, like PLYMOUTH vs. Plymouth, or Lac Qui Parle vs. Lac qui Parle. And some names were just different, like "Jeff Davis" vs. "Jefferson Davis".

To get around that I used SequenceMatcher to look for fuzzy matches when I couldn't find an exact match:

def fuzzy_find(s, slist):
    '''Try to find a fuzzy match for s in slist.
    '''
    best_ratio = -1
    best_match = None

    ls = s.lower()
    for ss in slist:
        r = SequenceMatcher(None, ls, ss.lower()).ratio()
        if r > best_ratio:
            best_ratio = r
            best_match = ss
    if best_ratio > .75:
        return best_match
    return None

Correlate the county names from the two datasets

It's finally time to loop through the counties in the map to color and plot them.

Remember STATE vs. STATEFP? It turns out there are a few counties in the census county shapefile with a STATEFP that doesn't match any STATE in the state shapefile. Mostly they're in the Virgin Islands and I don't have election data for them anyway, so I skipped them for now. I also skipped Puerto Rico and Alaska (no results in the election data) and counties that had no corresponding state: I'll omit that code here, but you can see it in the final script, linked at the end.

    for i, county in enumerate(m.counties_info):
        countyname = county["NAME"]
        try:
            statename = states[int(county["STATEFP"])]
        except IndexError:
            print countyname, "has out-of-index statefp of", county["STATEFP"]
            continue

        countystate = "%s, %s" % (countyname, statename)
        try:
            ccolor = county_colors[countystate]
        except KeyError:
            # No exact match; try for a fuzzy match
            fuzzyname = fuzzy_find(countystate, county_colors.keys())
            if fuzzyname:
                ccolor = county_colors[fuzzyname]
                county_colors[countystate] = ccolor
            else:
                print "No match for", countystate
                continue

        countyseg = m.counties[i]
        poly = Polygon(countyseg, facecolor=ccolor)  # edgecolor="white"
        ax.add_patch(poly)

Moving Hawaii

Finally, although the CSV didn't have results for Alaska, it did have Hawaii. To display it, you can move it when creating the patches:

    countyseg = m.counties[i]
    if statename == 'Hawaii':
        countyseg = list(map(lambda (x,y): (x + 5750000, y-1400000), countyseg))
    poly = Polygon(countyseg, facecolor=countycolor)
    ax.add_patch(poly)
The offsets are in map coordinates and are empirical; I fiddled with them until Hawaii showed up at a reasonable place. [Blue-red-purple 2016 election map]

Well, that was a much longer article than I intended. Turns out it takes a fair amount of code to correlate several datasets and turn them into a map. But a lot of the work will be applicable to other datasets.

Full script on GitHub: Blue-red map using Census county shapefile

Tags: , , , , , , , , , , ,
[ 15:10 Jan 14, 2017    More programming | permalink to this entry | ]

Thu, 12 Jan 2017

Getting Election Data, and Why Open Data is Important

Back in 2012, I got interested in fiddling around with election data as a way to learn about data analysis in Python. So I went searching for results data on the presidential election. And got a surprise: it wasn't available anywhere in the US. After many hours of searching, the only source I ever found was at the UK newspaper, The Guardian.

Surely in 2016, we're better off, right? But when I went looking, I found otherwise. There's still no official source for US election results data; there isn't even a source as reliable as The Guardian this time.

You might think Data.gov would be the place to go for official election results, but no: searching for 2016 election on Data.gov yields nothing remotely useful.

The Federal Election Commission has an election results page, but it only goes up to 2014 and only includes the Senate and House, not presidential elections. Archives.gov has popular vote totals for the 2012 election but not the current one. Maybe in four years, they'll have some data.

After striking out on official US government sites, I searched the web. I found a few sources, none of them even remotely official.

Early on I found Simon Rogers, How to Download County-Level Results Data, which leads to GitHub user tonmcg's County Level Election Results 12-16. It's a comparison of Democratic vs. Republican votes in the 2012 and 2016 elections (I assume that means votes for that party's presidential candidate, though the field names don't make that entirely clear), with no information on third-party candidates.

KidPixo's Presidential Election USA 2016 on GitHub is a little better: the fields make it clear that it's recording votes for Trump and Clinton, but still no third party information. It's also scraped from the New York Times, and it includes the scraping code so you can check it and have some confidence on the source of the data.

Kaggle claims to have election data, but you can't download their datasets or even see what they have without signing up for an account. Ben Hamner has some publically available Kaggle data on GitHub, but only for the primary. I also found several companies selling election data, and several universities that had datasets available for researchers with accounts at that university.

The most complete dataset I found, and the only open one that included third party candidates, was through OpenDataSoft. Like the other two, this data is scraped from the NYT. It has data for all the minor party candidates as well as the majors, plus lots of demographic data for each county in the lower 48, plus Hawaii, but not the territories, and the election data for all the Alaska counties is missing.

You can get it either from a GitHub repo, Deleetdk's USA.county.data (look in inst/ext/tidy_data.csv. If you want a larger version with geographic shape data included, clicking through several other opendatasoft pages eventually gets you to an export page, USA 2016 Presidential Election by county, where you can download CSV, JSON, GeoJSON and other formats.

The OpenDataSoft data file was pretty useful, though it had gaps (for instance, there's no data for Alaska). I was able to make my own red-blue-purple plot of county voting results (I'll write separately about how to do that with python-basemap), and to play around with statistics.

Implications of the lack of open data

But the point my search really brought home: By the time I finally found a workable dataset, I was so sick of the search, and so relieved to find anything at all, that I'd stopped being picky about where the data came from. I had long since given up on finding anything from a known source, like a government site or even a newspaper, and was just looking for data, any data.

And that's not good. It means that a lot of the people doing statistics on elections are using data from unverified sources, probably copied from someone else who claimed to have scraped it, using unknown code, from some post-election web page that likely no longer exists. Is it accurate? There's no way of knowing.

What if someone wanted to spread news and misinformation? There's a hunger for data, particularly on something as important as a US Presidential election. Looking at Google's suggested results and "Searches related to" made it clear that it wasn't just me: there are a lot of people searching for this information and not being able to find it through official sources.

If I were a foreign power wanting to spread disinformation, providing easily available data files -- to fill the gap left by the US Government's refusal to do so -- would be a great way to mislead people. I could put anything I wanted in those files: there's no way of checking them against official results since there are no official results. Just make sure the totals add up to what people expect to see. You could easily set up an official-looking site and put made-up data there, and it would look a lot more real than all the people scraping from the NYT.

If our government -- or newspapers, or anyone else -- really wanted to combat "fake news", they should take open data seriously. They should make datasets for important issues like the presidential election publically available, as soon as possible after the election -- not four years later when nobody but historians care any more. Without that, we're leaving ourselves open to fake news and fake data.

Tags: , , , , , , ,
[ 16:41 Jan 12, 2017    More politics | permalink to this entry | ]

Sat, 13 Apr 2013

Parsing NOAA historical weather data

We've been considering the possibility of moving out of the Bay Area to somewhere less crowded, somewhere in the desert southwest we so love to visit. But that also means moving to somewhere with much harsher weather.

How harsh? It's pretty easy to search for a specific location and get average temperatures. But what if I want to make a table to compare several different locations? I couldn't find any site that made that easy.

No problem, I say. Surely there's a Python library, I say. Well, no, as it turns out. There are Python APIs to get the current weather anywhere; but if you want historical weather data, or weather data averaged over many years, you're out of luck.

NOAA purports to have historical climate data, but the only dataset I found was spotty and hard to use. There's an FTP site containing directories by year; inside are gzipped files with names like 723710-03162-2012.op.gz. The first two numbers are station numbers, and there's a file at the top level called ish-history.txt with a list of the station codes and corresponding numbers. Not obvious!

Once you figure out the station codes, the files themselves are easy to parse, with lines like

STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB      WDSP     MXSPD   GUST    MAX     MIN   PRCP   SNDP   FRSHTT
724945 23293  20120101    49.5 24    38.8 24  1021.1 24  1019.5 24    9.9 24    1.5 24    4.1  999.9    68.0    37.0   0.00G 999.9  000000
Each line represents one day (20120101 is January 1st, 2012), and the codes are explained in another file called GSOD_DESC.txt. For instance, MAX is the daily high temperature, and SNDP is snow depth.

[NOAA historical temp program] So all I needed was to decode the station names, download the right files and parse them. That took about a day to write (including a lot of time wasted futzing with mysterious incantations for matplotlib).

Little accessibility refresher: I showed it to Dave -- "Neat, look at this, San Jose is the blue pair, Flagstaff is green and Page is red." His reaction: "This makes no sense. They all look the same to me. I have no idea which is which." Oops -- right. Don't use color as your only visual indicator. I knew that, supposedly! So I added markers in different shapes for each site. (I wish somebody would teach that lesson to Google Maps, which uses color as its only indicator on the traffic layer, so it's useless for red-green colorblind people.)

Back to the data -- it turns out NOAA doesn't actually have that much historical data available for download. If you search on most of these locations, you'll find sites that claim to have historical temperatures dating back 50 years or more, sometimes back to the 1800s. But NOAA typically only has files starting at about 2005 or 2006. I don't know where sites are getting this older data, or how reliable it is.

Still, averages since 2006 are still interesting to compare. Here's a run of noaatemps.py KSJC KFLG KSAF KLAM KCEZ KPGA KCNY. It's striking how moderate California weather is compared to any of these inland sites. No surprise there. Another surprise was that Los Alamos, despite its high elevation, has more moderate weather than most of the others -- lower highs, higher lows. I was a bit disappointed at how sparse the site list was -- no site in Moab? Really? So I used Canyonlands Field instead.

Anyway, it's fun for a data junkie to play around with, and it prints data on other weather factors, like precipitation and snowpack, although it doesn't plot them yet. The code is on my GitHub scripts page, under Weather.

Anyone found a better source for historical weather information? I'd love to have something that went back far enough to do some climate research, see what sites are getting warmer, colder, or seeing greater or lesser spreads between their extreme temperatures. The NOAA dataset obviously can't do that, so there must be something else that weather researchers use. Data on other countries would be interesting, too. Is there anything that's available to the public?

Tags: , , ,
[ 22:57 Apr 13, 2013    More programming | permalink to this entry | ]