Shallow Thoughts

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 18 Feb 2017

Highlight and remove extraneous whitespace in emacs

I recently got annoyed with all the trailing whitespace I saw in files edited by Windows and Mac users, and in code snippets pasted from sites like StackOverflow. I already had my emacs set up to indent with only spaces:

(setq-default indent-tabs-mode nil)
(setq tabify nil)
and I knew about M-x delete-trailing-whitespace ... but after seeing someone else who had an editor set up to show trailing spaces, and tabs that ought to be spaces, I wanted that too.

To show trailing spaces is easy, but it took me some digging to find a way to control the color emacs used:

;; Highlight trailing whitespace.
(setq-default show-trailing-whitespace t)
(set-face-background 'trailing-whitespace "yellow")

I also wanted to show tabs, since code indented with a mixture of tabs and spaces, especially if it's Python, can cause problems. That was a little harder, but I eventually found it on the EmacsWiki: Show whitespace:

;; Also show tabs.
(defface extra-whitespace-face
  '((t (:background "pale green")))
  "Color for tabs and such.")

(defvar bad-whitespace
  '(("\t" . 'extra-whitespace-face)))

While I was figuring this out, I got some useful advice related to emacs faces on the #emacs IRC channel: if you want to know why something is displayed in a particular color, put the cursor on it and type C-u C-x = (the command what-cursor-position with a prefix argument), which displays lots of information about whatever's under the cursor, including its current face.

Once I had my colors set up, I found that a surprising number of files I'd edited with vim had trailing whitespace. I would have expected vim to be better behaved than that! But it turns out that to eliminate trailing whitespace, you have to program it yourself. For instance, here are some recipes to Remove unwanted spaces automatically with vim.

Tags: ,
[ 16:41 Feb 18, 2017    More linux/editors | permalink to this entry | comments ]

Mon, 13 Feb 2017

Emacs: Initializing code files with a template

Part of being a programmer is having an urge to automate repetitive tasks.

Every new HTML file I create should include some boilerplate HTML, like <html><head></head><body></body></html>. Every new Python file I create should start with #!/usr/bin/env python, and most of them should end with an if __name__ == "__main__": clause. I get tired of typing all that, especially the dunderscores and slash-greater-thans.

Long ago, I wrote an emacs function called newhtml to insert the boilerplate code:

(defun newhtml ()
  "Insert a template for an empty HTML page"
  (interactive)
  (insert "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n"
          "<html>\n"
          "<head>\n"
          "<title></title>\n"
          "</head>\n\n"
          "<body>\n\n"
          "<h1></h1>\n\n"
          "<p>\n\n"
          "</body>\n"
          "</html>\n")
  (forward-line -11)
  (forward-char 7)
  )

The motion commands at the end move the cursor back to point in between the <title> and </title>, so I'm ready to type the page title. (I should probably have it prompt me, so it can insert the same string in title and h1, which is almost always what I want.)
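
Just as a sketch (not what I actually use), a prompting version might look something like this, using interactive's "s" code to read the title:

;; Hypothetical prompting variant of newhtml, not from my real .emacs:
(defun newhtml-titled (title)
  "Insert a template for an empty HTML page, prompting for the title."
  (interactive "sPage title: ")
  (insert "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n"
          "<html>\n"
          "<head>\n"
          "<title>" title "</title>\n"
          "</head>\n\n"
          "<body>\n\n"
          "<h1>" title "</h1>\n\n"
          "<p>\n\n"
          "</body>\n"
          "</html>\n")
  )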

The newhtml function has worked for quite a while. But when I decided it was time to write the same function for Python:

(defun newpython ()
  "Insert a template for an empty Python script"
  (interactive)
  (insert "#!/usr/bin/env python\n"
          "\n"
          "\n"
          "\n"
          "if __name__ == '__main__':\n"
          "\n"
          )
  (forward-line -4)
  )
... I realized that I wanted to be even more lazy than that. Emacs knows what sort of file it's editing -- it switches to html-mode or python-mode as appropriate. Why not have it insert the template automatically?

My first thought was to have emacs run the function upon loading a file. There's a function, with-eval-after-load, which supposedly can act based on file suffix, so something like (with-eval-after-load ".py" (newpython)) is documented to work. But I found that it was never called, and I couldn't find an example that actually worked.

But then I realized that I have mode hooks for all the programming modes anyway, to set up things like indentation preferences. Inserting some text at the end of the mode hook seems perfectly simple:

(add-hook 'python-mode-hook
          (lambda ()
            (electric-indent-local-mode -1)
            (font-lock-add-keywords nil bad-whitespace)
            (if (= (buffer-size) 0)
                (newpython))
            (message "python hook")
            ))

The (= (buffer-size) 0) test ensures this only happens if I open a new file. Obviously I don't want to be auto-inserting code inside existing programs!

HTML mode was a little more complicated. I edit some files, like blog posts, that use HTML formatting, and hence need html-mode, but they aren't standalone HTML files that need the usual HTML template inserted. For blog posts, I use a different file extension, so I can use the elisp string-suffix-p to test for that:

  ;; string-suffix-p is like Python's endswith
  (if (and (= (buffer-size) 0)
           (string-suffix-p ".html" (buffer-file-name)))
      (newhtml) )
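
In context, that test goes in the html-mode hook, just like the python-mode hook above; a trimmed-down sketch (my real hook sets other preferences too):

(add-hook 'html-mode-hook
          (lambda ()
            ;; Only auto-insert the template into new, empty .html files:
            (if (and (= (buffer-size) 0)
                     (string-suffix-p ".html" (buffer-file-name)))
                (newhtml))
            ))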

I may eventually find other files that don't need the template; if I need to, it's easy to add other tests, like the directory where the new file will live.

A nice timesaver: open a new file and have a template automatically inserted.

Tags: , ,
[ 09:52 Feb 13, 2017    More linux/editors | permalink to this entry | comments ]

Sun, 05 Feb 2017

Rosy Finches

Los Alamos is having an influx of rare rosy-finches (which apparently are supposed to be hyphenated: they're rosy-finches, not finches that are rosy).

[Rosy-finches] They're normally birds of the snowy high altitudes, like the top of Sandia Crest, and quite unusual in Los Alamos. They're even rarer in White Rock, and although I've been keeping my eyes open I haven't seen any here at home; but a few days ago I was lucky enough to be invited to the home of a birder in town who's been seeing great flocks of rosy-finches at his feeders.

There are four types, only three of which have ever been seen locally, and we saw all three. Most of the flock was brown-capped rosy-finches, with two each of black rosy-finches and gray-crowned rosy-finches. The upper bird at right, I believe, is one of the blacks, but it might be a gray-crowned. They're a bit hard to tell apart. In any case, they're pretty birds, sparrow-sized with nice head markings and a hint of pink under the wing, and it was fun to get to see them.

[Roadrunner] The local roadrunner also made a brief appearance, and we marveled at the combination of high-altitude snowbirds and a desert bird here at the same place and time. White Rock seems like much better roadrunner territory, and indeed they're sometimes seen here (though not, so far, at my house), but they're just as common up in the forests of Los Alamos. Our host said he only sees them in winter; in spring, just as they start singing, they leave and go somewhere else. How odd!

Speaking of birds and spring, we have a juniper titmouse determinedly singing his ray-gun song, a few house sparrows are singing sporadically, and we're starting to see cranes flying north. They started a few days ago, and I counted several hundred of them today, enjoying the sunny and relatively warm weather as they made their way north. Ironically, just two weeks ago I saw a group of about sixty cranes flying south -- very late migrants, who must have arrived at the Bosque del Apache just in time to see the first northbound migrants leave. "Hey, what's up, we just got here, where ya all going?"

A few more photos: Rosy-finches (and a few other nice birds).

We also have a mule deer buck frequenting our yard, sometimes hanging out in the garden just outside the house to drink from the heated birdbath while everything else is frozen. (We haven't seen him in a few days, with the warmer weather and most of the ice melted.) We know it's the same buck coming back: he's easy to recognize because he's missing a couple of tines on one antler.

The buck is a welcome guest now, but in a month or so when the trees start leafing out I may regret that as I try to find ways of keeping him from stripping all the foliage off my baby apple tree, like some deer did last spring. I'm told it helps to put smelly soap shavings, like Irish Spring, in a bag and hang it from the branches, and deer will avoid the smell. I will try the soap trick but will probably combine it with other measures, like a temporary fence.

Tags: ,
[ 19:39 Feb 05, 2017    More nature/birds | permalink to this entry | comments ]

Fri, 27 Jan 2017

Making aliases for broken fonts

A web page I maintain (originally designed by someone else) specifies the Times font. On all my Linux systems, Times displays impossibly tiny, at least two sizes smaller than any other font that's ostensibly the same size, so the page is hard to read. I'm forever tempted to get rid of that font specifier, but I have to assume that other people in the organization like the professional look of Times, and that this pathological smallness of Times and Times New Roman is just a Linux font quirk.

In that case, a better solution is to alias it, so that pages that use Times will choose some larger, more readable font on my system. I found out how to do that in this excellent, clear post: How To Set Default Fonts and Font Aliases on Linux.

It turned out Times came from the gsfonts package, while Times New Roman came from msttcorefonts:

$ fc-match Times
n021003l.pfb: "Nimbus Roman No9 L" "Regular"
$ dpkg -S n021003l.pfb
gsfonts: /usr/share/fonts/type1/gsfonts/n021003l.pfb
$ fc-match "Times New Roman"
Times_New_Roman.ttf: "Times New Roman" "Normal"
$ dpkg -S Times_New_Roman.ttf
dpkg-query: no path found matching pattern *Times_New_Roman.ttf*
$ locate Times_New_Roman.ttf
/usr/share/fonts/truetype/msttcorefonts/Times_New_Roman.ttf
(dpkg -S doesn't find the file because msttcorefonts is a package that downloads a bunch of common fonts from Microsoft. Debian can't distribute the font files directly due to licensing restrictions.)

Removing gsfonts fonts isn't an option; aside from some documents and web pages possibly not working right (if they specify Times or Times New Roman and don't provide a fallback), removing gsfonts takes gnumeric and abiword with it, and I do occasionally use gnumeric. And I like having the msttcorefonts installed (hey, gotta have Comic Sans! :-) ). So aliasing the font is a better bet.

Following Chuan Ji's page, linked above, I edited ~/.config/fontconfig/fonts.conf (I already had one, specifying fonts for the fantasy and cursive web families), and added these stanzas:

    <match>
        <test name="family"><string>Times New Roman</string></test>
        <edit name="family" mode="assign" binding="strong">
            <string>DejaVu Serif</string>
        </edit>
    </match>
    <match>
        <test name="family"><string>Times</string></test>
        <edit name="family" mode="assign" binding="strong">
            <string>DejaVu Serif</string>
        </edit>
    </match>
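
(If you're creating ~/.config/fontconfig/fonts.conf from scratch rather than adding to an existing file, those stanzas have to be wrapped in a fontconfig element; a minimal complete file looks something like this:)

<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
    <match>
        <test name="family"><string>Times</string></test>
        <edit name="family" mode="assign" binding="strong">
            <string>DejaVu Serif</string>
        </edit>
    </match>
</fontconfig>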

The page says to log out and back in, but I found that restarting firefox was enough. Now I can load up a page that specifies Times or Times New Roman and the text is easily readable.
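
You can also sanity-check the alias from the command line with fc-match, which picks up the new configuration right away (the exact file and style names will vary by system):

$ fc-match Times
DejaVuSerif.ttf: "DejaVu Serif" "Book"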

Tags: ,
[ 14:47 Jan 27, 2017    More linux | permalink to this entry | comments ]

Mon, 23 Jan 2017

Testing a GitHub Pull Request

Several times recently I've come across someone with a useful fix to a program on GitHub, for which they'd filed a GitHub pull request.

The problem is that GitHub doesn't give you any link on the pull request to let you download the code in that pull request. You can get a list of the checkins inside it, or a list of the changed files so you can view the differences graphically. But if you want the code on your own computer, so you can test it, or use your own editors and diff tools to inspect it, it's not obvious how. That this is a problem is easily seen with a web search for something like download github pull request -- there are huge numbers of people asking how, and most of the answers are vague and unclear.

That's a shame, because it turns out it's easy to pull a pull request. You can fetch it directly with git into a new branch as long as you have the pull request ID. That's the ID shown on the GitHub pull request page:

[GitHub pull request screenshot]

Once you have the pull request ID, choose a new name for your branch, then fetch it:

git fetch origin pull/PULL-REQUEST_ID/head:NEW-BRANCH-NAME
git checkout NEW-BRANCH-NAME

Then you can view diffs with something like git difftool NEW-BRANCH-NAME..master
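
For example, with a made-up pull request ID of 1234 and a branch named pr-1234:

# 1234 here is a placeholder pull request ID
git fetch origin pull/1234/head:pr-1234
git checkout pr-1234
git difftool pr-1234..master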

Easy! GitHub should give a hint of that on its pull request pages.

Fetching a Pull Request diff to apply it to another tree

But shortly after I learned how to apply a pull request, I had a related but different problem in another project. There was a pull request for an older repository, but the part it applied to had since been split off into a separate project. (It was an old pull request that had fallen through the cracks, and as a new developer on the project, I wanted to see if I could help test it in the new repository.)

You can't pull a pull request that's for a whole different repository. But what you can do is go to the pull request's page on GitHub. There are 3 tabs: Conversation, Commits, and Files changed. Click on Files changed to see the diffs visually.

That works if the changes are small and only affect a few files (which fortunately was the case this time). It's not so great if there are a lot of changes or a lot of files affected. I couldn't find any "Raw" or "download" button that would give me a diff I could actually apply. You can select all and then paste the diffs into a local file, but you have to do that separately for each file affected. It might be, if you have a lot of files, that the best solution is to check out the original repo, apply the pull request, generate a diff locally with git diff, then apply that diff to the new repo. Rather circuitous. But with any luck that situation won't arise very often.

Update: thanks very much to Houz for the solution! (In the comments, below.) Just append .diff or .patch to the pull request URL, e.g. https://github.com/OWNER/REPO/pull/REQUEST-ID.diff which you can view in a browser or fetch with wget or curl.
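
So for the cross-repository case above, something like this should work (same placeholder names; it assumes the file paths still match in the new repository):

# OWNER, REPO and REQUEST-ID are placeholders
wget https://github.com/OWNER/REPO/pull/REQUEST-ID.diff
git apply REQUEST-ID.diff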

Tags: , ,
[ 14:34 Jan 23, 2017    More programming | permalink to this entry | comments ]

Thu, 19 Jan 2017

Plotting Shapes with Python Basemap without Shapefiles

In my article on Plotting election (and other county-level) data with Python Basemap, I used ESRI shapefiles for both states and counties.

But one of the election data files I found, OpenDataSoft's USA 2016 Presidential Election by county had embedded county shapes, available either as CSV or as GeoJSON. (I used the CSV version, but inside the CSV the geo data are encoded as JSON so you'll need JSON decoding either way. But that's no problem.)

Just about all the documentation I found on coloring shapes in Basemap assumed that the shapes were defined as ESRI shapefiles. How do you draw shapes if you have latitude/longitude data in a more open format?

As it turns out, it's quite easy, but it took a fair amount of poking around inside Basemap to figure out how it worked.

In the loop over counties in the US in the previous article, the end goal was to create a matplotlib Polygon and use that to add a Basemap patch. But matplotlib's Polygon wants map coordinates, not latitude/longitude.

If m is your basemap (i.e. you created the map with m = Basemap( ... )), you can translate coordinates like this:

    (mapx, mapy) = m(longitude, latitude)

So once you have a region as a list of (longitude, latitude) coordinate pairs, you can create a colored, shaped patch like this:

    for coord_pair in region:
        coord_pair[0], coord_pair[1] = m(coord_pair[0], coord_pair[1])
    poly = Polygon(region, facecolor=color, edgecolor=color)
    ax.add_patch(poly)
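
(Modifying the coordinate pairs in place works, but if you'd rather not clobber the original latitude/longitude data, a list comprehension does the same thing:)

    map_coords = [m(lon, lat) for lon, lat in region]
    poly = Polygon(map_coords, facecolor=color, edgecolor=color)
    ax.add_patch(poly)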

Working with the OpenDataSoft data file was actually a little harder than that, because the list of coordinates was JSON-encoded inside the CSV file, so I had to decode it with json.loads(county["Geo Shape"]). Once decoded, some counties came through as a Polygon (a list of lists, allowing for discontiguous outlines), and others as a MultiPolygon (a list of lists of lists; I'm not sure why, since the Polygon format already allows for discontiguous boundaries).
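
A sketch of handling both cases when decoding the "Geo Shape" column (GeoJSON nests a MultiPolygon's coordinates one level deeper than a Polygon's):

    import json

    geo = json.loads(county["Geo Shape"])
    if geo["type"] == "Polygon":
        polygons = [geo["coordinates"]]   # wrap it so both cases look alike
    else:                                 # "MultiPolygon"
        polygons = geo["coordinates"]
    for polygon in polygons:
        region = polygon[0]               # the first ring is the outer boundary
        # ... then convert region to map coordinates and add the patch as above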

[Blue-red-purple 2016 election map]

And a few counties were missing, so there were blanks on the map, which show up as white patches in this screenshot. The counties missing data either have inconsistent formatting in their coordinate lists, or they have only one coordinate pair; they include Washington, Virginia; Roane, Tennessee; Schley, Georgia; Terrell, Georgia; Marshall, Alabama; Williamsburg, Virginia; and Pike, Georgia; plus Oglala Lakota (which is clearly meant to be Oglala, South Dakota), and all of Alaska.

One thing about crunching data files from the internet is that there are always a few special cases you have to code around. And I could have gotten those coordinates from the census shapefiles; but as long as I needed the census shapefile anyway, why use the CSV shapes at all? In this particular case, it makes more sense to use the shapefiles from the Census.

Still, I'm glad to have learned how to use arbitrary coordinates as shapes, freeing me from the proprietary and annoying ESRI shapefile format.

The code: Blue-red map using CSV with embedded county shapes

Tags: , , , , ,
[ 09:36 Jan 19, 2017    More programming | permalink to this entry | comments ]

Sat, 14 Jan 2017

Plotting election (and other county-level) data with Python Basemap

After my arduous search for open 2016 election data by county, as a first test I wanted one of those red-blue-purple charts of how Democratic or Republican each county's vote was.

I used the Basemap package for plotting. It used to be part of matplotlib, but it's been split off into its own toolkit, grouped under mpl_toolkits: on Debian, it's available as python-mpltoolkits.basemap, or you can find Basemap on GitHub.

It's easiest to start with the fillstates.py example that shows how to draw a US map with different states colored differently. You'll need the three shapefiles (because of ESRI's silly shapefile format): st99_d00.dbf, st99_d00.shp and st99_d00.shx, available in the same examples directory.

Of course, to plot counties, you need county shapefiles as well. The US Census has county shapefiles at several different resolutions (I used the 500k version). Then you can plot state and county outlines like this:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

def draw_us_map():
    # Set the lower left and upper right limits of the bounding box:
    lllon = -119
    urlon = -64
    lllat = 22.0
    urlat = 50.5
    # and calculate a centerpoint, needed for the projection:
    centerlon = float(lllon + urlon) / 2.0
    centerlat = float(lllat + urlat) / 2.0

    m = Basemap(resolution='i',  # crude, low, intermediate, high, full
                llcrnrlon = lllon, urcrnrlon = urlon,
                lon_0 = centerlon,
                llcrnrlat = lllat, urcrnrlat = urlat,
                lat_0 = centerlat,
                projection='tmerc')

    # Read state boundaries.
    shp_info = m.readshapefile('st99_d00', 'states',
                               drawbounds=True, color='lightgrey')

    # Read county boundaries
    shp_info = m.readshapefile('cb_2015_us_county_500k',
                               'counties',
                               drawbounds=True)

if __name__ == "__main__":
    draw_us_map()
    plt.title('US Counties')
    # Get rid of some of the extraneous whitespace matplotlib loves to use.
    plt.tight_layout(pad=0, w_pad=0, h_pad=0)
    plt.show()
[Simple map of US county borders]

Accessing the state and county data after reading shapefiles

Great. Now that we've plotted all the states and counties, how do we get a list of them, so that when I read out "Santa Clara, CA" from the data I'm trying to plot, I know which map object to color?

After calling readshapefile('st99_d00', 'states'), m has two new members, both lists: m.states and m.states_info.

m.states_info[] is a list of dicts mirroring what was in the shapefile. For the Census state list, the useful keys are NAME, AREA, and PERIMETER. There's also STATE, which is an integer (not restricted to 1 through 50) but I'll get to that.

If you want the shape for, say, California, iterate through m.states_info[] looking for the one where m.states_info[i]["NAME"] == "California". Note the index i: the shape coordinates will be in m.states[i] (in basemap map coordinates, not latitude/longitude).
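
For example, a sketch of looking up California's shape:

    for i, state in enumerate(m.states_info):
        if state["NAME"] == "California":
            california_shape = m.states[i]   # map coordinates, not lat/lon
            break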

Correlating states and counties in Census shapefiles

County data is similar, with county names in m.counties_info[i]["NAME"]. Remember that STATE integer? Each county has a STATEFP, m.counties_info[i]["STATEFP"] that matches some state's m.states_info[i]["STATE"].

But doing that search every time would be slow. So right after calling readshapefile for the states, I make a table of states. Empirically, STATE in the state list goes up to 72. Why 72? Shrug.

    MAXSTATEFP = 73
    states = [None] * MAXSTATEFP
    for state in m.states_info:
        statefp = int(state["STATE"])
        # Many states have multiple entries in m.states (because of islands).
        # Only add it once.
        if not states[statefp]:
            states[statefp] = state["NAME"]

That'll make it easy to look up a county's state name quickly when we're looping through all the counties.

Calculating colors for each county

Time to figure out the colors from the Deleetdk election results CSV file. Reading lines from the CSV file into a dictionary is superficially easy enough:

    fp = open("tidy_data.csv")
    reader = csv.DictReader(fp)

    # Make a dictionary of all "county, state" and their colors.
    county_colors = {}
    for county in reader:
        # What color is this county?
        pop = float(county["votes"])
        blue = float(county["results.clintonh"])/pop
        # The red fraction should come from the Trump vote count; results.trumpd
        # is assumed here to parallel the results.clintonh column.
        red = float(county["results.trumpd"])/pop
        county_colors["%s, %s" % (county["name"], county["State"])] \
            = (red, 0, blue)

But in practice, that wasn't good enough, because the county names in the Deleetdk data didn't always match the official Census county names.

Fuzzy matches

For instance, the CSV file had no results for Alaska or Puerto Rico, so I had to skip those. Non-ASCII characters were a problem: "Doña Ana" county in the census data was "Dona Ana" in the CSV. I had to strip off " County", " Borough" and similar terms: "St Louis" in the census data was "St. Louis County" in the CSV. Some names were capitalized differently, like PLYMOUTH vs. Plymouth, or Lac Qui Parle vs. Lac qui Parle. And some names were just different, like "Jeff Davis" vs. "Jefferson Davis".

To get around that I used SequenceMatcher to look for fuzzy matches when I couldn't find an exact match:

from difflib import SequenceMatcher

def fuzzy_find(s, slist):
    '''Try to find a fuzzy match for s in slist.
    '''
    best_ratio = -1
    best_match = None

    ls = s.lower()
    for ss in slist:
        r = SequenceMatcher(None, ls, ss.lower()).ratio()
        if r > best_ratio:
            best_ratio = r
            best_match = ss
    if best_ratio > .75:
        return best_match
    return None

Correlate the county names from the two datasets

It's finally time to loop through the counties in the map to color and plot them.

Remember STATE vs. STATEFP? It turns out there are a few counties in the census county shapefile with a STATEFP that doesn't match any STATE in the state shapefile. Mostly they're in the Virgin Islands and I don't have election data for them anyway, so I skipped them for now. I also skipped Puerto Rico and Alaska (no results in the election data) and counties that had no corresponding state: I'll omit that code here, but you can see it in the final script, linked at the end.

    for i, county in enumerate(m.counties_info):
        countyname = county["NAME"]
        try:
            statename = states[int(county["STATEFP"])]
        except IndexError:
            print countyname, "has out-of-index statefp of", county["STATEFP"]
            continue

        countystate = "%s, %s" % (countyname, statename)
        try:
            ccolor = county_colors[countystate]
        except KeyError:
            # No exact match; try for a fuzzy match
            fuzzyname = fuzzy_find(countystate, county_colors.keys())
            if fuzzyname:
                ccolor = county_colors[fuzzyname]
                county_colors[countystate] = ccolor
            else:
                print "No match for", countystate
                continue

        countyseg = m.counties[i]
        poly = Polygon(countyseg, facecolor=ccolor)  # edgecolor="white"
        ax.add_patch(poly)

Moving Hawaii

Finally, although the CSV didn't have results for Alaska, it did have Hawaii. To display it, you can move it when creating the patches:

    countyseg = m.counties[i]
    if statename == 'Hawaii':
        countyseg = list(map(lambda (x,y): (x + 5750000, y-1400000), countyseg))
    poly = Polygon(countyseg, facecolor=ccolor)
    ax.add_patch(poly)
The offsets are in map coordinates and are empirical; I fiddled with them until Hawaii showed up at a reasonable place. [Blue-red-purple 2016 election map]
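
(The tuple-unpacking lambda there is Python 2 only; in Python 3 you'd write it as something like a list comprehension instead:)

    if statename == 'Hawaii':
        countyseg = [(x + 5750000, y - 1400000) for x, y in countyseg]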

Well, that was a much longer article than I intended. Turns out it takes a fair amount of code to correlate several datasets and turn them into a map. But a lot of the work will be applicable to other datasets.

Full script on GitHub: Blue-red map using Census county shapefile

Tags: , , , , , ,
[ 15:10 Jan 14, 2017    More programming | permalink to this entry | comments ]

Thu, 12 Jan 2017

Getting Election Data, and Why Open Data is Important

Back in 2012, I got interested in fiddling around with election data as a way to learn about data analysis in Python. So I went searching for results data on the presidential election. And got a surprise: it wasn't available anywhere in the US. After many hours of searching, the only source I ever found was at the UK newspaper, The Guardian.

Surely in 2016, we're better off, right? But when I went looking, I found otherwise. There's still no official source for US election results data; there isn't even a source as reliable as The Guardian this time.

You might think Data.gov would be the place to go for official election results, but no: searching for 2016 election on Data.gov yields nothing remotely useful.

The Federal Election Commission has an election results page, but it only goes up to 2014 and only includes the Senate and House, not presidential elections. Archives.gov has popular vote totals for the 2012 election but not the current one. Maybe in four years, they'll have some data.

After striking out on official US government sites, I searched the web. I found a few sources, none of them even remotely official.

Early on I found Simon Rogers, How to Download County-Level Results Data, which leads to GitHub user tonmcg's County Level Election Results 12-16. It's a comparison of Democratic vs. Republican votes in the 2012 and 2016 elections (I assume that means votes for that party's presidential candidate, though the field names don't make that entirely clear), with no information on third-party candidates.

KidPixo's Presidential Election USA 2016 on GitHub is a little better: the fields make it clear that it's recording votes for Trump and Clinton, but still no third party information. It's also scraped from the New York Times, and it includes the scraping code so you can check it and have some confidence in the source of the data.

Kaggle claims to have election data, but you can't download their datasets or even see what they have without signing up for an account. Ben Hamner has some publicly available Kaggle data on GitHub, but only for the primary. I also found several companies selling election data, and several universities that had datasets available for researchers with accounts at that university.

The most complete dataset I found, and the only open one that included third party candidates, was through OpenDataSoft. Like the other two, this data is scraped from the NYT. It has data for all the minor party candidates as well as the majors, plus lots of demographic data for each county in the lower 48, plus Hawaii, but not the territories, and the election data for all the Alaska counties is missing.

You can get it either from a GitHub repo, Deleetdk's USA.county.data (look in inst/ext/tidy_data.csv), or, if you want a larger version with geographic shape data included, by clicking through several other OpenDataSoft pages to an export page, USA 2016 Presidential Election by county, where you can download CSV, JSON, GeoJSON and other formats.

The OpenDataSoft data file was pretty useful, though it had gaps (for instance, there's no data for Alaska). I was able to make my own red-blue-purple plot of county voting results (I'll write separately about how to do that with python-basemap), and to play around with statistics.

Implications of the lack of open data

But the point my search really brought home: By the time I finally found a workable dataset, I was so sick of the search, and so relieved to find anything at all, that I'd stopped being picky about where the data came from. I had long since given up on finding anything from a known source, like a government site or even a newspaper, and was just looking for data, any data.

And that's not good. It means that a lot of the people doing statistics on elections are using data from unverified sources, probably copied from someone else who claimed to have scraped it, using unknown code, from some post-election web page that likely no longer exists. Is it accurate? There's no way of knowing.

What if someone wanted to spread misinformation? There's a hunger for data, particularly on something as important as a US Presidential election. Looking at Google's suggested results and "Searches related to" made it clear that it wasn't just me: there are a lot of people searching for this information and not being able to find it through official sources.

If I were a foreign power wanting to spread disinformation, providing easily available data files -- to fill the gap left by the US Government's refusal to do so -- would be a great way to mislead people. I could put anything I wanted in those files: there's no way of checking them against official results since there are no official results. Just make sure the totals add up to what people expect to see. You could easily set up an official-looking site and put made-up data there, and it would look a lot more real than all the people scraping from the NYT.

If our government -- or newspapers, or anyone else -- really wanted to combat "fake news", they should take open data seriously. They should make datasets for important issues like the presidential election publicly available, as soon as possible after the election -- not four years later when nobody but historians cares any more. Without that, we're leaving ourselves open to fake news and fake data.

Tags: , , , , ,
[ 16:41 Jan 12, 2017    More politics | permalink to this entry | comments ]