Getting Election Data, and Why Open Data is Important
Back in 2012, I got interested in fiddling around with election data as a way to learn about data analysis in Python. So I went searching for results data on the presidential election. And got a surprise: it wasn't available anywhere in the US. After many hours of searching, the only source I ever found was at the UK newspaper, The Guardian.
Surely in 2016, we're better off, right? But when I went looking, I found otherwise. There's still no official source for US election results data; there isn't even a source as reliable as The Guardian this time.
You might think Data.gov would be the place to go for official election results, but no: searching for 2016 election on Data.gov yields nothing remotely useful.
The Federal Election Commission has an election results page, but it only goes up to 2014 and only includes the Senate and House, not presidential elections. Archives.gov has popular vote totals for the 2012 election but not the current one. Maybe in four years, they'll have some data.
After striking out on official US government sites, I searched the web. I found a few sources, none of them even remotely official.
Early on I found Simon Rogers, How to Download County-Level Results Data, which leads to GitHub user tonmcg's County Level Election Results 12-16. It's a comparison of Democratic vs. Republican votes in the 2012 and 2016 elections (I assume that means votes for that party's presidential candidate, though the field names don't make that entirely clear), with no information on third-party candidates.
KidPixo's Presidential Election USA 2016 on GitHub is a little better: the fields make it clear that it's recording votes for Trump and Clinton, but still no third party information. It's also scraped from the New York Times, and it includes the scraping code so you can check it and have some confidence on the source of the data.
Kaggle claims to have election data, but you can't download their datasets or even see what they have without signing up for an account. Ben Hamner has some publically available Kaggle data on GitHub, but only for the primary. I also found several companies selling election data, and several universities that had datasets available for researchers with accounts at that university.
The most complete dataset I found, and the only open one that included third party candidates, was through OpenDataSoft. Like the other two, this data is scraped from the NYT. It has data for all the minor party candidates as well as the majors, plus lots of demographic data for each county in the lower 48, plus Hawaii, but not the territories, and the election data for all the Alaska counties is missing.
You can get it either from a GitHub repo, Deleetdk's USA.county.data (look in inst/ext/tidy_data.csv. If you want a larger version with geographic shape data included, clicking through several other opendatasoft pages eventually gets you to an export page, USA 2016 Presidential Election by county, where you can download CSV, JSON, GeoJSON and other formats.
The OpenDataSoft data file was pretty useful, though it had gaps (for instance, there's no data for Alaska). I was able to make my own red-blue-purple plot of county voting results (I'll write separately about how to do that with python-basemap), and to play around with statistics.
Implications of the lack of open data
But the point my search really brought home: By the time I finally found a workable dataset, I was so sick of the search, and so relieved to find anything at all, that I'd stopped being picky about where the data came from. I had long since given up on finding anything from a known source, like a government site or even a newspaper, and was just looking for data, any data.
And that's not good. It means that a lot of the people doing statistics on elections are using data from unverified sources, probably copied from someone else who claimed to have scraped it, using unknown code, from some post-election web page that likely no longer exists. Is it accurate? There's no way of knowing.
What if someone wanted to spread news and misinformation? There's a hunger for data, particularly on something as important as a US Presidential election. Looking at Google's suggested results and "Searches related to" made it clear that it wasn't just me: there are a lot of people searching for this information and not being able to find it through official sources.
If I were a foreign power wanting to spread disinformation, providing easily available data files -- to fill the gap left by the US Government's refusal to do so -- would be a great way to mislead people. I could put anything I wanted in those files: there's no way of checking them against official results since there are no official results. Just make sure the totals add up to what people expect to see. You could easily set up an official-looking site and put made-up data there, and it would look a lot more real than all the people scraping from the NYT.
If our government -- or newspapers, or anyone else -- really wanted to combat "fake news", they should take open data seriously. They should make datasets for important issues like the presidential election publically available, as soon as possible after the election -- not four years later when nobody but historians care any more. Without that, we're leaving ourselves open to fake news and fake data.
[ 16:41 Jan 12, 2017 More politics | permalink to this entry | ]