I've been trying to learn more about weather from a friend who used to work in the field -- in particular, New Mexico's notoriously windy spring. One of the reasons behind our spring winds relates to the location of the jet stream. But I couldn't find many good references showing how the jet stream moves throughout the year. So I decided to try to plot it myself -- if I could find the data. Getting weather data can surprisingly hard.
In my search, I stumbled across Geert Barentsen's excellent Annual variations in the jet stream (video). It wasn't quite what I wanted -- it shows the position of the jet stream in December in successive years -- but the important thing is that he provides a Python script on GitHub that shows how he produced his beautiful animation.
Well -- mostly. It turns out his data sources are no longer available, and he didn't go into a lot of detail on where he got his data, only saying that it was from the ECMWF ERA re-analysis model (with a link that's now 404). That led me on a merry chase through the ECMWF website trying to figure out which part of which database I needed. ECMWF has lots of publically available databases (and even more) and they even have Python libraries to access them; and they even have a lot of documentation, but somehow none of the documentation addresses questions like which database includes which variables and how to find and fetch the data you're after, and a lot of the sample code doesn't actually work. I ended up using the "ERA Interim, Daily" dataset and requesting data for only specific times and only the variables and pressure levels I was interested in. It's a great source of data once you figure out how to request it.
Sign up for an ECMWF API Key
Once you get the email with your initial password, log in using the URL in the email, and change the password. That gave me a "next" button that, when I clicked it, took me to a page warning me that the page was obsolete and I should update whatever bookmark I had used to get there. That page also doesn't offer a link to the new page where you can get your key details, so go here: Your API key. The API Key page gives you some lines you can paste into ~/.ecmwfapirc.
You'll also have to accept the license terms for the databases you want to use.
Install the Python API
That sets you up to use the ECMWF api. They have a Web API and a Python library, plus some other Python packages, but after struggling with a bunch of Magics tutorial examples that mostly crashed or couldn't find data, I decided I was better off sticking to the basic Python downloader API and plotting the results with Matplotlib.
The Python data-fetching API works well. To install it, activate your preferred Python virtualenv or whatever you use for pip packages, then run the pip command shown at Web API Downloads (under "Click here to see the installation/update instructions..."). As always with pip packages, you'll have to decide on a Python version (they support both 2 and 3) and whether to use a virtualenv, the much-disrecommended sudo pip, pip3, etc. I used pip3 in a virtualenv and it worked fine.
Specify a dataset and parameters
That's great, but how do you know which dataset you want to load?
There doesn't seem to be anything that just lists which datasets have which variables. The only way I found is to go to the Web API page for a particular dataset to see the form where you can request different variables. For instance, I ended up using the "interim-full-daily" database, where you can choose date ranges and lists of parameters. There are more choices in the sidebar: for instance, clicking on "Pressure levels" lets you choose from a list of barometric pressures ranging from 1000 all the way down to 1. No units are specified, but they're millibars, also known as hectoPascals (hPa): 1000 is more or less the pressure at ground level, 250 is roughly where the jet stream is, and Los Alamos is roughly at 775 hPa (you can find charts of pressure vs. altitude on the web).
When you go to any of the Web API pages, it will show you a dialog suggesting you read about Data retrieval efficiency, which you should definitely do if you're expecting to request a lot of data, then click on the details for the database you're using to find out how data is grouped in "tape files". For instance, in the ERA-interim database, tapes are grouped by date, so if you're requesting multiple parameters for multiple months, request all the parameters for a given month together, rather than making one request for level 250, another request for level 1000, etc.
Once you've checked the boxes for the data you want, you can fetch the data via the web interface, or click on "View the MARS request" to get parameters you can plug into a Python script.
If you choose the Python script option as I did, you can start with the
basic data retrieval example.
Use the second example, the one that uses
'format' : "netcdf",
which will (eventually) give you a file ending in .nc.
Requesting a specific area
You can request only a limited area,
"area": "75/-20/10/60",but they're not very forthcoming on the syntax of that, and it's particularly confusing since "75/-20/10/60" supposedly means "Europe". It's hard to figure how those numbers as longitudes and latitudes correspond to Europe, which doesn't go down to 10 degrees latitude, let alone -20 degrees. The Post-processing keywords page gives more information: it's North/West/South/East, which still makes no sense for Europe, until you expand the Area examples tab on that page and find out that by "Europe" they mean Europe plus Saudi Arabia and most of North Africa.
Using the data: What's in it?
Once you have the data file, assuming you requested data in netcdf format, you can parse the .nc file with the netCDF4 Python module -- available as Debian package "python3-netcdf4", or via pip -- to read that file:
import netCDF4 data = netCDF4.Dataset('filename.nc')
But what's in that Dataset? Try running the preceding two lines in the interactive Python shell, then:
>>> for key in data.variables: ... print(key) ... longitude latitude level time w vo u v
You can find out more about a parameter, like its units, type, and shape (array dimensions). Let's look at "level":
>>> data['level'] <class 'netCDF4._netCDF4.Variable'> int32 level(level) units: millibars long_name: pressure_level unlimited dimensions: current shape = (3,) filling on, default _FillValue of -2147483647 used >>> data['level'][:] array([ 250, 775, 1000], dtype=int32) >>> type(data['level'][:]) <class 'numpy.ndarray'>
Levels has shape
(3,): it's a one-dimensional array with
three elements: 250, 775 and 1000.
Those are the three levels I requested from the web API and in my
Python script). The units are millibars.
More complicated variables
How about something more complicated? u and v are the two components of wind speed.
>>> data['u'] <class 'netCDF4._netCDF4.Variable'> int16 u(time, level, latitude, longitude) scale_factor: 0.002161405503194121 add_offset: 30.095301438361684 _FillValue: -32767 missing_value: -32767 units: m s**-1 long_name: U component of wind standard_name: eastward_wind unlimited dimensions: time current shape = (30, 3, 241, 480) filling onu (v is the same) has a shape of (30, 3, 241, 480): it's a 4-dimensional array. Why? Looking at the numbers in the shape gives a clue. The second dimension has 3 rows: they correspond to the three levels, because there's a wind speed at every level. The first dimension has 30 rows: it corresponds to the dates I requested (the month of April 2015). I can verify that:
>>> data['time'].shape (30,)
Sure enough, there are 30 times, so that's what the first dimension of u and v correspond to. The other dimensions, presumably, are latitude and longitude. Let's check that:
>>> data['longitude'].shape (480,) >>> data['latitude'].shape (241,)
Sure enough! So, although it would be nice if it actually told you which dimension corresponded with which parameter, you can probably figure it out. If you're not sure, print the shapes of all the variables and work out which dimensions correspond to what:
>>> for key in data.variables: ... print(key, data[key].shape)
Iterating over times
data['time'] has all the times for which you have data (30 data points for my initial test of the days in April 2015). The easiest way to plot anything is to iterate over those values:
timeunits = JSdata.data['time'].units cal = JSdata.data['time'].calendar for i, t in enumerate(JSdata.data['time']): thedate = netCDF4.num2date(t, units=timeunits, calendar=cal)
Then you can use thedate like a datetime,
thedate.strftime or whatever you need.
So that's how to access your data. All that's left is to plot it -- and in this case I had Geert Barentsen's script to start with, so I just modified it a little to work with slightly changed data format, and then added some argument parsing and runtime options.
Converting to Video
I already wrote about how to take the still images the program produces and turn them into a video: Making Videos (that work in Firefox) from a Series of Images.
However, it turns out ffmpeg can't handle files that are named with timestamps, like jetstream-2017-06-14-250.png. It can only handle one sequential integer. So I thought, what if I removed the dashes from the name, and used names like jetstream-20170614-250.png with %8d? No dice: ffmpeg also has the limitation that the integer can have at most four digits.
So I had to rename my images. A shell command works: I ran this in zsh but I think it should work in bash too.
cd outdir mkdir moviedir i=1 for fil in *.png; do newname=$(printf "%04d.png" $i) ln -s ../$fil moviedir/$newname i=$((i+1)) done ffmpeg -i moviedir/%4d.png -filter:v "setpts=2.5*PTS" -pix_fmt yuv420p jetstream.mp4The
-filter:v "setpts=2.5*PTS"controls the delay between frames -- I'm not clear on the units, but larger numbers have more delay, and I think it's a multiplier, so this is 2.5 times slower than the default.
When I uploaded the video to YouTube, I got a warning, "Your videos will process faster if you encode into a streamable file format." I then spent half a day trying to find a combination of ffmpeg arguments that avoided that warning, and eventually gave up. As far as I can tell, the warning only affects the 20 seconds or so of processing that happens after the 5-10 minutes it takes to upload the video, so I'm not sure it's terribly important.
And here's the script, updated from the original Barentsen script and with a bunch of command-line options to let you plot different collections of data: jetstream.py on GitHub.
[ 14:18 May 14, 2018 More programming | permalink to this entry | comments ]