Shallow Thoughts : tags : flask

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 01 Feb 2020

Migrate a sqlite3 Flask App to Postgresql

The New Mexico legislature is in session again, which means the New Mexico Bill Tracker I wrote last year is back in season. But I guess the word has gotten out, because this year, I started seeing a few database errors. Specifically, "sqlite3.OperationalError: database is locked".

It turns out that even read queries on an sqlite3 database inside flask and sqlalchemy can sometimes keep the database open indefinitely. Consider something like:

    userbills = user.get_bills()    # this does a read query

    # Do some slow operations that don't involve the database at all
    for bill in userbills:
        slow_update_involving_web_scraping(bill)

    # Now bills are all updated; add and commit them.
    # Here's where the write operations start.
    for bill in userbills:
        db.session.add(bill)
    db.session.commit()

I knew better than to open a write query that might keep the database open during all those long running operations. But apparently, when using sqlite3, even the initial query of the database to get the user's bill list opens the database and keeps it open ... until when? Can you close it manually, then reopen it when you're ready? Does it help to call db.session.commit() after the read query? No one seems to know, and it's not obvious how to test to find out.

I've suspected for a long time that sqlite was only a temporary solution. While developing the billtracker, I hit quite a few difficulties where the answer turned out to be "well, this would be easy in a real database, but sqlite doesn't support that". I figured I'd eventually migrate to postgresql. But I'm such a database newbie that I'd been putting it off.

And rightly so. It turns out that migrating an existing database from sqlite3 to postgresql isn't something that gets written about much; I really couldn't find any guides on it. Apparently everybody but me just chooses the right database to begin with? Anyway, here are the steps on Debian. Obviously, install postgresql first.

Create a User and a Database

Postgresql has its own notion of users, which you need to create. At least on Debian, the default is that if you create a postgres user named martha, then the Linux user martha on the same machine can access databases that the postgres user martha has access to. This is controlled by the "peer" auth method, which you can read about in the postgresql documentation on pg_hba.conf.

First su to the postgres Linux user and run psql:

$ sudo su - postgres
$ psql

Inside psql, create a postgresql user with the same name as your flask user, and create a database for that user:

CREATE USER myflaskuser WITH PASSWORD 'password';
ALTER ROLE myflaskuser SET client_encoding TO 'utf8';
ALTER ROLE myflaskuser SET default_transaction_isolation TO 'read committed';
ALTER ROLE myflaskuser SET timezone TO 'UTC';

CREATE DATABASE dbname;
GRANT ALL PRIVILEGES ON DATABASE dbname TO myflaskuser;

If you like, you can also create a user yourusername and give it access to the same database, to make debugging easier.

With the database created, the next step is to migrate the old data from the sqlite database.

pgloader (if you have a very recent pgloader)

Using sqlalchemy in my flask app meant that I could use flask db upgrade to create the database schema in any database I chose. It does a lovely job of creating an empty database. Unfortunately, that's no help if you already have an existing database full of user accounts.

Some people suggested exporting data in either SQL or CSV format, then importing it into postgresql. Bad idea. There are many incompatibilities between the two databases: identifiers that work in sqlite but not in postgresql (like "user", which is a reserved word in postgres but a common table name in flask-based apps), capitalization of column names, incompatible date formats, and probably many more.

A program called pgloader takes care of many (but not all) of the incompatibilities. Create a file -- I'll call it migrate.pgloader -- like this:

load database
    from 'latest-sqlite-file.db'
    into postgresql:///new_db_name

with include drop, quote identifiers, create tables, create indexes, reset sequences

set work_mem to '16MB', maintenance_work_mem to '512 MB';

Then, from a Linux user who has access to the database (e.g. the myflaskuser you created earlier), run pgloader migrate.pgloader.

That worked nicely on my Ubuntu 19.10 desktop, which has pgloader 3.6.1. It failed utterly on the server, which is running Debian stable and pgloader 3.3.2. Building the latest pgloader from source didn't work on Debian either; it's based on Common Lisp, and the older CL on Debian dumped me into some kind of assembly debugger when I tried to build pgloader. Rather than build CL from source too, I looked for another option.

On an Older OS: Use pgloader Remotely

Postgresql can take commands from remote machines. So you can configure postgresql to accept remote connections, then run the migration from a machine with a new enough pgloader version.

There are two files to edit. The location of postgresql's configuration directory varies with version, so do a locate pg_hba.conf to find it. In that directory, first edit pg_hba.conf and add these lines to the end to allow net socket connections from IP4 and IP6:

host  all     all   0.0.0.0/0     md5
host  all     all   ::/0          md5

In the same directory, edit postgresql.conf and search for listen_addr. Comment out the localhost line if it's uncommented, and add this to allow connections from anywhere, not just localhost:

listen_addresses = '*'

Then restart the database with

service postgresql restart

Modify the migrate.pgloader file from the previous section so the "into" line looks like

    into postgresql://username:password@host/dbname
The username there is the postgres username, if you made that different from the Unix username. You need to use a password because postgres is no longer using peer auth (see that postgres documentation file I linked earlier).

Assuming this You're done with the remote connection part. If you don't need remote database connections for your app, you can now edit postgresql.conf, comment out that listen_addresses = '*' line, and restart the database again with service postgresql restart. Don't remove the two lines you added in pg_hba.conf; flask apparently needs them.

You're ready for the migration. Make sure you have the latest copy of the server's sqlite database, then, from your desktop, run:

pgloader migrate.pgloader

Migrate Autoincrements to Sequences

But that's not enough. If you're using any integer primary keys that autoincrement -- a pretty common database model -- postgresql doesn't understand that. Instead, it has sequence objects. You need to define a sequence, tie it to a table, and tell postgresql that when it adds a new object to the table, the default value of id is the maximum number in the corresponding sequence. Here's how to do that for the table named "user":

CREATE SEQUENCE user_id_seq OWNED by "user".id;
ALTER TABLE "user" ALTER COLUMN id SET default nextval('user_id_seq');
SELECT setval(pg_get_serial_sequence('user', 'id'), coalesce(max(id)+1,1), false) FROM "user";

Note the quotes around "user" because otherwise user is a postgresql reserved word. Repeat these three lines for every table in your database, except that you don't need the quotes around any table name except user.

Incidentally, I've been told that using autoincrement/sequence primary keys isn't best practice, because it can be a bottleneck if lots of different objects are being created at once. I used it because all the models I was following when I started with flask worked that way, but eventually I plan try to switch to using some other unique primary key.

Update: Turns out there was another problem with the sequences, and it was pretty annoying. I ended up with a bunch of indices with names like "idx_15517_ix_user_email" when they should have been "ix_user_email". The database superficially worked fine, but it havoc ensues if you ever need to do a flask/sqlalchemy/alembic migration, since sqlalchemy doesn't know anything about those indices with the funny numeric names. It's apparently possible to rename indices in postgresql, but it's a tricky operation that has to be done by hand for each index.

Now the database should be ready to test.

Test

Your flask app probably has something like this in config.py:

    SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL') or \
        'sqlite:///' + os.path.join(basedir, 'dbname.db')

If so, you can export DATABSE_URL=postgresql:///dbname and then test it as you usually would. If you normally test on a local machine and not on the server, remember you can tell flask's test server to accept connections from remote machines with flask run --host=0.0.0.0

Database Backups

You're backing up your database, right? That's easier in sqlite where you can just copy the db file.

From the command line, you can back up a postgresql database with: pg_dump dbname > dbname-backup.pg You can do that from Python in a subprocess:

    with open(backup_file, 'w') as fp:
        subprocess.call(["pg_dump", dbname], stdout=fp)

Verify You're Using The New Database

I had some problems with that DATABASE_URL setting; I'd never used it so I didn't realize that it wasn't in the right place and didn't actually work. So I ran through my migration steps, changed DATABASE_URL, thought I was done, and realized later that the app was still running off sqlite3.

It's better to know for sure what your app is running. For instance, you can add a route to routes.py that prints details like that.

You can print app.config["SQLALCHEMY_DATABASE_URI"]. That's enough in theory, but I wanted to know for sure. Turns out str(db.session.get_bind()) will print the connection the flask app's database is actually using. So I added a route that prints both, plus some other information about the running app.

Whew! I was a bit surprised that migrating was as tricky as it was, and that there wasn't more documentation for it. Happy migrations, everyone.

Tags: , , , , ,
[ 12:34 Feb 01, 2020    More tech/web | permalink to this entry | comments ]

Fri, 25 Jan 2019

Announcing the New Mexico Bill Tracker

For the last few weeks I've been consumed with a project I started last year and then put aside for a while: a bill tracker.

The project sprung out of frustration at the difficulty of following bills as they pass through the New Mexico legislature. Bills I was interested in would die in committee, or they would make it to a vote, and I'd read about it a few days later and wish I'd known that it was a good time to write my representative or show up at the Roundhouse to speak. (I've never spoken at the Roundhouse, and whether I'd have the courage to actually do it remains to be seen, but at least I'd like to have the chance to decide.)

New Mexico has a Legislative web site where you can see the status of each bill, and they even offer a way to register and save a list of bills; but then there's no way to get alerts about bills that change status and might be coming up for debate.

New Mexico legislative sessions are incredibly short: 60 days in odd years, 30 days in even. During last year's 30-day session, I wrote some Python code that scraped the HTML pages describing a bill, extract the useful information like when the bill last changed status and where it was right now, present the information in a table where the user could easily scan it, and email the user a daily summary. Fortunately, the nmlegis.gov site, while it doesn't offer raw data for bill status, at least uses lots of id tags in its HTML which make them relatively easy to scrape.

Then the session ended and there was no further way to test it, since bills' statuses were no longer changing. So the billtracker moved to the back burner.

In the runup to this year's 60-day session, I started with Flask, a lightweight Python web library I've used for a couple of small projects, and added some extensions that help Flask handle tasks like user accounts. Then I patched in the legislative web scraping code from last year, and the result was The New Mexico Bill Tracker. I passed the word to some friends in the League of Women Voters and the Sierra Club to help me test it, and I think (hope) it's ready for wider testing.

There's lots more I'd like to do, of course. I still have no way of knowing when a bill will be up for debate. It looks like this year the Legislative web site is showing committ schedules in a fairly standard way, as opposed to the unparseable PDFs they used in past years, so I may be able to get that. Not that legislative committees actually stick to their published schedules; but at least it's a start.

New Mexico readers (or anyone else interested in following the progress of New Mexico bills) are invited to try it. Let me know about any problems you encounter. And if you want to adapt the billtracker for use in another state, send me a note! I'd love to see it extended and would be happy to work with you. Here's the source: BillTracker on GitHub.

Tags: , , ,
[ 12:34 Jan 25, 2019    More politics | permalink to this entry | comments ]