The New Mexico legislature is in session again, which means the
New Mexico Bill Tracker
I wrote last year is back in season. But I guess the word has gotten
out, because this year, I started seeing a few database errors.
Specifically, "sqlite3.OperationalError: database is locked".
It turns out that even read queries on an sqlite3 database inside
flask and sqlalchemy can sometimes keep the database open
indefinitely. Consider something like:
userbills = user.get_bills()    # this does a read query

# Do some slow operations that don't involve the database at all
for bill in userbills:
    slow_update_involving_web_scraping(bill)

# Now bills are all updated; add and commit them.
# Here's where the write operations start.
for bill in userbills:
    db.session.add(bill)
db.session.commit()
I knew better than to open a write query that might keep the database
open during all those long running operations. But apparently, when
using sqlite3, even the initial query of the database to get the
user's bill list opens the database and keeps it open ... until
when? Can you close it manually, then reopen it when you're ready?
Does it help to call db.session.commit()
after the read query? No one seems to know, and it's not obvious
how to test to find out.
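For what it's worth, here's the sort of thing you might experiment with; I'm not asserting that either of these actually releases sqlite's lock:

userbills = user.get_bills()   # the read query
db.session.commit()   # does committing after a read release the lock? Unclear.
# Or, more drastically, close the session; sqlalchemy's scoped session
# will open a fresh one the next time you query.
db.session.close()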
I've suspected for a long time that sqlite was only a temporary
solution. While developing the billtracker, I hit quite a few
difficulties where the answer turned out to be "well, this would be
easy in a real database, but sqlite doesn't support that". I figured
I'd eventually migrate to postgresql. But I'm such a database newbie
that I'd been putting it off.
And rightly so. It turns out that migrating an existing database
from sqlite3 to postgresql isn't something that gets written
about much; I really couldn't find any guides on it.
Apparently everybody but me just chooses the right
database to begin with? Anyway, here are the steps on Debian.
Obviously, install postgresql first.
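On Debian that's a single package, which also pulls in client tools like psql and pg_dump:

$ sudo apt install postgresql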
Create a User and a Database
Postgresql has its own notion of users, which you need to create.
At least on Debian, the default is that if you create a postgres
user named martha, then the Linux user martha on the same machine
can access databases that the postgres user martha has access to.
This is controlled by the "peer" auth method, which you can read about in
the
postgresql documentation on pg_hba.conf.
First su to the postgres Linux user and run psql:
$ sudo su - postgres
$ psql
Inside psql, create a postgresql user with the same name as your
flask user, and create a database for that user:
CREATE USER myflaskuser WITH PASSWORD 'password';
ALTER ROLE myflaskuser SET client_encoding TO 'utf8';
ALTER ROLE myflaskuser SET default_transaction_isolation TO 'read committed';
ALTER ROLE myflaskuser SET timezone TO 'UTC';
CREATE DATABASE dbname;
GRANT ALL PRIVILEGES ON DATABASE dbname TO myflaskuser;
If you like, you can also create a user yourusername and give
it access to the same database, to make debugging easier.
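For example, still inside psql (with peer auth, the matching Linux account can then just run psql dbname from a shell):

CREATE USER yourusername;
GRANT ALL PRIVILEGES ON DATABASE dbname TO yourusername;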
With the database created, the next step is to migrate the old
data from the sqlite database.
pgloader (if you have a very recent pgloader)
Using sqlalchemy in my flask app meant that I could use
flask db upgrade
to create the database schema in any
database I chose. It does a lovely job of creating an empty database.
Unfortunately, that's no help
if you already have an existing database full of user accounts.
Some people suggested exporting data in either SQL or CSV format,
then importing it into postgresql. Bad idea. There are many
incompatibilities between the two databases: identifiers that
work in sqlite but not in postgresql (like "user", which is a reserved
word in postgres but a common table name in flask-based apps),
capitalization of column names, incompatible date formats, and
probably many more.
A program called pgloader takes care of many (but not all)
of the incompatibilities.
Create a file -- I'll call it migrate.pgloader --
like this:
load database
from 'latest-sqlite-file.db'
into postgresql:///new_db_name
with include drop, quote identifiers, create tables, create indexes, reset sequences
set work_mem to '16MB', maintenance_work_mem to '512 MB';
Then, from a Linux user who has access to the database (e.g. the
myflaskuser you created earlier), run pgloader migrate.pgloader.
That worked nicely on my Ubuntu 19.10 desktop, which has pgloader
3.6.1. It failed utterly on the server, which is running Debian stable
and pgloader 3.3.2.
Building the latest pgloader from source didn't work on Debian either;
it's based on Common Lisp, and the older CL on Debian dumped me into
some kind of assembly debugger when I tried to build pgloader. Rather
than build CL from source too, I looked for another option.
On an Older OS: Use pgloader Remotely
Postgresql can take commands from remote machines. So you can
configure
postgresql to accept remote connections, then run the migration
from a machine with a new enough pgloader version.
There are two files to edit. The location of postgresql's
configuration directory varies with version, so do a
locate pg_hba.conf
to find it.
In that directory, first edit pg_hba.conf
and add these lines to the end to allow net socket connections
from IP4 and IP6:
host all all 0.0.0.0/0 md5
host all all ::/0 md5
In the same directory, edit postgresql.conf and
search for listen_addr.
Comment out the localhost line if it's uncommented, and add this to allow
connections from anywhere, not just localhost:
listen_addresses = '*'
Then restart the database with
service postgresql restart
Modify the migrate.pgloader file from the previous section
so the "into" line looks like
into postgresql://username:password@host/dbname
The
username there is the postgres username, if you made that
different from the Unix username. You need to use a password because
postgres is no longer using
peer auth (see that postgres
documentation file I linked earlier).
Assuming all this works, you're done with the remote connection setup.
If your app doesn't need remote database connections, then once the
migration is finished you can edit postgresql.conf again, comment out that
listen_addresses = '*' line, and restart the database once more with
service postgresql restart.
Don't remove the two lines you added in pg_hba.conf;
flask apparently needs them.
You're ready for the migration.
Make sure you have the latest copy of the server's sqlite database,
then, from your desktop, run:
pgloader migrate.pgloader
Migrate Autoincrements to Sequences
But that's not enough. If you're using any integer primary keys that
autoincrement -- a pretty common database model -- postgresql doesn't
understand that. Instead, it has sequence objects.
You need to define a sequence, tie it to a table, and tell postgresql
that when it adds a new row to the table, the default value of id
is the next value from the corresponding sequence.
Here's how to do that for the table named "user":
CREATE SEQUENCE user_id_seq OWNED by "user".id;
ALTER TABLE "user" ALTER COLUMN id SET default nextval('user_id_seq');
SELECT setval(pg_get_serial_sequence('user', 'id'), coalesce(max(id)+1,1), false) FROM "user";
Note the quotes around "user": they're needed because user is a
postgresql reserved word. Repeat these three lines for every table
in your database; no table name other than user needs the quotes.
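If you need a list of the tables to loop over, postgresql can tell you; this assumes everything is in the default public schema:

SELECT tablename FROM pg_tables WHERE schemaname = 'public';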
Incidentally, I've been told that using autoincrement/sequence
primary keys isn't best practice, because it can be a bottleneck
if lots of different objects are being created at once. I used it
because all the models I was following when I started with flask
worked that way, but eventually I plan to try switching to some
other unique primary key.
Update:
Turns out there was another problem with the sequences, and it was
pretty annoying. I ended up with a bunch of indices with names like
"idx_15517_ix_user_email" when they should have been "ix_user_email".
The database superficially worked fine, but havoc ensues if you ever
need to do a flask/sqlalchemy/alembic migration, since sqlalchemy
doesn't know anything about those indices with the funny numeric names.
It's apparently possible to rename indices in postgresql, but it's a
tricky operation that has to be done by hand for each index.
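The rename itself is one line per index; the fiddly part is working out the correct name for each one. For the example above:

ALTER INDEX idx_15517_ix_user_email RENAME TO ix_user_email;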
Now the database should be ready to test.
Test
Your flask app probably has something like this in config.py:
SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL') or \
    'sqlite:///' + os.path.join(basedir, 'dbname.db')
If so, you can export DATABASE_URL=postgresql:///dbname
and then test it as you usually would. If you normally test on a
local machine and not on the server, remember you can tell flask's
test server to accept connections from remote machines with
flask run --host=0.0.0.0
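Putting those together, a test run on the server looks something like:

$ export DATABASE_URL=postgresql:///dbname
$ flask run --host=0.0.0.0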
Database Backups
You're backing up your database, right?
That's easier in sqlite where you can just copy the db file.
From the command line, you can back up a postgresql database with:
pg_dump dbname > dbname-backup.pg
You can do that from Python in a subprocess:
import subprocess

with open(backup_file, 'w') as fp:
    subprocess.call(["pg_dump", dbname], stdout=fp)
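Restoring from such a backup is the reverse, assuming the (empty) target database already exists:

$ psql dbname < dbname-backup.pg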
Verify You're Using The New Database
I had some problems with that DATABASE_URL setting; I'd never
used it so I didn't realize that it wasn't in the right place and
didn't actually work. So I ran through my migration steps, changed
DATABASE_URL, thought I was done, and realized later
that the app was still running off sqlite3.
It's better to know for sure what your app is running.
For instance, you can add a route to routes.py that prints details
like that, starting with app.config["SQLALCHEMY_DATABASE_URI"].
That's enough in theory, but I wanted to know for sure.
Turns out str(db.session.get_bind())
will print the
connection the flask app's database is actually using. So I added a route
that prints both, plus some other information about the running app.
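Here's a minimal sketch of such a route, assuming the usual app and db objects from a flask-sqlalchemy setup (the /dbinfo name is my invention):

@app.route('/dbinfo')
def dbinfo():
    # Show both the configured URI and the bind sqlalchemy actually uses
    return 'Configured: %s<br>Actually using: %s' % (
        app.config['SQLALCHEMY_DATABASE_URI'],
        str(db.session.get_bind()))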
Whew! I was a bit surprised that migrating was as tricky as it was,
and that there wasn't more documentation for it. Happy migrations, everyone.
Tags: programming, flask, python, database, sqlite, postgresql
[ 12:34 Feb 01, 2020 ]
For the last few weeks I've been consumed with a project I started
last year and then put aside for a while: a bill tracker.
The project sprang out of frustration at the difficulty of following
bills as they pass through the New Mexico legislature. Bills I was
interested in would die in committee, or they would make it to a
vote, and I'd read about it a few days later and wish I'd known
that it was a good time to write my representative or show up at
the Roundhouse to speak. (I've never spoken at the Roundhouse,
and whether I'd have the courage to actually do it remains to be
seen, but at least I'd like to have the chance to decide.)
New Mexico has a Legislative web site
where you can see the status of each bill, and they even offer a way
to register and save a list of bills; but then there's no way to
get alerts about bills that change status and might be coming up for debate.
New Mexico legislative sessions are incredibly short: 60 days in
odd years, 30 days in even. During last year's 30-day session,
I wrote some Python code that scraped the HTML pages describing a bill,
extracted useful information like when the bill last changed
status and where it currently was, presented the information
in a table the user could easily scan, and emailed the user a
daily summary.
Fortunately, the nmlegis.gov site, while it doesn't offer raw data for
bill status, at least uses lots of id tags in its HTML, which makes
its pages relatively easy to scrape.
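For instance, grabbing one field with BeautifulSoup looks roughly like this, assuming billurl holds a bill page's URL; the id string here is invented for illustration, since the real ones come from inspecting the bill pages:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(billurl).text, 'html.parser')
# Find the element holding the bill's current location by its id tag
location = soup.find(id='MainContent_lblLocation').get_text()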
Then the session ended and there was no further way to test it,
since bills' statuses were no longer changing. So the billtracker
moved to the back burner.
In the runup to this year's 60-day session, I started with Flask, a
lightweight Python web library I've used for a couple of small
projects, and added some extensions that help Flask handle tasks
like user accounts. Then I patched in the legislative web scraping
code from last year, and the result was
The New Mexico Bill Tracker.
I passed the word to some friends in the League of Women Voters and
the Sierra Club to help me test it, and I think (hope) it's ready for
wider testing.
There's lots more I'd like to do, of course. I still have no way of
knowing when a bill will be up for debate. It looks like this year
the Legislative web site is showing committee schedules in a fairly
standard way, as opposed to the unparseable PDFs they used in past years,
so I may be able to get that. Not that legislative committees actually
stick to their published schedules; but at least it's a start.
New Mexico readers (or anyone else interested in following the
progress of New Mexico bills) are invited to try it. Let me know about
any problems you encounter. And if you want to adapt the billtracker
for use in another state, send me a note! I'd love to see it extended
and would be happy to work with you. Here's the source:
BillTracker on GitHub.
Tags: politics, programming, python, flask, government
[ 12:34 Jan 25, 2019 ]