Viewing and modifying epub ebook tags (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 09 Jun 2012

Viewing and modifying epub ebook tags

My epub Books folder is starting to look like my physical bookshelf at home -- huge and overflowing with books I hope to read some day. Mostly free books from the wonderful Project Gutenberg and DRM-free books from publishers and authors who support that model.

With the Nook's standard library viewer that's impossible to manage. All you can do is sort all those books alphabetically by title or author and laboriously page through, some five books to a page, hoping the one you want will catch your eye. Worse, sometimes books show up in the author view but don't show up in the title view, or vice versa. I guess Barnes & Noble think nobody keeps more than ten or so books on their shelves.

Fortunately on my rooted Nook I have the option of using better readers, like FBreader and Aldiko, that let me sort by tags. If I want to read something about the Civil War, or Astronomy, or just relax with some Science Fiction, I can browse by keyword.

Well, in theory. In practice, tagging of ebooks is inconsistent and not very useful.

For instance, the Gutenberg tags for Othello are:

while the tags for Vanity Fair are

The Prince and the Pauper's tag list looks like:

while Captains Courageous looks like

I can understand wanting to tag details like this, but few of those tags are helpful when I'm browsing books on my little handheld device. I can't imagine sitting down to read and thinking, "Let's see, what books do I have on Interracial marriage? Or Saltwater fishing? No, on second thought I'd rather read some fiction set in the time of Edward VI, King of England, 1537-1553."

And of course, with over 90 books loaded on my ebook readers, it means I have hundreds of entries in my tags list, with few of them including more than one book.

Clearly what I needed to do was to change the tags on my ebooks.

Viewing and modifying epub tags

That ought to be simple, right? But ebooks are still a very young technology, and there's surprisingly little software devoted to them. Calibre can probably do it if you don't mind maintaining your whole book collection under calibre; but I like to be able to work on files one at a time or in small groups. And I couldn't find a program that would let me do that.

What to do? Well, epub is a fairly simple XML format, right? So modifying it with Python shouldn't be that hard.

Managing epub in Python

An epub file is a collection of XML files packaged in a zip archive. So I unzipped one of my epub books and poked around. I found the tags in a file called content.opf, inside a <metadata> tag. They look like this:

<dc:subject>Science fiction</dc:subject>

So I could use Python's zipfile module to access the content.opf file inside the zip archive, then use the xml.dom.minidom parser to get to the tags. Writing a script to display existing tags was very easy.
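
A minimal sketch of that display step, written for Python 3 and not taken from the actual script. The helper names find_opf_name and get_tags are made up for illustration, and picking the first member that ends in .opf is an assumption: strictly speaking, META-INF/container.xml is what names the package file.

```python
import zipfile
from xml.dom import minidom


def find_opf_name(zf):
    """Return the first archive member whose name ends in .opf.

    Assumption: there is exactly one such file (often content.opf).
    """
    for name in zf.namelist():
        if name.lower().endswith(".opf"):
            return name
    raise ValueError("no .opf file found in archive")


def get_tags(epub):
    """Return the text of every <dc:subject> element in an epub.

    `epub` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        dom = minidom.parseString(zf.read(find_opf_name(zf)))

    tags = []
    for node in dom.getElementsByTagName("dc:subject"):
        # Collect the element's text children into one string.
        text = "".join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE)
        tags.append(text.strip())
    return tags
```

Calling get_tags("somebook.epub") should then give a plain list of strings, one per subject tag, in document order.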

What about replacing the old, unwieldy tag list with new, simple tags?

It's easy enough to add nodes in Python's minidom. So the trick is writing it back to the epub file. The zipfile module doesn't have a way to modify a zip file in place, so I created a new zip archive and copied files from the old archive to the new one, replacing content.opf with a new version.
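
Here is a rough Python 3 sketch of both halves: swapping the subject nodes in the DOM, then copying the archive with one member replaced. The function names set_tags and rewrite_epub are hypothetical, and the code assumes the metadata element is literally named metadata (some files use a prefixed name such as opf:metadata, which this would miss).

```python
import zipfile
from xml.dom import minidom


def set_tags(dom, new_tags):
    """Replace every <dc:subject> element with one element per new tag."""
    # Copy the node list first, since we remove nodes while looping.
    for old in list(dom.getElementsByTagName("dc:subject")):
        old.parentNode.removeChild(old)

    # Assumption: an unprefixed <metadata> element exists.
    metadata = dom.getElementsByTagName("metadata")[0]
    for tag in new_tags:
        elem = dom.createElement("dc:subject")
        elem.appendChild(dom.createTextNode(tag))
        metadata.appendChild(elem)


def rewrite_epub(src, dst, opf_name, new_opf_bytes):
    """Copy all members of src into a new archive dst, substituting
    new_opf_bytes for the member named opf_name."""
    with zipfile.ZipFile(src) as izf, zipfile.ZipFile(dst, "w") as ozf:
        for info in izf.infolist():
            if info.filename == opf_name:
                data = new_opf_bytes
            else:
                data = izf.read(info)
            # Reusing the ZipInfo keeps each member's name, timestamp
            # and compression type, and preserves the original order.
            ozf.writestr(info, data)
```

Copying in infolist() order matters a little for epub, because readers expect the mimetype entry to come first in the archive.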

Python's difficulty with character sets in XML

But I hit a snag in writing the new content.opf. Python's XML classes have a toprettyxml() method to write the contents of a DOM tree. Seemed simple, and that worked for several ebooks ... until I hit one that contained a non-ASCII character. Then Python threw a UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 606: ordinal not in range(128).

Of course, there are ways (lots of them) to encode that output string -- I could do

ozf.writestr(info, dom.toprettyxml().encode(encoding, 'xmlcharrefreplace'))

or

ozf.writestr(info, dom.toprettyxml(encoding=encoding))

Except ... what should I pass as the encoding? The content.opf file started with its encoding:

<?xml version='1.0' encoding='UTF-8'?>

but Python's minidom offers no way to get that information. In fact, none of Python's XML parsers seem to offer this.

Since you need a charset to avoid the UnicodeEncodeError, the only options are (1) always use a fixed charset, like utf-8, for content.opf, or (2) open content.opf and parse the charset line by hand after Python has already parsed the rest of the file. Yuck! So I chose the first option ... I can always revisit that if the utf-8 in content.opf ever causes problems.
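
Both options can be sketched in a few lines. This is an illustration for Python 3 (the error message above is Python 2 style), not the code from the script, and the names DEFAULT_ENCODING, declared_encoding and serialize_opf are made up here. The regular expression is only a rough match for an XML declaration at the very start of the file; anything it can't parse falls back to the fixed default.

```python
import re
from xml.dom import minidom

DEFAULT_ENCODING = "utf-8"  # option 1: always write this charset

# Option 2: pull the charset out of the <?xml ...?> line by hand.
_DECL_RE = re.compile(
    rb"^\s*<\?xml[^>]*encoding=['\"]([A-Za-z0-9._-]+)['\"]")


def declared_encoding(raw_opf, default=DEFAULT_ENCODING):
    """Return the encoding named in the XML declaration of raw_opf
    (bytes), or `default` if no declaration is found."""
    match = _DECL_RE.match(raw_opf)
    return match.group(1).decode("ascii") if match else default


def serialize_opf(dom, encoding=DEFAULT_ENCODING):
    """Serialize the DOM to bytes in the given encoding.

    With an explicit encoding, toprettyxml() returns encoded bytes and
    writes a matching encoding declaration, so non-ASCII characters
    like U+2014 no longer trip over the ascii codec.
    """
    return dom.toprettyxml(encoding=encoding)
```

The bytes from serialize_opf can be handed straight to writestr when building the new archive.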

The final script

Charset difficulties aside, though, I'm quite pleased with my epubtag.py script. It's very handy to be able to print tags on any .epub file, and after cleaning up the tags on my ebooks, it's great to be able to browse by category in FBreader. Here's the program: epubtag.py.

[ 13:05 Jun 09, 2012 | More programming ]
