Converting HTML pages into PDF (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Tue, 17 Jan 2012

Converting HTML pages into PDF

I've long wanted a way of converting my HTML presentation slides to PDF. Mostly because conference organizers tend to freak out at slides in any format other than Open/Libre Office, Powerpoint or PDF. Too many times, I've submitted a tarball of my slides and discovered they weren't even listed on the conference site. (I ask you, in what way is a tarball more difficult to deal with than an .odp file?) Slide-sharing websites also have a limited set of formats they'll accept.

A year or so ago, I added screenshot capability to my webkit-based presentation program, Preso, do "screenshots", but I really needed PDF, not images.

Now, creating PDF from HTML shouldn't be that hard. Every browser has a print function that can print to a PDF file. So why is it so hard to create PDF from HTML in any kind of scriptable way?

After much searching and experimenting, I finally found a Python code snippet that worked: XHTML to PDF using PyGTK4 Webkit from Alex Dong. It uses Python-Qt, not GTK, so I can't integrate it into my Preso app, but that's okay -- a separate tool is just as good.

(I struggled to write an equivalent in PyGTK, but gave up due to the complete lack of documentation of Python-Webkit-GTK, and not much more for gtk.PrintOperation(). QWebView's documentation may not be as complete as I'd like, but at least there is some.)

Printing from QtWebView to QPrinter

Here are the important things I learned about QWebView from fiddling around with Alex's code to adapt it to what I needed, which is printing a list of pages to sequentially numbered files:

To print, you need to wait until the page has finished loading, so connect a function to SIGNAL("loadFinished(bool)"), then load(QUrl(url)).
That loadFinished function remains registered, so as you load new pages, it will be called each time. So you can load() the next URL as the last step in your loadFinished callback.
If you get confused about callbacks and connect more than one of them, bad things happen, and only the last page gets printed, or QApplication.exit() doesn't exit at all.

Things I learned about QPrinter():

All the examples I found online set the page size with lines like QPrinter.setPageSize(QPrinter.A4) or setPaperSize(QPrinter.A4) (setPageSize is apparently deprecated in favor of setPaperSize); but
If you want to set a specific size, you can do that with a line like QPrinter.setPaperSize(QSizeF(1024, 768), QPrinter.DevicePixel) The second argument (DevicePixel) is a unit, from this list.
That line gives you the right aspect ratio. But if you think "DevicePixels" means the size will correspond to pixels in your browser window (just because the documentation says so), you're sadly mistaken.
If you want a PDF page that actually corresponds to the size of your browser window, you can get it by calling QWebView.setZoomFactor(z) You'll have to experiment to find the right value of z; I found I needed about 1.24 if I wanted to capture my full 1366x768 slides, or exactly 2.0 if I wanted to restrict the saved PDF to only the 1024x758 part that shows up in the projector.

Anyway, it's a little hacky with that empirical zoom factor ... but it works! The program is here: qhtmlprint: convert HTML to PDF using Qt Webkit.

And it does produce reasonable PDF, with the text properly vectorized, not just raster screenshots of each page.

Printing the slides in the right order

Terrific -- now I can feed a list of slides to qhtmlprint and get a bunch of PDF files back. How do I print the right slides?

My slides are listed in order in an array inside a Javascript file, one per line. If I grep .html navigate.js, I get a list like this:

    "arduino.html",
    "img.html?pix/arduinos/arduino-clones.jpg",
    "getting_started.html",
    "img.html?pix/projects/led.jpg",
    //"blink.html",
    "arduino-ide.html",

To pass that to qhtmlprint, I only need to remove the commented-out lines (the ones with //) and strip off the quotes and commas. I can do that all in one command with a grep and sed pipeline:

qhtmlprint ` fgrep .html navigate.js  | grep -v // | sed -e 's/",/"/' -e 's/"//g' `

And voiaà! I have a bunch of fileNNN.pdf files.

Creating a multi-page slide deck

Okay, great! Now how do I stick those files all together into one slide deck I can submit to conference organizers?

That part's easy -- Ghostscript can do it.

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=slidedeck.pdf -dBATCH file*.pdf

And now slidedeck.pdf contains my whole presentation, ready to go.

Tags: programming, speaking, python
[ 12:16 Jan 17, 2012 More programming | permalink to this entry | ]

<	January 2012					>
Su	Mo	Tu	We	Th	Fr	Sa
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

	Feeds: RSS 2.0 \| Atom
	@akkana@fosstodon.org on Mastodon
	@akkakk on Twitter (now inactive)
	Shallow Sky Home
	Contact Akkana

Converting HTML pages into PDF (Shallow Thoughts)

Tue, 17 Jan 2012