Configuring Selenium to Run Headless, Without a Desktop (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sun, 07 Nov 2021

Configuring Selenium to Run Headless, Without a Desktop

This is part 2 of my selenium exploration, trying to fetch stories from the NY Times (as a subscriber).

Part I: Selenium Basics
Part II: Running Headless on a Server (this article)
Part III: Handling Errors and Timeouts

When we left off, I was learning the basics of selenium in order to fetch stories (as a subscriber) from the New York Times. Fetching stories was working properly, and all that remained was to put it in an automated script, then move it to a server where it could run automatically without my desktop machine needing to be on.

Unfortunately, that turned out to be the hardest part of the problem.

Installing Selenium and Firefox Without a Desktop

The problem is that you need to run a full browser (I was using Firefox) on a web or file server that doesn't have X or a desktop installed.

You could just apt install python3-selenium firefox-esr (or the equivalent for your distro; firefox-esr is the Extended Support Release, the only Firefox package in the standard Debian repositories: it's a little more stable than cutting-edge Firefox, and changes less often) and allow the resulting 156 packages to be installed. But I prefer to keep servers as lean as possible, installing only the packages I really need.

It helps quite a bit to turn off recommended and suggested packages. apt install --no-install-recommends --no-install-suggests python3-selenium brings the number of new packages installed down to 31.

Or you can download Firefox from mozilla.org (an ESR release is available there too). Of course, you'll still need those other dependencies, including X11 (even though you won't be running an X server) plus libraries like GTK; see below for that.

You'll also need geckodriver, the helper program selenium uses to control Firefox. Ubuntu has a package, firefox-geckodriver, but Debian doesn't, even though installing selenium on Debian suggests installing firefoxdriver, which turns out not to exist. If your server doesn't have geckodriver as a package, you can get it from Mozilla's github: github.com/mozilla/geckodriver/releases. It's odd that it's only available from github, not anywhere on mozilla.org, given that Mozilla normally doesn't even use git (they use Mercurial as their version control system), but there you go.

The tarballs from that github link extract into a single executable. I moved that executable into the directory where I'd installed Firefox, ~/firefox-esr, so I could add that directory to my PATH. Selenium lets you specify the path to the geckodriver executable, but then geckodriver won't be able to find Firefox unless it's somewhere in your PATH. Easiest to make sure they're both on the path.
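Before going further, it can save some head-scratching to confirm that both executables really are findable. Here's a minimal sketch using only the Python standard library; the function name find_missing is just something made up for this illustration, and it assumes the executables are named firefox and geckodriver.

```python
import shutil


def find_missing(programs):
    """Return the names in `programs` that cannot be found on PATH."""
    return [name for name in programs if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names are an assumption; adjust if yours are named differently.
    missing = find_missing(["firefox", "geckodriver"])
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("firefox and geckodriver are both on PATH")
```

Run it as the same user, and with the same environment, that the selenium script will use: a PATH set in an interactive shell's startup files may not be what a cron job sees.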

Set Up a Virtual X Environment

As mentioned earlier, you'll need the X server installed even though you won't be running it interactively, plus some other libraries like GTK.

For Firefox-ESR from Mozilla.org and geckodriver from github, here's what I needed:

apt install --no-install-recommends --no-install-suggests \
    xvfb python3-xvfbwrapper libgtk-3-0 libdbus-glib-1-2

This pulls in a total of 61 packages, 51 of which are related to libgtk-3-0.

Test Firefox and Create a Profile

Now you theoretically have enough to run Firefox. To make sure it runs, you need to be able to display X clients. The easiest way is to ssh -X from another machine to your server as the user you plan to use for selenium, and verify that Firefox runs and can create a new profile.

localhost% ssh -X user@servername
servername% cd firefox-esr
servername% ./firefox -p

The -p tells firefox to start the profile manager so you can create a new profile.

I had originally intended to copy my selenium/NYT profile from my desktop machine, but that didn't work because the Firefox versions were too different.

I named the new profile "selenium"; I wrote FeedMe's nyt-selenium helper to look for a profile that has "selenium" in its name.

Once your new profile is created, you can run it from the profile manager, or start firefox with firefox -P selenium.
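Firefox puts a random string in front of the profile name when it creates the profile directory, so a script can't know the full path ahead of time. Searching for a directory with "selenium" in its name gets around that. This is only a rough sketch of the idea, not the actual code from FeedMe's helper; find_profile_dir and its parameters are names invented for the example, and it assumes profiles live under ~/.mozilla/firefox.

```python
import os
from pathlib import Path


def find_profile_dir(keyword="selenium", base="~/.mozilla/firefox"):
    """Return the first profile directory under `base` whose name
    contains `keyword`, or None if there isn't one."""
    basedir = Path(os.path.expanduser(base))
    if not basedir.is_dir():
        return None
    # Sort so the result is predictable if several profiles match.
    for entry in sorted(basedir.iterdir()):
        if entry.is_dir() and keyword in entry.name:
            return str(entry)
    return None


if __name__ == "__main__":
    print(find_profile_dir() or "No selenium profile found")
```

If more than one profile matches, this returns whichever sorts first, so it's safest to keep only one profile with "selenium" in its name.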

Now, assuming both firefox and geckodriver are in your PATH, you can run automated selenium scripts:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import os

# Replace random-string with the prefix Firefox generated for your profile
foxprofiledir = os.path.expanduser("~/.mozilla/firefox/random-string.selenium")

options = Options()
options.headless = True

driver = webdriver.Firefox(firefox_profile=foxprofiledir,
                           options=options)

(This will still give the

DeprecationWarning: firefox_profile has been deprecated, please pass
in a Service object

mentioned in the previous article.)

Put the Log File Somewhere Sensible

One more thing: the geckodriver creates a verbose log file called geckodriver.log containing all the warnings Firefox spewed during its run, mostly hundreds of lines of JavaScript warnings from poorly written scripts. It creates this log in the current directory. If your selenium script runs from a directory that's not writable by your user, the whole process will fail. So choose someplace you'd like the log to go, like /tmp/geckodriver.log if you don't have a better place for it, and specify that when you create the webdriver:

driver = webdriver.Firefox(firefox_profile=foxprofiledir,
                           options=options,
                           service_log_path=path_to_log_file)
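If you'd rather not hardwire the log location, one possible approach is to use the current directory when it's writable and fall back to the system temp directory otherwise. The sketch below uses only the standard library; choose_log_path is a made-up helper name, and the fallback policy is an assumption about what you might want, not anything selenium or geckodriver does for you.

```python
import os
import tempfile


def choose_log_path(filename="geckodriver.log", preferred_dir=None):
    """Return a log path in preferred_dir (default: the current directory)
    if that directory exists and is writable, else in the system temp dir."""
    directory = preferred_dir or os.getcwd()
    if not (os.path.isdir(directory) and os.access(directory, os.W_OK)):
        # Fall back to the system temp directory (typically /tmp on Linux)
        directory = tempfile.gettempdir()
    return os.path.join(directory, filename)


if __name__ == "__main__":
    print(choose_log_path())
```

The result could then be used as the path_to_log_file value when creating the webdriver.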

With that, your selenium script can run headless on a server.

But there's one more piece: if the code is to run unattended, it needs to handle errors and timeouts. And that's an aspect of selenium that doesn't seem to work very well and isn't well documented, so I've amassed a collection of hacky techniques to deal with it. I'll describe those in the next article.

Tags: programming, python, scraping, selenium
[ 12:18 Nov 07, 2021 ]