This is part 2 of my selenium exploration trying to fetch stories from the NY Times (as a subscriber).
- Part I: Selenium Basics
- Part II: Running Headless on a Server (this article)
- Part III: Handling Errors and Timeouts
When we left off, I was learning the basics of selenium in order to fetch stories (as a subscriber) from the New York Times. Fetching stories was working properly, and all that remained was to put it in an automated script, then move it to a server where it could run automatically without my desktop machine needing to be on.
Unfortunately, that turned out to be the hardest part of the problem.
Installing Selenium and Firefox Without a Desktop
The problem is that you need to run a full browser (I was using Firefox) on a web or file server that doesn't have X or a desktop installed.
You could just
apt install python3-selenium firefox-esr
(or equivalent for your distro; firefox-esr is the Extended Support Release,
the only Firefox package in the standard Debian repositories:
it's a little more stable than cutting-edge Firefox, and changes less often)
and allow the resulting 156 packages to be installed.
But I prefer to keep servers as lean as possible, installing
only the packages I really need.
It helps quite a bit to turn off recommended and suggested packages.
apt install --no-install-recommends --no-install-suggests python3-selenium
brings the number of new packages installed down to 31.
Or you can get Firefox from mozilla.org, which also offers the ESR release. Of course, you'll still need those other dependencies, including X11 (even though you won't be running an X server) plus libraries like GTK; see below for that.
You'll also need geckodriver, which selenium needs in order to control Firefox. Ubuntu has a package firefox-geckodriver, but Debian doesn't -- even though installing selenium on Debian suggests installing firefoxdriver, which turns out not to exist. If your server doesn't have geckodriver as a package, you can get it from Mozilla's github: github.com/mozilla/geckodriver/releases. It's odd that it's only available from github, not anywhere on mozilla.org, given that Mozilla normally doesn't even use git (they use Mercurial as their version control system), but there you go.
The tarballs from that github link extract into a single executable. I moved that executable into the directory where I'd installed Firefox, ~/firefox-esr, so I could add that directory to my PATH. Selenium lets you specify the path to the geckodriver executable, but then geckodriver won't be able to find Firefox unless it's somewhere in your PATH. Easiest to make sure they're both on the path.
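Here's a minimal sketch of checking that from Python before starting selenium. The function names add_to_path and missing_executables are made up for this example, and ~/firefox-esr is just the directory I used; adjust it to wherever you unpacked Firefox and geckodriver.

```python
import os
import shutil


def add_to_path(directory, environ=os.environ):
    """Prepend directory to PATH in the given environment mapping."""
    directory = os.path.expanduser(directory)
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    environ["PATH"] = os.pathsep.join(parts)
    return environ["PATH"]


def missing_executables(names, path=None):
    """Return the names that can't be found on path (default: current PATH)."""
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Assumption: firefox and geckodriver both live in ~/firefox-esr.
    add_to_path("~/firefox-esr")
    missing = missing_executables(["firefox", "geckodriver"])
    if missing:
        print("Not found in PATH:", ", ".join(missing))
    else:
        print("firefox and geckodriver are both in PATH")
```

Changing os.environ only affects the running script and anything it launches, which is enough for geckodriver to find Firefox; it doesn't touch your shell's PATH.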
Set Up a Virtual X Environment
As mentioned earlier, you'll need the X server installed even though you won't be running it interactively, plus some other libraries like GTK.
For Firefox-ESR from Mozilla.org and geckodriver from github, here's what I needed:
apt install --no-install-recommends --no-install-suggests \
    xvfb python3-xvfbwrapper libgtk-3-0 libdbus-glib-1-2
This pulls in a total of 61 packages, 51 of which are related to libgtk-3.0.
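This article doesn't show python3-xvfbwrapper in use, so here is one way it might be used, as a sketch rather than what my script actually does. The helper name run_with_virtual_display and the default screen size are made up for the example; Xvfb is the class provided by the xvfbwrapper module.

```python
def run_with_virtual_display(func, width=1280, height=1024):
    """Call func() while an Xvfb virtual display is running."""
    # Imported lazily so this file still loads where xvfbwrapper is absent.
    from xvfbwrapper import Xvfb

    # Xvfb works as a context manager: it starts the virtual X server,
    # points DISPLAY at it, and stops the server again on exit.
    with Xvfb(width=width, height=height):
        return func()
```

Anything that needs an X display, such as a non-headless Firefox started from selenium, could be launched inside the function passed to run_with_virtual_display.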
Test Firefox and Create a Profile
Now you theoretically have enough to run Firefox. To make sure it runs,
you need to be able to display X clients. The easiest way is to
ssh -X from another machine to your server
as the user you plan to use for selenium,
and verify that Firefox runs and can create a new profile.
localhost% ssh -X user@servername
servername% cd firefox-esr
servername% ./firefox -p
The -p tells firefox to start the profile manager so you can create a new profile.
I had originally intended to copy my selenium/NYT profile from my desktop machine, but that didn't work because the Firefox versions were too different.
I named the new profile "selenium"; I wrote FeedMe's nyt-selenium helper to look for a profile that has "selenium" in its name.
Once your new profile is created, you can run it from the profile
manager, or start firefox with
firefox -P selenium.
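Looking up a profile by name can be sketched roughly like this. This is not the actual FeedMe code, just an illustration: find_profile_dir is a made-up name, and it assumes profiles live in the usual ~/.mozilla/firefox directory.

```python
import os


def find_profile_dir(keyword="selenium", base="~/.mozilla/firefox"):
    """Return the first profile directory under base whose name
    contains keyword, or None if there isn't one."""
    base = os.path.expanduser(base)
    if not os.path.isdir(base):
        return None
    # Profile directories are named random-string.profilename,
    # so a substring match on the directory name is enough here.
    for name in sorted(os.listdir(base)):
        full = os.path.join(base, name)
        if keyword in name and os.path.isdir(full):
            return full
    return None
```

Searching by name means the script doesn't need the random string hardcoded, so it keeps working if the profile is ever recreated.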
Now, assuming both firefox and geckodriver are in your PATH, you can run automated selenium scripts:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import os

# random-string is a placeholder: use your profile's actual directory name
foxprofiledir = os.path.expanduser("~/.mozilla/firefox/random-string.selenium")

options = Options()
options.headless = True

driver = webdriver.Firefox(firefox_profile=foxprofiledir, options=options)

(This will still give the DeprecationWarning: firefox_profile has been deprecated, please pass in a Service object mentioned in the previous article.)
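For what it's worth, selenium 4 also accepts a Service object plus an Options object, which is one possible way to avoid the deprecated arguments. The following is a sketch under that assumption, not something I've run on the server: headless_firefox_args and make_driver are made-up names, geckodriver_path is a hypothetical parameter, and the profile is selected with Firefox's own -profile command-line argument instead of firefox_profile.

```python
def headless_firefox_args(profile_dir):
    """Command-line arguments asking Firefox to run headless
    using the profile stored in profile_dir."""
    return ["-headless", "-profile", profile_dir]


def make_driver(profile_dir, geckodriver_path):
    """Start Firefox through selenium 4's Service/Options interface."""
    # Imported here so the helper above works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    for arg in headless_firefox_args(profile_dir):
        options.add_argument(arg)

    # The Service object tells selenium where geckodriver lives.
    service = Service(executable_path=geckodriver_path)
    return webdriver.Firefox(service=service, options=options)
```

Passing -profile makes Firefox use that directory directly, so anything the session changes (cookies, for instance) ends up in the real profile.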
Put the Log File Somewhere Sensible
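By default, selenium writes geckodriver.log into whatever directory the script happens to be run from. Passing service_log_path puts it somewhere of your choosing:
driver = webdriver.Firefox(firefox_profile=foxprofiledir, options=options, service_log_path=path_to_log_file)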
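Choosing that log path might look something like the sketch below. The directory ~/.cache/nyt-selenium and the function name log_file_path are made up for illustration; any directory the selenium user can write to would do.

```python
import os


def log_file_path(log_dir="~/.cache/nyt-selenium", filename="geckodriver.log"):
    """Return a path for the geckodriver log inside log_dir,
    creating the directory first if it doesn't exist."""
    log_dir = os.path.expanduser(log_dir)
    os.makedirs(log_dir, exist_ok=True)
    return os.path.join(log_dir, filename)
```

The result could then be used as path_to_log_file, so the log lands in the same place no matter which directory cron or a shell starts the script from.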
With that, your selenium script can run headless on a server.
But there's one more piece: if the code is to run unattended, it needs to handle errors and timeouts. And that's an aspect of selenium that doesn't seem to work very well and isn't well documented, so I've amassed a collection of hacky techniques to deal with it. I'll describe those in the next article.
[ 12:18 Nov 07, 2021 ]