This is part 3 of my selenium exploration, trying to fetch stories from the NY Times (as a subscriber).
- Part I: Selenium Basics
- Part II: Running Headless on a Server
- Part III: Handling Errors and Timeouts (this article)
At the end of Part II, selenium was running on a server with the minimal number of X and GTK libraries installed.
But now that it can run unattended, there's another problem: there are all kinds of ways this can fail, and your script needs to handle those errors somehow.
Before diving in, I should mention that for my original goal, fetching stories from the NY Times as a subscriber, it turned out I didn't need selenium after all. Since handling selenium errors turned out to be so brittle (as I'll describe in this article), I'm now using requests combined with a Python CookieJar. I'll write about that in a future article. Meanwhile ...
Handling Errors and Timeouts
Timeouts are a particular problem with selenium, because there doesn't seem to be any reliable way to change them so the selenium script doesn't hang for ridiculously long periods.
And selenium fetching is particularly prone to timing out, compared to simpler fetches like requests, because selenium doesn't consider a page to be fully loaded until all of the resources it uses have been fetched: every image, every script, every ad referenced from scripts in the page -- and until all the scripts on the page have completely finished running.
One thing that's supposed to help is "eager" page loading:

```python
options.page_load_strategy = "eager"
```

According to the documentation, "eager" tells the web driver to "wait until the initial HTML document has been completely loaded and parsed, and discards loading of stylesheets, images and subframes." You can also set it to "none", which means "wait until the initial page is downloaded."
It's a nice thought, but I'm not convinced these strategies make any difference; with "eager" I still saw lots of NYT pages hang for several minutes before selenium finally gave up. When you're fetching dozens of pages, several minutes per page adds up to a long time.
Another approach is to set various timeouts on the web driver. Here are three settings I tried:
```python
webdriver.set_page_load_timeout(25)
webdriver.implicitly_wait(20)
webdriver.set_script_timeout(20)
```
Again, even with all these timeouts, I still saw pages hang for multiple minutes.
Worse, if you interrupt selenium's fetch (say with a Ctrl-C), it throws the web driver into a state where it can't fetch any more pages, so all future webdriver.get() calls fail. I haven't figured out how to reset it, short of creating a whole new webdriver object.
Searching for solutions, I found several people advocating webdriver.back(), but that never made any difference for me.
I ended up trying all of the above, in the hope that each one might help with some pages; then I kept track of errors and timeouts in other pages, and if I saw three or more, I gave up on fetching any more stories that day.
But even keeping track of errors is tricky, because there are so many different exceptions, defined in different places. And they can happen either in the webdriver.get() call itself, or later when you try to get the page content.
Here's code to capture the exceptions I saw in a week of testing:
```python
import sys
import traceback

from urllib3.exceptions import MaxRetryError, NewConnectionError
from selenium.common.exceptions import TimeoutException

num_timeouts = 0
MAX_TIMEOUTS = 3

def fetch_article(url):
    global num_timeouts
    if num_timeouts >= MAX_TIMEOUTS:
        return timeout_boilerplate(url, "Gave up after earlier timeout")

    try:
        webdriver.get(url)
    except TimeoutException as e:
        num_timeouts += 1
        print("EEK! TimeoutException", e, file=sys.stderr)
        return timeout_boilerplate(url, "TimeoutException")
    except (ConnectionRefusedError, MaxRetryError, NewConnectionError) as e:
        # MaxRetryError and NewConnectionError come from urllib3.exceptions;
        # ConnectionRefusedError is a Python builtin.
        num_timeouts += 1
        print("EEK! Connection error", e, file=sys.stderr)
        traceback.print_exc(file=sys.stderr)
    except Exception as e:
        num_timeouts += 1
        print("EEK! Unexpected exception in webdriver.get: " + str(e))

    try:
        fullhtml = webdriver.page_source
    except Exception as e:
        num_timeouts += 1
        print("EEK! Fetched page but couldn't get html: " + str(e))
        return timeout_boilerplate(url, "Couldn't get page source")

    # Got it, whew! Proceed with processing fullhtml.
```
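The timeout_boilerplate() helper isn't shown above; it just generates placeholder HTML recording which URL failed and why. Something along these lines would do (this version is my illustration, not necessarily the exact code):

```python
def timeout_boilerplate(url, errstr):
    """Hypothetical stand-in: placeholder HTML for a story
       that couldn't be fetched, recording the URL and the error."""
    return ('<html><body>\n'
            '<h1>Fetch failed</h1>\n'
            f'<p>{errstr} while fetching <a href="{url}">{url}</a></p>\n'
            '</body></html>')
```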
With that, plus trying to set the shorter timeouts I mentioned earlier, most of the time my automated selenium fetched at least most of the stories. But there's one problem left.
At one point, after making some changes to the timeouts, I tried to ssh to the server to verify the script had run. And I found the server wasn't allowing ssh connections, though it responded to pings.
So I added some code to kill firefox at the end of the cron job (a shell script) that runs the selenium Python script:
```shell
echo "Killing firefox"
pkill -e firefox
sleep 10
echo "Killing -9 firefox, just in case"
pkill -e --signal 9 firefox
```
I'm sure that if I'd continued to use the daily selenium script, I'd have discovered more gotchas like that. This automated-firefox-via-selenium business is more art than science, and it's a world of pain compared to just fetching stories with python-requests like I can do with every other site. Luckily for me, as I mentioned at the beginning, in the specific case of the NY Times it turned out I didn't really need selenium: I just needed cookies from my Firefox profile.
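As a teaser for that future article, the requests-plus-cookies approach boils down to something like this. It's only a sketch: it assumes you've exported your NYT cookies from Firefox to a Netscape-format cookies.txt file (e.g. with a cookie-export extension), and the filename is a placeholder:

```python
# Sketch: fetch pages with requests plus cookies exported from Firefox,
# instead of driving a browser with selenium. Assumes a Netscape-format
# cookies.txt exported from the Firefox profile; filename is a placeholder.
import http.cookiejar
import requests

def fetch_with_cookies(url, cookie_file="cookies.txt"):
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    jar.load(ignore_discard=True, ignore_expires=True)
    session = requests.Session()
    session.cookies = jar    # requests accepts any cookielib CookieJar
    response = session.get(url, timeout=30)   # a timeout that actually works
    return response.text
```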
And since this was originally inspired by a question from a local LUG meeting, I'll be giving an informal tutorial on selenium at tonight's (11/11/2021) NMGLUG meeting, 5:30-7 Mountain Time. It's on jitsi, so feel free to show up even if you're not local, if you want to talk selenium or scraping or anything else.
[ 12:07 Nov 11, 2021 More programming | permalink to this entry | ]