Selenium: Handling Timeouts and Errors (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Thu, 11 Nov 2021

Selenium: Handling Timeouts and Errors

This is part 3 of my selenium exploration trying to fetch stories from the NY Times (as a subscriber).

At the end of Part II, selenium was running on a server with the minimal number of X and GTK libraries installed.

But now that it can run unattended, there's another problem: there are all kinds of ways this can fail, and your script needs to handle those errors somehow.

Before diving in, I should mention that for my original goal, fetching stories from the NY Times as a subscriber, it turned out I didn't need selenium after all. Since handling selenium errors turned out to be so brittle (as I'll describe in this article), I'm now using requests combined with a Python CookieJar. I'll write about that in a future article. Meanwhile ...

Handling Errors and Timeouts

Timeouts are a particular problem with selenium, because there doesn't seem to be any reliable way to change them so the selenium script doesn't hang for ridiculously long periods.

And selenium fetching is particularly prone to timing out, compared to simpler fetches like requests.get(), because selenium doesn't consider a page to be fully loaded until all of the resources it uses have been fetched: every image, every script, every ad referenced from scripts in the page -- and until all the scripts on the page have completely finished running.

One thing that's supposed to help is "eager" page loading:

options.page_load_strategy = "eager"

According to the documentation, "eager" tells the web driver to "wait until the initial HTML document has been completely loaded and parsed, and discards loading of stylesheets, images and subframes." You can also set it to "none", which means "wait until the initial page is downloaded."

It's a nice thought, but I'm not convinced these settings make any difference; with "eager" I still saw lots of NYT pages that hung for several minutes before selenium finally gave up. When you're fetching dozens of pages, several minutes per page adds up to a long time.

Another approach is to set various timeouts on the web driver. Here are three settings I tried:
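
As a rough sketch of what setting such timeouts can look like: the three methods below are the standard timeout setters on a selenium WebDriver object, but the helper name apply_timeouts and the default values (in seconds) are assumptions made up for illustration, not the settings from the original script.

```python
def apply_timeouts(driver, page_load=25, implicit=20, script=20):
    """Set the three WebDriver timeouts, all in seconds.

    The default values are illustrative guesses, not the settings
    used in the original script.
    """
    # Maximum time driver.get() waits for a page load to finish.
    driver.set_page_load_timeout(page_load)
    # How long element lookups keep polling for an element to appear.
    driver.implicitly_wait(implicit)
    # Maximum time for asynchronous scripts run via execute_async_script().
    driver.set_script_timeout(script)
    return driver
```

You would call this once, right after creating the driver and before the first .get(url).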


Again, even with all these timeouts, I still saw pages hang for multiple minutes.

Worse, if you interrupt selenium's .get(url) (say with a Ctrl-C KeyboardInterrupt), it throws the web driver into a state where it can't fetch any more pages, so all future webdriver.get() calls fail. I haven't figured out how to reset it, short of creating a whole new webdriver object. Searching for solutions, I found several people advocating calling webdriver.back() but that never made any difference for me.
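
Since a whole new webdriver object is the only reset that worked, here is a minimal sketch of swapping out a wedged driver. The names replace_driver and make_driver are hypothetical: make_driver stands for whatever zero-argument function builds your configured driver. I can't promise this recovers from every bad state; it only packages up the "start over" approach.

```python
import sys

def replace_driver(old_driver, make_driver):
    """Discard a possibly wedged driver and return a fresh one.

    make_driver is a hypothetical zero-argument callable that builds
    a configured driver (options, timeouts and so on).
    """
    try:
        # quit() ends the session and should close the browser it started.
        old_driver.quit()
    except Exception as e:
        # A wedged driver may fail even here; report it and carry on.
        print("Couldn't quit old driver:", e, file=sys.stderr)
    return make_driver()
```

After an interrupted or failed fetch, you would reassign the driver variable to the return value before trying the next URL.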

I ended up trying all of the above, in the hope that each one might help with some pages; then I kept track of errors and timeouts in other pages, and if I saw three or more, I gave up on fetching any more stories that day.

But even keeping track of errors is tricky because there are so many different exceptions that are defined in different places. And they can happen either in the webdriver.get(url) or later when you try to get webdriver.page_source. Here's code to capture the exceptions I saw in a week of testing:

import sys

from urllib3.exceptions import MaxRetryError, NewConnectionError
from selenium.common.exceptions import TimeoutException

# Give up for the day after this many errors or timeouts.
MAX_TIMEOUTS = 3

num_timeouts = 0

# webdriver here is the already-created selenium driver object, and
# timeout_boilerplate(url, message) is a helper defined elsewhere in the script.

def fetch_article(url):
    global num_timeouts

    if num_timeouts >= MAX_TIMEOUTS:
        return timeout_boilerplate(url, "Gave up after earlier timeout")

    try:
        webdriver.get(url)

    except TimeoutException as e:
        num_timeouts += 1
        print("EEK! TimeoutException", e, file=sys.stderr)
        return timeout_boilerplate(url, "TimeoutException")

    except (ConnectionRefusedError, MaxRetryError, NewConnectionError) as e:
        # MaxRetryError and NewConnectionError come from urllib3.exceptions
        # ConnectionRefusedError is a Python builtin.
        num_timeouts += 1
        print("EEK! Connection error", e, file=sys.stderr)
        # (assumed) bail out the same way as for a timeout
        return timeout_boilerplate(url, "Connection error")

    except Exception as e:
        num_timeouts += 1
        print("EEK! Unexpected exception in webdriver.get: " + str(e))
        return timeout_boilerplate(url, "Unexpected exception")

    try:
        fullhtml = webdriver.page_source

    except Exception as e:
        num_timeouts += 1
        print("EEK! Fetched page but couldn't get html: " + str(e))
        return timeout_boilerplate(url, "Couldn't get html")

    # Got it, whew! Proceed with processing fullhtml.

With that, plus the shorter timeouts I mentioned earlier, my automated selenium script usually managed to fetch at least most of the stories. But there's one problem left.

Killing Firefox

At one point, after making some changes to the timeouts, I tried to ssh to the server to verify the script had run. And I found the server wasn't allowing ssh connections, though it responded to pings.

Logging on to the machine's console revealed that it was out of memory. It turns out that Firefox continues running after selenium exits; and since it's loaded up with a bunch of slow javascript advertising scripts, Firefox may keep running forever, taking up increasing amounts of memory, unless something explicitly stops it.

So I added some code to kill firefox at the end of the cron job (a shell script) that runs the selenium Python script:

echo "Killing firefox"
pkill -e firefox
sleep 10
echo "Killing -9 firefox, just in case"
pkill -e --signal 9 firefox
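
On the Python side, one thing that may reduce leftover browsers is making sure the driver's quit() method is called however the script ends. This is a sketch under assumptions: managed_driver and make_driver are hypothetical names, and quit() can't run if the Python process itself is killed, and a hung browser might not respond to it, so the pkill lines are still worth keeping as a backstop.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(make_driver):
    """Create a driver and always call quit() on it afterwards.

    make_driver is a hypothetical zero-argument callable that
    returns a configured selenium driver.
    """
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions, including Ctrl-C.
        driver.quit()
```

The fetching loop would then live inside a "with managed_driver(make_driver) as driver:" block.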

I'm sure that if I'd continued to use the daily selenium script, I'd have discovered more gotchas like that. This automated-firefox-via-selenium business is more art than science, and it's a world of pain compared to just fetching stories with python-requests like I can do with every other site. Luckily for me, as I mentioned at the beginning, in the specific case of the NY Times it turned out I didn't really need selenium, I just needed cookies from my Firefox profile.

Still, if nothing else, it pushed me to learn selenium. I can use that knowledge for writing automated tests for some of my JavaScript web apps that I've been unable to test up to now.

And since this was originally inspired by a question from a local LUG meeting, I'll be giving an informal tutorial on selenium at tonight's (11/11/2021) NMGLUG meeting, 5:30-7 Mountain Time. It's on jitsi, so feel free to show up even if you're not local, if you want to talk selenium or scraping or anything else.

[ 12:07 Nov 11, 2021 | programming ]