This is Part I of my selenium exploration.
- Part I: Selenium Basics (this article)
- Part II: Running Headless on a Server
- Part III: Handling Errors and Timeouts
At the New Mexico GNU & Linux User Group, currently meeting virtually on Jitsi, someone expressed interest in scraping websites. Since I do quite a bit of scraping, I offered to give a tutorial on scraping with the Python module BeautifulSoup.
"What about selenium?" he asked. Sorry, I said, I've never needed selenium enough to figure it out.
But then a week later, I found I did have a need.
All I knew about selenium was that it was a way of running a full browser under programmatic control. But that was exactly what I'd need for fetching stories from the NYT. Maybe I needed to look into this selenium thing after all.
The most basic use of selenium is very easy. In the Python 3 console:
>>> from selenium import webdriver
>>> webdriver = webdriver.Firefox()   # A browser window pops up
>>> webdriver.get("http://shallowsky.com/blog/")
>>> fullhtml = webdriver.page_source

(I'm using firefox here, but there's also a chromium driver that works very similarly.)
The full HTML source of the page is now in fullhtml.
Or you can look for specific elements. For instance, suppose I want to loop over all the <h2 class="story"> tags and print their contents:
for story in webdriver.find_elements_by_class_name("story"):
    print(story.get_attribute('innerHTML'))

... except, wait, that won't work, because there's a <div class="story"> that contains the <h2 class="story">. That selector will get both the divs and the h2s, and the innerHTML of the div is the title plus the whole teaser that follows it.

So you could do something like:

for story in webdriver.find_elements_by_class_name("story"):
    if story.tag_name == "h2":
        print(story.get_attribute('innerHTML'))
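To see concretely why a class-based selector matches both elements, here's a selenium-free sketch using only Python's standard-library html.parser. The sample HTML and the StoryFinder class are made up for illustration; they just mimic the div-wrapping-h2 structure described above:

```python
from html.parser import HTMLParser

# A tiny sample page shaped like the blog's markup: a div with
# class="story" wrapping an h2 with the same class.
SAMPLE = '<div class="story"><h2 class="story">Title</h2><p>Teaser...</p></div>'

class StoryFinder(HTMLParser):
    """Collect the tag name of every element with class="story"."""
    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "story":
            self.matches.append(tag)

finder = StoryFinder()
finder.feed(SAMPLE)

# Selecting by class alone matches both elements...
print(finder.matches)                              # ['div', 'h2']
# ...so you have to filter by tag name too, as in the loop above.
print([t for t in finder.matches if t == "h2"])    # ['h2']
```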
If you need to combine different selectors in the same query, you have to use an XPath:
for story in webdriver.find_elements_by_xpath("//h2[@class='story']"):
    print(story.get_attribute('innerHTML'))
Read more about element selection in the documentation. It's not as flexible or easy to use as BeautifulSoup, but if you get frustrated, you always have the option of getting the full HTML string and feeding that to BeautifulSoup.
The simple selenium example above created its own new Firefox profile, a perfectly reasonable thing to do so it doesn't interfere with any Firefox that might already be running.
But of course, to do something like fetch NY Times stories as a subscriber, I need a browser profile where I'm logged in and have all the appropriate cookies set.
I ran firefox -p to bring up the profile manager, and created a new profile called selenium. Firefox likes to add random strings to profile directories, so the directory it actually created was something like ~/.mozilla/firefox/random-string.selenium.
I ran firefox with that profile, went to nytimes.com and logged in to my subscriber account, then exited firefox.
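Since the profile directory name includes that random string, a small helper can locate it rather than hard-coding it. This is just a convenience sketch of my own, not part of the original workflow; it assumes the profile was named selenium as above:

```python
import glob
import os

def find_profile_dir(suffix="selenium", base="~/.mozilla/firefox"):
    """Return the Firefox profile directory ending in .<suffix>,
    or None if there isn't exactly one match."""
    pattern = os.path.join(os.path.expanduser(base), "*." + suffix)
    matches = glob.glob(pattern)
    return matches[0] if len(matches) == 1 else None
```

Then foxprofiledir = find_profile_dir() picks up the right directory no matter what random string Firefox chose.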
Now to access that profile from selenium. Here's what nearly all pages tell you to do, and it works:
>>> import os
>>> foxprofiledir = os.path.expanduser("~/.mozilla/firefox/random-string.selenium")
>>> webdriver = webdriver.Firefox(firefox_profile=foxprofiledir)
<stdin>:1: DeprecationWarning: firefox_profile has been deprecated, please pass in a Service object
>>>
I'm running selenium 4.0.0~a1, the version in Ubuntu hirsute's python3-selenium package. Presumably all the web tutorials are written for selenium 3 -- as is the online documentation. I haven't been able to find either examples or documentation of this mysterious "Service object" that selenium 4 apparently wants. Fortunately, despite the deprecation warning, this method seems to work fine, and I'm able to webdriver.get() NYT pages. Perhaps selenium's documentation will eventually catch up with its code and I'll be able to get rid of the warning message.
In all these tests, selenium popped up a browser window and I could see each page load. That's great for testing, but the point of the exercise is to automate the page fetching, and you really don't want a visible browser window popping up for that. Fortunately it's easy to suppress the browser window by running headless:
>>> from selenium.webdriver.firefox.options import Options
>>>
>>> options = Options()
>>> options.headless = True
>>>
>>> webdriver = webdriver.Firefox(firefox_profile=foxprofiledir, options=options)
With that, I could make a feedme helper (once I patched feedme to accept helper modules, something I'd never needed before) that checks the NY Times RSS feed, loops over new stories, and uses the selenium firefox profile to fetch the stories.
This, alas, is only part of the problem (the easy part). It worked okay as long as I was running the script interactively on my own machine. But the point of this was to run the selenium script automatically each day, creating files that FeedMe could pick up as part of my daily feeds -- even if I'm not at home, even if my laptop isn't running.
I needed to run selenium from a server that's always running. That turned out to be quite a bit more difficult, since servers don't generally have X and GUI libraries and firefox installed. I'll address those issues in a separate article.
[ 19:58 Nov 02, 2021 More programming | permalink to this entry | ]