Ever want to download RSS from news sites or blogs to your phone, PDA, ebook reader, or other handheld device to read offline?
Feedme is a program for fetching stories from RSS or Atom feeds, from news sites, blogs or any other site that offers a feed. It saves simplified HTML files, or can translate to other formats such as Epub, FB2, Plucker or plain text for devices that can't handle HTML.
Download: FeedMe is now maintained on GitHub; it's currently at 1.0b3.
You will need Python, plus the feedparser module (on Ubuntu or Arch, the package python-feedparser) and lxml (package python-lxml).
So I wrote FeedMe. I've been using it daily for many years, so I guess it works well enough for my purposes. Maybe it will work for you too.
FeedMe is sort of an RSS version of Sitescooper. By default, it produces simplified HTML, but optionally it can convert pages to EPUB or Plucker format.
The sample feedme.conf configuration file should be vaguely self-explanatory. Install it as $XDG_CONFIG_HOME/feedme/feedme.conf or ~/.config/feedme/feedme.conf. Here are the options inside it:
The feedme.conf file starts with a section of default options. These options apply to the whole feedme process:
ascii
    Convert all pages to plain ASCII. Useful for reading devices like Palm that can't display other character sets reliably.
dir
    Where to save the collected pages. See save_days for how long they will be kept.
formats
    Comma-separated list of output formats. Default "none", which results in HTML output. Other options: epub, fb2, plucker.
verbose
    Print lots of debugging chatter while feeding.
min_width
    The minimum number of characters in an item link. Links shorter than this will be padded to this length (to make tapping easier). Default 25.
save_days
    How long to retain feeds locally.
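As an illustration, the defaults at the top of feedme.conf might look something like this (the specific values here are made-up examples, not feedme's shipped defaults):

```ini
# Global defaults; any per-feed section can override these.
dir = ~/feeds
formats = epub
ascii = false
verbose = false
min_width = 25
save_days = 7
```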
These options can be set among the defaults and may be overridden for specific feeds:
continue_on_timeout
    Normally, if one page times out, feedme will assume the site is down. On sites that link to content from many different URLs, set this to true.
encoding
    Normally feedme will try to guess the encoding from the page. But some pages lie, so use this to override that.
levels
    Level 1: only save the RSS page. Level 2: save sub-pages.
nocache
    Don't check whether we've seen an entry before: collect everything.
nonlocal_images
    Normally feedme will ignore images from other domains (usually ads). But some sites link to images from all over; set this to true in that case.
skip_images
    Don't save images. Default true.
skip_links
    For sites with levels=1 where you just want a single news feed and never want to click on anything (e.g. slashdot), this can eliminate distracting links that you might tap on accidentally while scrolling.
url
    The RSS URL for the site.
To define each new feed, start by defining the URL. Then go to the RSS page in a browser, click on a story, view the HTML source of the story, and find the place where the actual story begins (it may be three quarters of the way down the page or even farther). Find something that indicates the start of the story, as distinct from the surrounding cruft, and save this as page_start. Optionally, scroll down to the end of the story and find page_end as well.
url = http://example.com/feeds.rss
page_start =
There are quite a few other options you can set for the feed. Of course, you can override values like levels, nocache, skip_images and so forth. Here are some additional options you may want for some sites to clean up the stories and eliminate stories you don't want to see. All "pats" are lists of patterns (see below for an example).
single_page_pats
    Sites that spread stories over many pages often have a naming convention in their "view as single page" URLs. This is a pattern, e.g. http://example.com/.*single.html
skip_pats
    For sites that embed annoying things in the middle of stories: maybe video clips, or ads, or distracting pull-quotes. Anything matching one of the skip_pats will be deleted from the story.
skip_link_pats
    Skip links with these patterns.
skip_title_pats
    Skip anything with titles containing these patterns.
skip_content_pats
    Skip anything whose content includes these patterns.
index_skip_content_pats
    Skip anything where the index content includes these patterns.
when
    When to check this site, if not always. May be a weekday, e.g. Sat, or a month date, e.g. 1 to check only on the first day of any month.
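Putting these pieces together, a complete feed definition might look something like the following sketch. The section name, site URL and patterns here are invented for illustration; only the option names come from the documentation above:

```ini
[Example News]
url = http://example.com/feeds.rss
levels = 2
page_start = <div class="article-body">
page_end = <div class="comments">
skip_title_pats = Sponsored:
when = Sat
```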
A few options, like page_start, page_end, single_page_pats, and skip_pats, can specify multiple patterns. Specify them by putting them on separate lines (whitespace at the beginning is optional and ignored):
skip_pats = style=".*?"
    Follow us on.*?
skip_link_pats = http://www.site.com/articles/video/
    http://www.site.com/articles/podcasts/
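This layout works because INI-style parsers such as Python's configparser treat indented continuation lines as part of the same value. A sketch of how such a multi-line option could be split into individual patterns (this is an illustration, not feedme's actual parsing code):

```python
import configparser

conf = configparser.ConfigParser()
conf.read_string("""
[Example Feed]
skip_link_pats = http://www.site.com/articles/video/
    http://www.site.com/articles/podcasts/
""")

# The continuation line comes back as part of one newline-separated value;
# split it into one pattern per nonblank line.
pats = [line.strip()
        for line in conf.get("Example Feed", "skip_link_pats").splitlines()
        if line.strip()]
print(pats)
```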
Supported formats since 0.7 include plucker and epub; to get them, set
formats = epub
(or plucker, or plucker,epub to get both).
For HTML only and no additional formats, set
formats = none
Downloaded HTML will be put in ~/feeds/ (which must exist; you can specify a different location in feedme.conf).
Feedme can then convert the HTML into one of three formats: epub, plucker, or fb2. You'll need the appropriate conversion tools installed on your system: plucker for plucker format, calibre's ebook-convert for the other two. You can specify more than one format, separated by commas, in feedme.conf, or formats=none if you only need the downloaded HTML.
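The conversion step amounts to running the external tool on each saved site. A rough sketch of building a calibre ebook-convert command line for one saved story (the exact invocation feedme uses may differ; ebook-convert infers formats from the file extensions):

```python
def conversion_command(html_file, fmt):
    """Build a command list to convert saved HTML to an ebook format.
    Only epub and fb2 go through calibre; this sketch ignores plucker."""
    outfile = html_file.rsplit(".", 1)[0] + "." + fmt
    return ["ebook-convert", html_file, outfile]

cmd = conversion_command("/home/user/feeds/BBC_World_News/2.html", "epub")
print(cmd)
# You could then run it with subprocess.run(cmd), assuming calibre is installed.
```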
FeedMe can optionally convert each page to plain ASCII, eliminating accented characters and such (displaying them used to be a problem on Palm PDAs, but this shouldn't be needed for most modern devices). For this option, set ascii="yes" in feedme.conf and install my ununicode module somewhere in your Python path.
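If you're curious what that flattening involves, here is a rough stdlib approximation using unicodedata (this is not what ununicode or feedme actually does, just an illustration of the idea):

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters into base letter plus combining mark,
    then drop anything that still falls outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("café naïve résumé"))
```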
Feedme's configuration file is ~/.config/feedme/feedme.conf.
Feedme's cache is ~/.cache/feedme/feedme.dat. This file should remain relatively small if you have a sane number of feeds, but it doesn't hurt to keep an eye on it. Feedme also keeps backup cache files for about a week, named by date; you can use these to go back to an earlier state in case you lose your feeds or accidentally delete something.
~/feeds is where it stores the downloaded HTML by default (you can change this in the configuration file). Stories are downloaded as sitename/number.html, e.g. ~/feeds/BBC_World_News/2.html, and are cleaned out after save_days days (set in your feedme.conf).
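The cleanup is essentially just deleting story files whose modification time is older than save_days. A simplified sketch of that idea (not feedme's actual code):

```python
import os
import time

def clean_old_feeds(feeddir, save_days):
    """Remove saved story files older than save_days days,
    judging age by each file's modification time."""
    cutoff = time.time() - save_days * 24 * 60 * 60
    for root, _dirs, files in os.walk(feeddir):
        for fname in files:
            path = os.path.join(root, fname)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
```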
If you save to formats beyond plain HTML, there may be other directories used for the converted files; for example, plucker files are created in a separate directory. This is never cleaned out by feedme, so you'll have to prune it yourself.
When I used plucker as my feed reader, I had an alias that removed the previous day's plucks just before I ran feedme.
FeedMe's license is GPLv2 or (at your option) later. Thanks to Carla Schroder for the name suggestion!
Currently, I read the feeds I fetch with FeedMe using an Android program I wrote, FeedViewer. I'm the first to confess that it's not very well documented, though; in particular, there's no documentation on how to set things up so that FeedViewer can automatically pick up the files FeedMe generates. If you're trying to use this and can't figure it out, let me know; otherwise I may never get around to documenting it.