Decoding RFC 2047 email headers (like spam Subjects in other charsets)
Having not had much luck with spam filtering solutions like SpamAssassin, I'm forever having to add new spam filters by hand. For instance, after about the sixth time I get "President Waives Refi Requirement" or "Melt your fat! MUST WATCH this video now!" within a couple of hours, I'm pretty tired of it and don't want to see any more of them.
With mail filtering programs like procmail or maildrop, it's easy enough to match a pattern like "Subject:.*Refi Requirement" or "Subject:.*Melt your fat" and filter that message to a spam folder (or /dev/null).
But increasingly, I add patterns I'm seeing in spam messages, and yet the messages with those patterns keep coming in. Why? Because the spammers are using RFC 2047 to encode the subject into some other character set.
Here's how it works. A spammer sends a subject line that looks something like this:
Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=
Mail programs are smart enough to decode this into:
Subject: Stop Overpaying for Printer Ink
but spam filtering programs often aren't, so your "printer ink" filter
won't catch it. And if you look through your spam folder with tools like
grep to see why it didn't get caught, or to find particularly spammy
subjects that might call for a filter
(grep Subject spamfolder | sort
is pretty handy),
these encoded subjects will be incognito.
I briefly tried setting up a filter that spam-filed anything with =? in the Subject line. But that's way too broad a brush -- not all people there are legitimate reasons for using other charsets even in English language email. It's relatively rare, but it happens. And some bots, notably the Adafruit forum notification bot and the bot that sends out announcements from my alma mater, unaccountably encode the charset even when they're sending mail entirely in US ASCII.
So what's really needed is not to filter out all messages that specify a charset, but to decode the Subject so the spam filter can see it and filter it accordingly.
How? I couldn't find any ready-made tool available for Linux that could decode RFC 2047 headers; but the Python email package makes decoding a one-line task. In the Python interpreter:
$ python Python 2.7.3 (default, Aug 1 2012, 05:16:07) Type "help", "copyright", "credits" or "license" for more information. >>> import email >>> email.Header.decode_header("Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=") [('Subject:', None), ('Stop Overpaying for Printer Ink', 'utf-8')] >>>
So it's easy to write a script that can pull headers out of email messages (files) and decode them. Just look for the line starting with the header you want to match -- e.g. "Subject:" -- and pass that line to email.Header.decode_header().
Only one snag. If the subject is longer than about 20 characters, spammers will often opt to split it up into multiple groups, sometimes even in different character sets. So for example, you might see something like this, spread over multiple lines:
Subject: =?windows-1252?Q?Earn_your_degree_=97_on_your_time?= =?windows-1252?Q?_and_terms?=
The script has to handle that too. If it's reading a header, it has to check the next line, and if that line begins with whitespace, treat it as more of the header.
The resulting script, decodemail.py (on github), seems pretty handy and should be able to be plugged in to a mail filtering program.
[ 21:45 Dec 08, 2012 More programming | permalink to this entry | ]