Shallow Thoughts : tags : spam

Akkana's Musings on Open Source Computing, Science, and Nature.

Sat, 08 Dec 2012

Decoding RFC 2047 email headers (like spam Subjects in other charsets)

Having not had much luck with spam filtering solutions like SpamAssassin, I'm forever having to add new spam filters by hand. For instance, after about the sixth time I get "President Waives Refi Requirement" or "Melt your fat! MUST WATCH this video now!" within a couple of hours, I'm pretty tired of it and don't want to see any more of them.

With mail filtering programs like procmail or maildrop, it's easy enough to match a pattern like "Subject:.*Refi Requirement" or "Subject:.*Melt your fat" and filter that message to a spam folder (or /dev/null).

But increasingly, I add patterns I'm seeing in spam messages, and yet the messages with those patterns keep coming in. Why? Because the spammers are using RFC 2047 to encode the subject into some other character set.

Here's how it works. A spammer sends a subject line that looks something like this:

Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=

Mail programs are smart enough to decode this into:

Subject: Stop Overpaying for Printer Ink

but spam filtering programs often aren't, so your "printer ink" filter won't catch it. And if you look through your spam folder with tools like grep to see why it didn't get caught, or to find particularly spammy subjects that might call for a filter (grep Subject spamfolder | sort is pretty handy), these encoded subjects will be incognito.
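To make the format concrete, here is a minimal sketch that takes the encoded subject above apart by hand. An RFC 2047 "encoded word" has the shape =?charset?encoding?payload?=, where the encoding is B (base64) or Q (a quoted-printable variant). The variable names are just for illustration, and this only handles a single B-encoded word:

```python
import base64

encoded_word = "=?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?="

# Drop the leading "=?" and trailing "?=", then split the three fields.
charset, encoding, payload = encoded_word[2:-2].split("?")

# This sketch only covers the base64 case.
assert encoding.upper() == "B"

# Base64-decode the payload to bytes, then interpret them in the named charset.
subject = base64.b64decode(payload).decode(charset)
print(subject)
```

A plain-text pattern like "Printer Ink" never appears in the raw header, which is why a naive filter misses it.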

I briefly tried setting up a filter that spam-filed anything with =? in the Subject line. But that's way too broad a brush -- there are legitimate reasons for using other charsets even in English-language email. It's relatively rare, but it happens. And some bots, notably the Adafruit forum notification bot and the bot that sends out announcements from my alma mater, unaccountably encode the charset even when they're sending mail entirely in US ASCII.

So what's really needed is not to filter out all messages that specify a charset, but to decode the Subject so the spam filter can see it and filter it accordingly.

How? I couldn't find any ready-made tool available for Linux that could decode RFC 2047 headers; but the Python email package makes decoding a one-line task. In the Python interpreter:

$ python
Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> email.Header.decode_header("Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=")
[('Subject:', None), ('Stop Overpaying for Printer Ink', 'utf-8')]
>>>

So it's easy to write a script that can pull headers out of email messages (files) and decode them. Just look for the line starting with the header you want to match -- e.g. "Subject:" -- and pass that line to email.Header.decode_header().
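The session above is Python 2. In Python 3 the module is spelled email.header (lowercase), and decode_header() returns bytes for the encoded parts, so there is one more step to get readable text. A rough sketch, assuming Python 3; the helper name decode_header_value is made up for this example:

```python
from email.header import decode_header

def decode_header_value(raw):
    """Return an RFC 2047 encoded header value as readable text."""
    pieces = []
    for part, charset in decode_header(raw):
        if isinstance(part, bytes):
            try:
                part = part.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset name: fall back to ASCII rather than crash.
                part = part.decode("ascii", errors="replace")
        pieces.append(part)
    return "".join(pieces)

subject = decode_header_value(
    "=?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=")
print(subject)

# Once decoded, an ordinary case-insensitive match works again.
is_ink_spam = "printer ink" in subject.lower()
```

Headers with no encoded words come back unchanged, so the same helper can be run over every Subject line.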

Only one snag. If the subject is longer than about 20 characters, spammers will often opt to split it up into multiple groups, sometimes even in different character sets. So for example, you might see something like this, spread over multiple lines:

Subject: =?windows-1252?Q?Earn_your_degree_=97_on_your_time?=
        =?windows-1252?Q?_and_terms?=

The script has to handle that too. If it's reading a header, it has to check the next line, and if that line begins with whitespace, treat it as more of the header.

The resulting script, decodemail.py (on github), seems pretty handy and should be easy to plug into a mail filtering program.
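This is not decodemail.py itself, just a rough Python 3 sketch of the continuation-line idea described above. The function name get_header and the sample message are invented for illustration; the Subject lines are the ones from the example. It collects the header plus any following lines that begin with whitespace, then hands the result to the email package:

```python
from email.header import decode_header, make_header

def get_header(lines, name):
    """Return the raw value of the first header `name`, including
    continuation lines, or None if the header is not present."""
    prefix = name.lower() + ":"
    collected = None
    for line in lines:
        line = line.rstrip("\r\n")
        if collected is not None:
            # A line starting with whitespace continues the header.
            if line[:1] in (" ", "\t"):
                collected.append(line)
                continue
            break
        if line == "":
            break  # blank line: end of the header section
        if line.lower().startswith(prefix):
            collected = [line[len(prefix):].lstrip()]
    return "\n".join(collected) if collected is not None else None

# Hypothetical message, using the split Subject shown earlier.
message = [
    "From: someone@example.com\n",
    "Subject: =?windows-1252?Q?Earn_your_degree_=97_on_your_time?=\n",
    "        =?windows-1252?Q?_and_terms?=\n",
    "To: me@example.com\n",
    "\n",
    "Subject: this line is in the body and must be ignored\n",
]

raw = get_header(message, "Subject")
subject = str(make_header(decode_header(raw)))
print(subject)
```

The =97 byte is a dash in windows-1252, so it only comes out right if the charset named in the header is honored. In a real filter, `lines` would be an open file object for the message.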

[ 20:45 Dec 08, 2012    More programming ]

Sat, 24 Sep 2011

Headhunters: don't spam people if you want to seem credible

I suspect all technical people -- at least those with a web presence -- get headhunter spam. You know, email saying you're perfect for a job opportunity at "a large Fortune 500 company" requiring ten years' experience with technologies you've never used.

Mostly I just delete it. But this one sent me a followup -- I hadn't responded the first time, so surely I hadn't seen it and here it was again, please respond since I was perfect for it. Maybe I was just in a pissy mood that night. But look, I'm a programmer, not a DBA -- I had to look it up to verify that I knew what DBA stood for. I've never used Oracle. A "Production DBA with extensive Oracle experience" job is right out, and there's certainly nothing in my resume that would suggest that's my line of work.

So I sent a brief reply, asking,

Why do you keep sending this? Why exactly do you think I'm a DBA or an Oracle expert? Have you looked at my resume? Do you think spamming people with jobs completely unrelated to their field will get many responses or help your credibility?

I didn't expect a reply. But I got one:

I must say my credibility is most important and it's unfortunate that recruiters are thought of as less than in these regards. And, I know it is well deserved by many of them.
In fact, Linux and SQL experience is more important than Oracle in this situation and I got your email address through the Peninsula Linux Users Group site which is old info and doesn't give any information about its members' skill or experience. I only used a few addresses to experiment with to see if their info has any value. Sorry you were one of the test cases but I don't think this is spamming and apologize for any inconvenience it caused you.

[name removed], PhD

A courteous reply. But it stunned me. Harvesting names from old pages on a LUG website, then sending a rather specific job description out to all the names harvested, regardless of their skillset -- how could that possibly not be considered spam? Isn't that practically the definition of spam? And how could a recruiter expect to seem credible after sending this sort of non-targeted mass solicitation?

To technical recruiters/headhunters: if you're looking for good technical candidates, it does not help your case to spam people with jobs that show you haven't read or understood their resume. All it does is get you a reputation as a spammer. Then if you do, some day, have a job that's relevant, you'll already have lost all credibility.

[ 20:30 Sep 24, 2011    More tech ]

Tue, 13 Apr 2010

"Joe-job" spam (forged From addresses)

I'm in a Yahoo group where a spammer just posted a message that looked like it was coming from someone in the group, so Yahoo allowed it.

The list owner posted a message about using good passwords so your account isn't hacked since that causes problems for everyone.

Of course, that's good advice and using good passwords is always a good idea. But I thought this sounded more like a Joe-job spam, in which the spammer forges the From address to look like it's coming from someone else.

Normal users encounter this in two ways:

  1. You start getting tons of bounce messages that look as though you sent spam to hundreds of people and they're refusing it.
  2. You see spam that looks like it came from a friend of yours, or spam on a mailing list that looks like it came from a legitimate member of that list.

Since this sort of attack is so common, I felt the victim didn't deserve being harangued about not having set up a good password. So I posted a short note to the list explaining about Joe-jobs. But to make the point, I forged the From address of the list owner. Indeed, it got through Yahoo and out to the list just fine:

[ ... ] the spam probably wasn't from a bad password. It was probably just a spammer forging the header to look like it's from a legitimate user. It's called a "joe-job": http://en.wikipedia.org/wiki/Joe-job

To illustrate, I've changed the From address on this message to look like it's coming from Adam. I have not hacked [listowner]'s account or guessed his password or anything else. If this works, and looks like it came from [listowner], then the spam could have been done the same way -- and there's no need to blame the owner of the account, or accuse them of having a bad password.

Why does this work? Why doesn't Yahoo just block messages from user@isp.com if the mail doesn't come from isp.com?

They can't! Many, many people don't send mail from the domains in their email addresses. In effect, people forge their From header all the time. Here are some examples:

If mailing lists rejected posts in all these cases, people would be pretty annoyed. So they don't. But that means that now and then, some Joe-job spam gets through to mailing lists. Unfortunately.

(Update: Based on the mail headers, the message that inspired this may very well have been a hacked-password case after all. But I found that a lot of people didn't know about Joe-jobbing, so I thought this was worth writing up anyway.)

[ 21:28 Apr 13, 2010    More tech/email ]

Thu, 07 May 2009

Pruning those huge Spamassassin files

During a server backup, Dave complained that my .spamassassin directory was taking up 87MB. I had to agree, that seemed a bit excessive.

The only two large files were auto-whitelist at 42M and bayes_seen at 41M. Apparently these never get pruned by spamassassin.

Unfortunately, these are binary files, so you can't just edit them and remove the early stuff, and spamassassin doesn't seem to have any documentation on how to prune its data files. A thread on the Spamassassin Users list on managing Spamassassin data says it's okay to delete bayes_seen and it will be regenerated.

For pruning auto-whitelist, that same post suggests a program called check-whitelist that is only available in a spamassassin source tarball -- it's not installed as part of distro packages. Run this with --clean. But a search on the spamassassin.com wiki turns up an entry on AutoWhitelist that says you should use tools/sa-awlUtil instead (it doesn't say how to run it or where to get it -- presumably download a source tarball and then RTFSC -- read the source code?)

Really, I'm not sure auto whitelisting is such a good idea anyway, especially auto whitelist entries from several years ago, so I opted for a simpler solution: removing the auto-whitelist file at the same time that I removed bayes_seen. Indeed, both files were immediately generated as new mail came in, but they were now much smaller.

I've run for a few weeks since doing that, and I'm not noticing any difference in either the number of false positives or false negatives. (Both are, unfortunately, large enough to be noticeable, but that was true before the change as well.)

[ 19:38 May 07, 2009    More tech/email ]

Wed, 12 Nov 2008

Spamassassin false positives: obsolete rules on Etch

I checked my Spam Assassin "probably" folder for the first time in too long, and discovered that I was getting tons of false positives, perfectly legitimate messages that were being filed as spam.

A little analysis of the X-Spam-Status: headers showed that all of the misfiled messages (and lots of messages that didn't quite make it over the threshold) were hitting a rule called DNS_FROM_SECURITYSAGE.
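That sort of analysis can be sketched in a few lines of Python. This assumes the usual X-Spam-Status layout, where the rules that fired appear as a comma-separated list after tests=. The sample headers, scores, and version numbers below are invented, and rules_hit is a made-up helper name:

```python
import re
from collections import Counter

def rules_hit(status_header):
    """Return the rule names listed after tests= in an X-Spam-Status header."""
    # Rule names are upper case, so the match stops at the next
    # lower-case field (such as autolearn=), even across folded lines.
    m = re.search(r"tests=([A-Z0-9_,\s]*)", status_header)
    if not m:
        return []
    return [rule.strip() for rule in m.group(1).split(",") if rule.strip()]

# Invented sample headers, folded the way they might appear in a mail file.
sample_headers = [
    "X-Spam-Status: Yes, score=6.1 required=5.0 tests=DNS_FROM_SECURITYSAGE,\n"
    "\tFORGED_RCVD_HELO,HTML_MESSAGE autolearn=no version=3.1.7",
    "X-Spam-Status: No, score=3.2 required=5.0 tests=DNS_FROM_SECURITYSAGE,\n"
    "\tUNPARSEABLE_RELAY autolearn=no version=3.1.7",
]

counts = Counter(rule for h in sample_headers for rule in rules_hit(h))
for rule, n in counts.most_common():
    print(n, rule)
```

A rule that shows up in nearly every misfiled message, as DNS_FROM_SECURITYSAGE did here, floats to the top of the list.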

It turned out that this rule is obsolete and has been removed from Spam Assassin, but it hasn't yet been removed from Debian, at least not from Etch.

So I filed a Debian bug. Or at least I think I did -- I got an email acknowledgement from submit@bugs.debian.org, but it didn't include a bug number, and Debian's HyperEstraier-based search engine, linked off the bug page, doesn't find it (I used reportbug).

Anyway, if you're getting lots of SECURITYSAGE false hits, edit /usr/share/spamassassin/20_dnsbl_tests.cf and comment out the lines for DNS_FROM_SECURITYSAGE and, while you're at it, the lines for RCVD_IN_DSBL, which is also obsolete. Just to be safe, you might also want to add
score DNS_FROM_SECURITYSAGE 0
in your .spamassassin/user_prefs (or equivalent systemwide file) as well.

Now if only I could figure out why it was setting FORGED_RCVD_HELO and UNPARSEABLE_RELAY on messages from what seem to be perfectly legitimate senders ...

[ 21:54 Nov 12, 2008    More linux ]
