Shallow Thoughts : tags : spam
Akkana's Musings on Open Source Computing, Science, and Nature.
Sat, 08 Dec 2012
Having not had much luck with spam filtering solutions like SpamAssassin,
I'm forever having to add new spam filters by hand. For instance, after
about the sixth time I get "President Waives Refi Requirement"
or "Melt your fat! MUST WATCH this video now!" within a couple of
hours, I'm pretty tired of it and don't want to see any more of them.
With mail filtering programs like procmail or maildrop, it's easy
enough to match a pattern like "Subject:.*Refi Requirement" or
"Subject:.*Melt your fat" and filter that message to a spam folder
But increasingly, I add patterns I'm seeing in spam messages, and yet
the messages with those patterns keep coming in. Why? Because the
spammers are using RFC 2047
to encode the subject into some other character set.
Here's how it works. A spammer sends a subject line that looks
something like this:
Mail programs are smart enough to decode this into:
Subject: Stop Overpaying for Printer Ink
but spam filtering programs often aren't, so your "printer ink" filter
won't catch it. And if you look through your spam folder with tools like
grep to see why it didn't get caught, or to find particularly spammy
subjects that might call for a filter
grep Subject spamfolder | sort is pretty handy),
these encoded subjects will be incognito.
I briefly tried setting up a filter that spam-filed anything with =? in the
Subject line. But that's way too broad a brush -- not all people
there are legitimate reasons for using other charsets even in English
language email. It's relatively rare, but it happens. And some bots,
notably the Adafruit forum notification bot
and the bot that sends out announcements from my alma mater,
unaccountably encode the charset even when they're sending mail
entirely in US ASCII.
So what's really needed is not to filter out all messages that specify
a charset, but to decode the Subject so the spam filter can see it and
filter it accordingly.
How? I couldn't find any ready-made tool
available for Linux that could decode RFC 2047 headers; but the Python
email package makes decoding a one-line task.
In the Python interpreter:
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> email.Header.decode_header("Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=")
[('Subject:', None), ('Stop Overpaying for Printer Ink', 'utf-8')]
So it's easy to write a script that can pull headers out of email
messages (files) and decode them. Just look for the line starting with
the header you want to match -- e.g. "Subject:" -- and pass that line
Only one snag. If the subject is longer than about 20 characters,
spammers will often opt to split it up into multiple groups, sometimes
even in different character sets. So for example, you might see
something like this, spread over multiple lines:
The script has to handle that too. If it's reading a header, it has to
check the next line, and if that line begins with whitespace, treat it
as more of the header.
The resulting script, decodemail.py
(on github), seems pretty handy and should be able to be plugged in
to a mail filtering program.
[ 20:45 Dec 08, 2012
More programming |
permalink to this entry |
Sat, 24 Sep 2011
I suspect all technical people -- at least those with a web presence
-- get headhunter spam. You know, email saying you're perfect for a
job opportunity at "a large Fortune 500 company" requiring ten years'
experience with technologies you've never used.
Mostly I just delete it. But this one sent me a followup --
I hadn't responded the first time, so surely I hadn't seen it and
here it was again, please respond since I was perfect for it.
Maybe I was just in a pissy mood that night. But
look, I'm a programmer, not a DBA -- I had to look it up to verify
that I knew what DBA stood for. I've never used Oracle.
A "Production DBA with extensive Oracle experience" job is right out,
and there's certainly nothing in my resume that would suggest that's
my line of work.
So I sent a brief reply, asking,
Why do you keep sending this?
Why exactly do you think I'm a DBA or an Oracle expert?
Have you looked at my resume? Do you think spamming people
with jobs completely unrelated to their field will get many
responses or help your credibility?
I didn't expect a reply. But I got one:
I must say my credibility is most important and it's unfortunate
that recruiters are thought of as less than in these regards. And, I know it
is well deserved by many of them.
In fact, Linux and SQL experience is more important than Oracle in this
situation and I got your email address through the Peninsula Linux Users
Group site which is old info and doesn't give any information about its
members' skill or experience. I only used a few addresses to experiment with
to see if their info has any value. Sorry you were one of the test cases but
I don't think this is spamming and apologize for any inconvenience it caused
[name removed], PhD
A courteous reply. But it stunned me.
Harvesting names from old pages on a LUG website, then sending a
rather specific job description out to all the names harvested,
regardless of their skillset -- how could that possibly not be
considered spam? isn't that practically the definition of spam?
And how could a recruiter expect to seem credible after sending this
sort of non-targeted mass solicitation?
To technical recruiters/headhunters: if you're looking for
good technical candidates, it does not help your case to spam people
with jobs that show you haven't read or understood their resume.
All it does is get you a reputation as a spammer. Then if you do, some
day, have a job that's relevant, you'll already have lost all credibility.
[ 20:30 Sep 24, 2011
More tech |
permalink to this entry |
Tue, 13 Apr 2010
I'm in a Yahoo group where a spammer just posted a message that
looked like it was coming from someone in the group, so Yahoo allowed it.
The list owner posted a message about using good passwords so your
account isn't hacked since that causes problems for everyone.
Of course, that's good advice and using good passwords is always a good idea.
But I though this sounded more like a
in which the spammer forges the From address to look like it's coming
from someone else.
Normal users encounter this in two ways:
- You start getting tons of bounce messages that look as though you
sent spam to hundreds of people and they're refusing it.
- You see spam that looks like it came from a friend of yours,
or spam on a mailing list that looks like it came from a
legitimate member of that list.
Since this sort of attack is so common, I felt the victim didn't
deserve being harangued about not having set up a good password.
So I posted a short note to the list explaining about Joe-jobs.
But to make the point, I forged the From address of the list owner.
Indeed, it got through Yahoo and out to the list just fine:
[ ... ] the spam probably
wasn't from a bad password. It was probably just a spammer forging
the header to look like it's from a legitimate user.
It's called a "joe-job": http://en.wikipedia.org/wiki/Joe-job
To illustrate, I've changed the From address on this message to
look like it's coming from Adam. I have not hacked [listowner]'s account
or guessed his password or anything else. If this works, and looks
like it came from [listowner], then the spam could have been done the same
way -- and there's no need to blame the owner of the account, or
accuse them of having a bad password.
Why does this work? Why doesn't Yahoo just block messages from
firstname.lastname@example.org if the mail doesn't come from isp.com?
They can't! Many, many people don't send mail from the domains in their
email addresses. In effect, people forge their From header all the time.
Here are some examples:
- You're using email@example.com, but you're using Thunderbird or Eudora
or Evolution or something to read and send mail from home.
- You're on your computer at home, but you're sending work-related
email from your work account.
- You're on your laptop, using Thunderbird or whatever, mailing
from firstname.lastname@example.org, but you're at a friend's house or a hotel or
conference or somewhere.
- You're sending mail from a public terminal somewhere (eek, do
people really type their mail info in to these things?)
- You're reading and sending mail from a mobile phone.
- You're sending mail from your own domain, email@example.com,
but you're at home or somewhere else other than wherever mydomain.com
If mailing lists rejected posts in all these cases, people would be
pretty annoyed. So they don't. But that means that now and then, some
Joe-job spam gets through to mailing lists. Unfortunately.
(Update: The message that inspired this may very
well have been a hacked password after all case, based on the mail
headers. But I found that a lot of people didn't know about
Joe-jobbing, so I thought this was worth writing up anyway.)
[ 21:28 Apr 13, 2010
More tech/email |
permalink to this entry |
Thu, 07 May 2009
During a server backup, Dave complained that my .spamassasin directory
was taking up 87Mb. I had to agree, that seemed a bit excessive.
The only two large files were auto-whitelist at 42M and bayes_seen at 41M.
Apparently these never get pruned by spamassassin.
Unfortunately, these are binary files, so you can't just edit them
and remove the early stuff, and spamassassin doesn't seem to have any
documentation on how to prune their data files.
A thread on the Spamassassin Users list on
Spamassassin data says it's okay to delete bayes_seen
and it will be regenerated.
For pruning auto-whitelist, that same post suggests a program called
check-whitelist that is only available in a spamassassin source tarball
-- it's not installed as part of distro packages. Run this with
But a search on the spamassassin.com wiki turns up an entry on
that says you should use tools/sa-awlUtil instead (it doesn't
say how to run it or where to get it -- presumably download a source
tarball and then RTFSC -- read the source code?)
Really, I'm not sure auto whitelisting is such a good idea anyway,
especially auto whitelist entries from several years ago,
so I opted for a simpler solution: removing the auto-whitelist file
at the same time that I removed bayes_seen. Indeed, both files were
immediately generated as new mail came in, but they were now much smaller.
I've run for a few weeks since doing that, and I'm not noticing any
difference in either the number of false positives or false
negatives. (Both are, unfortuantely, large enough to be noticable,
but that was true before the change as well.)
[ 19:38 May 07, 2009
More tech/email |
permalink to this entry |
Wed, 12 Nov 2008
I checked my Spam Assassin "probably" folder for the first time in too
long, and discovered that I was getting tons of false positives,
perfectly legitimate messages that were being filed as spam.
A little analysis of the X-Spam-Status: headers showed that all of
the misfiled messages (and lots of messages that didn't quite make it
over the threshold) were hitting a rule called DNS_FROM_SECURITYSAGE.
It turned out that this rule
obsolete and has been removed from Spam Assassin, but it
yet been removed from Debian, at least not from Etch.
So I filed a Debian bug. Or at least I think I did -- I got an
email acknowledgement from firstname.lastname@example.org but it didn't
include a bug number and Debian's
HyperEstraier based search engine
linked off the bug page
doesn't find it (I used reportbug).
Anyway, if you're getting lots of SECURITYSAGE false hits, edit
/usr/share/spamassassin/20_dnsbl_tests.cf and comment out the
lines for DNS_FROM_SECURITYSAGE and, while you're at it, the lines
for RCVD_IN_DSBL, which is also
obsolete. Just to be safe, you might also want to add
score DNS_FROM_SECURITYSAGE 0
in your .spamassassin/user_prefs (or equivalent systemwide file) as well.
Now if only I could figure out why it was setting
FORGED_RCVD_HELO and UNPARSEABLE_RELAY on messages from what seems
to be perfectly legitimate senders ...
[ 21:54 Nov 12, 2008
More linux |
permalink to this entry |