Shallow Thoughts : tags : spam

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sun, 12 Dec 2021

Battling Signup Spam on the Bill Tracker

I've spent a lot of the past week battling Russian spammers on the New Mexico Bill Tracker.

The New Mexico legislature just began a special session to define the new voting districts, which happens every 10 years after the census. When a new legislative session starts, the BillTracker usually needs some hand-holding to make sure it's tracking the new session. (I've been working on code to make it notice new sessions automatically, but it's not fully working yet.) So when the session started, I checked the log files...

and found them full of Russian spam.

Specifically, what was happening was that a bot was going to my new user registration page and creating new accounts where the username was a paragraph of Cyrillic spam.
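
One way to screen out registrations like these is to reject usernames that look like a paragraph rather than a name. The sketch below is only an assumption about how such a check might look, not the Bill Tracker's actual code: the names is_suspicious_username and cyrillic_fraction, the 40-character limit, and the 50% threshold are all invented for illustration.

```python
import unicodedata

MAX_USERNAME_LEN = 40   # hypothetical limit, not taken from the Bill Tracker


def cyrillic_fraction(text):
    """Return the fraction of the letters in text that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = [ch for ch in letters
                if unicodedata.name(ch, "").startswith("CYRILLIC")]
    return len(cyrillic) / len(letters)


def is_suspicious_username(name):
    """Flag usernames that are too long, span lines, or are mostly Cyrillic."""
    if len(name) > MAX_USERNAME_LEN:
        return True
    if "\n" in name or "\r" in name:
        return True
    return cyrillic_fraction(name) > 0.5
```

A script-based test like this would also turn away legitimate users who write their names in Cyrillic, so it only makes sense for a site whose audience is known not to need it; the length check alone is the less risky part.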

Read more ...

[ 18:50 Dec 12, 2021    More tech/web | permalink to this entry | ]

Sun, 02 Jun 2013

SEO Spam injection on blogs (or: a good argument for noscript)

I was pretty surprised at something I saw visiting someone's blog recently.

[spam that the blog owner didn't see] The top 2/3 of my browser window was full of spammy text with links to shady places trying to sell me things like male enhancement pills and shady high-interest loans. Only below that was the blog header and content. (I've edited out identifying details.)

Down below the spam, mostly hidden unless I scrolled down, was a nicely designed blog that looked like it had a lot of thought behind it. It was pretty clear the blog owner had no idea the spam was there.

Now, I often see weird things on websites, because I run Firefox with noscript, with Javascript off by default. Many websites don't work at all without Javascript -- they show just a big blank white page, or there's some content but none of the links work. (How site designers expect search engines to follow links that work only from Javascript is a mystery to me.)

So I enabled Javascript and reloaded the site. Sure enough: it looked perfectly fine: no spammy links anywhere.

Pretty clever, eh? Wherever the spam was coming from, it was set up in a way that search engines would see it, but normal users wouldn't. Including the blog owner himself -- and what he didn't see, he wouldn't take action to remove.

Which meant that it was an SEO tactic. Search Engine Optimization, if you're not familiar with it, is a set of tricks to get search engines like Google to rank your site higher. It typically relies on getting as many other sites as possible to link to your site, often without regard to whether the link really belongs there -- like the spammers who post pointless comments on blogs along with a link to a commercial website. Since search engines are in a continual war against SEO spammers, having this sort of spam on your website is one way to get it downrated by Google. The spammers don't expect anyone to click on the links from this blog; they want the links to show up in Google searches where people will click on them.

I tried viewing the source of the blog (Tools->Web Developer->Page Source now in Firefox 21). I found this (deep breath):

<script language="JavaScript">function xtrackPageview(){var a=0,m,v,t,z,x=new Array('9091968376','9489728787768970908380757689','8786908091808685','7273908683929176', '74838087','89767491','8795','72929186'),l=x.length;while(++a<=l){m=x[l-a]; t=z='';for(v=0;v<m.length;){t+=m.charAt(v++);if(t.length==2){z+=String.fromCharCode(parseInt(t)+33-l);t='';}}x[l-a]=z;}document.write('<'+x[0]+'>.'+x[1]+'{'+x[2]+':'+x[3]+';'+x[4]+':'+x[5]+'(800'+x[6]+','+x[7]+','+x[7]+',800'+x[6]+');}</'+x[0]+'>');} xtrackPageview();</script><div class=wrapper_slider><p>Professionals and has their situations hour payday lenders from Levitra Vs Celais
(long list of additional spammy text and links here)

Quite the obfuscated code! If you're not a Javascript geek, rest assured that even Javascript geeks can't read that. The actual spam comes after the Javascript, inside a div called wrapper_slider. Somehow that Javascript mess must be hiding wrapper_slider from view.

Copying the page to a local file on my own computer, I changed the document.write to an alert, and discovered that the Javascript produces this:

<style>.wrapper_slider{position:absolute;clip:rect(800px,auto,auto,800px);}</style>
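
The same decoding can be reproduced outside the browser. The Python sketch below mirrors what the injected script does: it reads each string two digits at a time, adds 33 minus the array length (8, so a shift of 25) to each number, and treats the result as a character code. The names ENCODED, decode_word and build_style are made up for this illustration; the digit strings and the concatenation are copied from the script above.

```python
# The eight digit strings copied from the injected script's Array(...).
ENCODED = [
    '9091968376', '9489728787768970908380757689', '8786908091808685',
    '7273908683929176', '74838087', '89767491', '8795', '72929186',
]


def decode_word(digits, shift):
    """Turn each pair of digits into a number, add shift, and build a string."""
    # A trailing unpaired digit is ignored, as in the original loop.
    return ''.join(chr(int(digits[i:i + 2]) + shift)
                   for i in range(0, len(digits) - 1, 2))


def build_style(encoded):
    """Rebuild the string that the script passes to document.write()."""
    # The script adds 33 - l to every code, where l is the array length.
    shift = 33 - len(encoded)
    x = [decode_word(word, shift) for word in encoded]
    # Same concatenation as the document.write() call in the original.
    return ('<' + x[0] + '>.' + x[1] + '{' + x[2] + ':' + x[3] + ';'
            + x[4] + ':' + x[5] + '(800' + x[6] + ',' + x[7] + ','
            + x[7] + ',800' + x[6] + ');}</' + x[0] + '>')


print(build_style(ENCODED))
```

Running it prints the same style rule shown above, and the individual decoded words are style, wrapper_slider, position, absolute, clip, rect, px and auto.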

Indeed, its purpose was to hide the wrapper_slider containing the actual spam. Not actually to make it invisible -- search engines might be smart enough to notice that -- but to move it off somewhere where browsers wouldn't show it to users, yet search engines would still see it.

I had to look up the arguments to the CSS clip property. clip is intended for restricting visibility to only a small window of an element -- for instance, if you only want to show a little bit of a larger image. Those rect arguments are top, right, bottom, and left. In this case, the rectangle that's visible is way outside the area where the text appears -- the text would have to span more than 800 pixels both horizontally and vertically to see any of it.
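
To make that arithmetic concrete, here is a deliberately simplified model of the clip rule in Python. It assumes the rect offsets are measured from the element's top left corner and that auto means the element's own edge; the function name visible_area and the sample sizes are invented for the example, and real layout details such as overflow are ignored.

```python
def visible_area(width, height, top=800, left=800):
    """Area in square pixels left visible by clip: rect(top, auto, auto, left).

    With right and bottom set to auto, the visible window runs from
    (left, top) to the element's own bottom right corner.
    """
    visible_width = max(0, width - left)
    visible_height = max(0, height - top)
    return visible_width * visible_height


print(visible_area(600, 400))     # a box smaller than 800px: 0
print(visible_area(1000, 1000))   # bigger than 800px both ways: 40000
```

Only a box larger than 800 pixels in both directions leaves anything showing, which is why an ordinary block of spam text disappears entirely.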

Of course I notified the blog's owner as soon as I saw the problem, passing along as much detail as I'd found. He looked into it, and concluded that he'd been hacked. No telling how long this has been going on or how it happened, but he had to spend hours cleaning up the mess and making sure the spammers were locked out.

I wasn't able to find much about this on the web. Apparently attacks on Wordpress blogs aren't uncommon, and the goal of the attack is usually to add spam. The most common term I found for it was "blackhat SEO spam injection".

But the few pages I saw all described immediately visible spam. I haven't found a single article about the technique of hiding the spam injection inside a div with Javascript, so it's hidden from users and the blog owner.

I'm puzzled by not being able to find anything. Can this attack possibly be new? Or am I just searching for the wrong keywords?

Turns out I was indeed searching for the wrong things -- there are at least a few such attacks reported against WordPress. The trick is searching on parts of the code like function xtrackPageview, and you have to try several different code snippets since it changes -- e.g. searching on wrapper_slider doesn't find anything.

Either way, it's something all site owners should keep in mind, whether you have a large website or just a small blog. Just as it's good to visit your site periodically with a browser other than your usual one, it's also a good idea to check now and then with Javascript disabled.

You might find something you really need to know about.
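
A check like this can be partly automated. The sketch below is only a rough heuristic, not a tool from this article: it reads a page's raw HTML (what a browser with Javascript disabled, or a search engine, receives) and lists inline scripts that both call document.write and build strings with String.fromCharCode, the combination used in the injected code above. The class and function names are hypothetical, and since the injected code changes from site to site, a clean result proves nothing.

```python
from html.parser import HTMLParser

# Markers taken from the injected script quoted earlier in this post.
MARKERS = ("document.write", "String.fromCharCode")


class InlineScriptCollector(HTMLParser):
    """Collect the text of every inline <script> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several pieces, so append each one.
        if self._in_script:
            self.scripts[-1] += data


def suspicious_scripts(html):
    """Return the inline scripts that contain every marker in MARKERS."""
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    return [script for script in collector.scripts
            if all(marker in script for marker in MARKERS)]
```

Feeding it a saved copy of a page, for example suspicious_scripts(open("page.html").read()), gives a short list of scripts worth reading by hand.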

[ 19:59 Jun 02, 2013    More tech/web | permalink to this entry | ]

Sat, 08 Dec 2012

Decoding RFC 2047 email headers (like spam Subjects in other charsets)

Having not had much luck with spam filtering solutions like SpamAssassin, I'm forever having to add new spam filters by hand. For instance, after about the sixth time I get "President Waives Refi Requirement" or "Melt your fat! MUST WATCH this video now!" within a couple of hours, I'm pretty tired of it and don't want to see any more of them.

With mail filtering programs like procmail or maildrop, it's easy enough to match a pattern like "Subject:.*Refi Requirement" or "Subject:.*Melt your fat" and filter that message to a spam folder (or /dev/null).

But increasingly, I add patterns I'm seeing in spam messages, and yet the messages with those patterns keep coming in. Why? Because the spammers are using RFC 2047 to encode the subject into some other character set.

Here's how it works. A spammer sends a subject line that looks something like this:

Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=

Mail programs are smart enough to decode this into:

Subject: Stop Overpaying for Printer Ink

but spam filtering programs often aren't, so your "printer ink" filter won't catch it. And if you look through your spam folder with tools like grep to see why it didn't get caught, or to find particularly spammy subjects that might call for a filter (grep Subject spamfolder | sort is pretty handy), these encoded subjects will be incognito.

I briefly tried setting up a filter that spam-filed anything with =? in the Subject line. But that's way too broad a brush -- there are legitimate reasons for using other charsets even in English language email. It's relatively rare, but it happens. And some bots, notably the Adafruit forum notification bot and the bot that sends out announcements from my alma mater, unaccountably encode the charset even when they're sending mail entirely in US ASCII.

So what's really needed is not to filter out all messages that specify a charset, but to decode the Subject so the spam filter can see it and filter it accordingly.

How? I couldn't find any ready-made tool available for Linux that could decode RFC 2047 headers; but the Python email package makes decoding a one-line task. In the Python interpreter:

$ python
Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> email.Header.decode_header("Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=")
[('Subject:', None), ('Stop Overpaying for Printer Ink', 'utf-8')]
>>>

So it's easy to write a script that can pull headers out of email messages (files) and decode them. Just look for the line starting with the header you want to match -- e.g. "Subject:" -- and pass that line to email.Header.decode_header().

Only one snag. If the subject is longer than about 20 characters, spammers will often opt to split it up into multiple groups, sometimes even in different character sets. So for example, you might see something like this, spread over multiple lines:

Subject: =?windows-1252?Q?Earn_your_degree_=97_on_your_time?=
        =?windows-1252?Q?_and_terms?=

The script has to handle that too. If it's reading a header, it has to check the next line, and if that line begins with whitespace, treat it as more of the header.

The resulting script, decodemail.py (on github), seems pretty handy and should be able to be plugged in to a mail filtering program.
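
For readers on Python 3, where the module is spelled email.header rather than email.Header, the approach described above might look roughly like this. It is a minimal sketch and not the decodemail.py script itself: the function names raw_header and decoded_header and the sample message are made up for illustration. It finds the named header, keeps any continuation lines that begin with whitespace, and lets decode_header and make_header do the decoding.

```python
from email.header import decode_header, make_header


def raw_header(lines, name):
    """Return the raw value of the first header called name, or None.

    Continuation lines (those starting with a space or tab) are kept,
    joined with newlines, so multi-line encoded subjects stay together.
    """
    prefix = name.lower() + ":"
    collected = None
    for line in lines:
        if collected is not None:
            if line[:1] in (" ", "\t"):
                collected.append(line.rstrip("\r\n"))
                continue
            break                      # the next header starts here
        if not line.strip():
            break                      # blank line: end of the headers
        if line.lower().startswith(prefix):
            collected = [line[len(prefix):].strip()]
    return None if collected is None else "\n".join(collected)


def decoded_header(lines, name):
    """Return the named header with any RFC 2047 encoded words decoded."""
    raw = raw_header(lines, name)
    if raw is None:
        return None
    return str(make_header(decode_header(raw)))


# A made-up message using the encoded subject from the example above.
message = [
    "Subject: =?utf-8?B?U3RvcCBPdmVycGF5aW5nIGZvciBQcmludGVyIEluaw==?=\n",
    "To: someone@example.com\n",
    "\n",
    "body text\n",
]
print(decoded_header(message, "Subject"))   # Stop Overpaying for Printer Ink
```

The decoded string can then be matched against ordinary patterns such as "Printer Ink", which is the point of the exercise.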

[ 21:45 Dec 08, 2012    More programming | permalink to this entry | ]

Sat, 24 Sep 2011

Headhunters: don't spam people if you want to seem credible

I suspect all technical people -- at least those with a web presence -- get headhunter spam. You know, email saying you're perfect for a job opportunity at "a large Fortune 500 company" requiring ten years' experience with technologies you've never used.

Mostly I just delete it. But this one sent me a followup -- I hadn't responded the first time, so surely I hadn't seen it and here it was again, please respond since I was perfect for it. Maybe I was just in a pissy mood that night. But look, I'm a programmer, not a DBA -- I had to look it up to verify that I knew what DBA stood for. I've never used Oracle. A "Production DBA with extensive Oracle experience" job is right out, and there's certainly nothing in my resume that would suggest that's my line of work.

So I sent a brief reply, asking,

Why do you keep sending this? Why exactly do you think I'm a DBA or an Oracle expert? Have you looked at my resume? Do you think spamming people with jobs completely unrelated to their field will get many responses or help your credibility?

I didn't expect a reply. But I got one:

I must say my credibility is most important and it's unfortunate that recruiters are thought of as less than in these regards. And, I know it is well deserved by many of them.
In fact, Linux and SQL experience is more important than Oracle in this situation and I got your email address through the Peninsula Linux Users Group site which is old info and doesn't give any information about its members' skill or experience. I only used a few addresses to experiment with to see if their info has any value. Sorry you were one of the test cases but I don't think this is spamming and apologize for any inconvenience it caused you.

[name removed], PhD

A courteous reply. But it stunned me. Harvesting names from old pages on a LUG website, then sending a rather specific job description out to all the names harvested, regardless of their skillset -- how could that possibly not be considered spam? Isn't that practically the definition of spam? And how could a recruiter expect to seem credible after sending this sort of non-targeted mass solicitation?

To technical recruiters/headhunters: if you're looking for good technical candidates, it does not help your case to spam people with jobs that show you haven't read or understood their resume. All it does is get you a reputation as a spammer. Then if you do, some day, have a job that's relevant, you'll already have lost all credibility.

[ 21:30 Sep 24, 2011    More tech | permalink to this entry | ]

Tue, 13 Apr 2010

"Joe-job" spam (forged From addresses)

I'm in a Yahoo group where a spammer just posted a message that looked like it was coming from someone in the group, so Yahoo allowed it.

The list owner posted a message about using good passwords so your account isn't hacked since that causes problems for everyone.

Of course, that's good advice and using good passwords is always a good idea. But I thought this sounded more like a Joe-job spam, in which the spammer forges the From address to look like it's coming from someone else.

Normal users encounter this in two ways:

  1. You start getting tons of bounce messages that look as though you sent spam to hundreds of people and they're refusing it.
  2. You see spam that looks like it came from a friend of yours, or spam on a mailing list that looks like it came from a legitimate member of that list.

Since this sort of attack is so common, I felt the victim didn't deserve being harangued about not having set up a good password. So I posted a short note to the list explaining about Joe-jobs. But to make the point, I forged the From address of the list owner. Indeed, it got through Yahoo and out to the list just fine:

[ ... ] the spam probably wasn't from a bad password. It was probably just a spammer forging the header to look like it's from a legitimate user. It's called a "joe-job": http://en.wikipedia.org/wiki/Joe-job

To illustrate, I've changed the From address on this message to look like it's coming from Adam. I have not hacked [listowner]'s account or guessed his password or anything else. If this works, and looks like it came from [listowner], then the spam could have been done the same way -- and there's no need to blame the owner of the account, or accuse them of having a bad password.

Why does this work? Why doesn't Yahoo just block messages from user@isp.com if the mail doesn't come from isp.com?

They can't! Many, many people don't send mail from the domains in their email addresses. In effect, people forge their From header all the time. Here are some examples:

If mailing lists rejected posts in all these cases, people would be pretty annoyed. So they don't. But that means that now and then, some Joe-job spam gets through to mailing lists. Unfortunately.
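
To see why a simple "does the sending host match the From domain" rule falls short, here is a small Python sketch. The function names and the host names mail.isp.com and smtp.example.net are hypothetical, and this is not how Yahoo or any real list server decides anything; it only shows that the From header is ordinary text and that a naive comparison would reject mail sent through a different outgoing server.

```python
from email.message import EmailMessage
from email.utils import parseaddr


def from_domain(msg):
    """Return the domain of the address in the From header, lowercased."""
    _, address = parseaddr(str(msg["From"]))
    return address.rpartition("@")[2].lower()


def naive_origin_check(msg, sending_host):
    """True only if sending_host is the From domain or one of its subdomains."""
    domain = from_domain(msg)
    host = sending_host.lower()
    return host == domain or host.endswith("." + domain)


# The From header is plain text chosen by whoever composes the message.
msg = EmailMessage()
msg["From"] = "Some User <user@isp.com>"
msg["To"] = "list@example.org"
msg["Subject"] = "Hello from my other mail server"
msg.set_content("A perfectly legitimate post.")

print(naive_origin_check(msg, "mail.isp.com"))       # True
print(naive_origin_check(msg, "smtp.example.net"))   # False, yet not necessarily forged
```

The second result is the problem: the check cannot tell a forger from someone who simply sends mail through a server outside the domain in their address.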

(Update: The message that inspired this may very well have been a hacked-password case after all, based on the mail headers. But I found that a lot of people didn't know about Joe-jobbing, so I thought this was worth writing up anyway.)

[ 22:28 Apr 13, 2010    More tech/email | permalink to this entry | ]

Thu, 07 May 2009

Pruning those huge Spamassassin files

During a server backup, Dave complained that my .spamassassin directory was taking up 87MB. I had to agree, that seemed a bit excessive.

The only two large files were auto-whitelist at 42M and bayes_seen at 41M. Apparently these never get pruned by spamassassin.

Unfortunately, these are binary files, so you can't just edit them and remove the early stuff, and spamassassin doesn't seem to have any documentation on how to prune its data files. A thread on the Spamassassin Users list on managing Spamassassin data says it's okay to delete bayes_seen and it will be regenerated.

For pruning auto-whitelist, that same post suggests a program called check-whitelist that is only available in a spamassassin source tarball -- it's not installed as part of distro packages. Run this with --clean. But a search on the spamassassin.com wiki turns up an entry on AutoWhitelist that says you should use tools/sa-awlUtil instead (it doesn't say how to run it or where to get it -- presumably download a source tarball and then RTFSC -- read the source code?).

Really, I'm not sure auto whitelisting is such a good idea anyway, especially auto whitelist entries from several years ago, so I opted for a simpler solution: removing the auto-whitelist file at the same time that I removed bayes_seen. Indeed, both files were immediately regenerated as new mail came in, but they were now much smaller.
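
The removal itself is just deleting two files, but for anyone who wants to see the sizes first, a small Python sketch follows. The function name prune_spamassassin and its dry_run parameter are invented for this example; it assumes the two file names mentioned above and defaults to reporting only, deleting nothing unless asked.

```python
from pathlib import Path

# The two files discussed above; both are rebuilt as new mail arrives.
PRUNABLE = ("auto-whitelist", "bayes_seen")


def prune_spamassassin(directory, dry_run=True):
    """Report the size of each prunable file and delete it unless dry_run.

    Returns a list of (filename, size_in_bytes) for the files found.
    """
    found = []
    for name in PRUNABLE:
        path = Path(directory) / name
        if path.is_file():
            found.append((name, path.stat().st_size))
            if not dry_run:
                path.unlink()
    return found


# Example (report only):
#   for name, size in prune_spamassassin(Path.home() / ".spamassassin"):
#       print(f"{name}: {size / 1e6:.1f} MB")
```

Calling it again with dry_run=False removes the files. Keep in mind that deleting auto-whitelist throws away all whitelist history, which is the tradeoff described above.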

I've run for a few weeks since doing that, and I'm not noticing any difference in either the number of false positives or false negatives. (Both are, unfortunately, large enough to be noticeable, but that was true before the change as well.)

[ 20:38 May 07, 2009    More tech/email | permalink to this entry | ]

Wed, 12 Nov 2008

Spamassassin false positives: obsolete rules on Etch

I checked my Spam Assassin "probably" folder for the first time in too long, and discovered that I was getting tons of false positives: perfectly legitimate messages that were being filed as spam.

A little analysis of the X-Spam-Status: headers showed that all of the misfiled messages (and lots of messages that didn't quite make it over the threshold) were hitting a rule called DNS_FROM_SECURITYSAGE.
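
That kind of analysis is easy to script. The Python sketch below counts how many messages hit each rule, given the text of their X-Spam-Status headers. It assumes the usual layout in which the rule names follow tests= as a comma-separated list (possibly folded across lines) and the next field is a lowercase key such as autolearn=; the function names and the sample header values, including their scores, are made up for illustration.

```python
import re
from collections import Counter

# Capture everything after "tests=" up to the next lowercase "key=" field.
TESTS_RE = re.compile(r"tests=(.*?)(?:\s+[a-z]+=|$)", re.DOTALL)


def rules_hit(status_value):
    """Return the list of rule names from one X-Spam-Status header value."""
    match = TESTS_RE.search(status_value)
    if not match:
        return []
    names = (name.strip() for name in match.group(1).split(","))
    return [name for name in names if name and name.lower() != "none"]


def count_rules(status_values):
    """Count how many messages hit each rule."""
    counts = Counter()
    for value in status_values:
        counts.update(set(rules_hit(value)))
    return counts


# Made-up header values in the assumed layout.
samples = [
    "Yes, score=6.1 required=5.0 tests=BAYES_50,\n\tDNS_FROM_SECURITYSAGE,"
    "HTML_MESSAGE autolearn=no version=3.1.7",
    "No, score=3.9 required=5.0 tests=DNS_FROM_SECURITYSAGE autolearn=no",
]
for rule, hits in count_rules(samples).most_common():
    print(hits, rule)
```

A rule that shows up in nearly every misfiled message, as DNS_FROM_SECURITYSAGE did here, stands out at the top of the list.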

It turned out that this rule is obsolete and has been removed from Spam Assassin, but it hasn't yet been removed from Debian, at least not from Etch.

So I filed a Debian bug. Or at least I think I did -- I got an email acknowledgement from submit@bugs.debian.org but it didn't include a bug number and Debian's HyperEstraier based search engine linked off the bug page doesn't find it (I used reportbug).

Anyway, if you're getting lots of SECURITYSAGE false hits, edit /usr/share/spamassassin/20_dnsbl_tests.cf and comment out the lines for DNS_FROM_SECURITYSAGE and, while you're at it, the lines for RCVD_IN_DSBL, which is also obsolete. Just to be safe, you might also want to add
score DNS_FROM_SECURITYSAGE 0
in your .spamassassin/user_prefs (or equivalent systemwide file) as well.

Now if only I could figure out why it was setting FORGED_RCVD_HELO and UNPARSEABLE_RELAY on messages from what seem to be perfectly legitimate senders ...

[ 22:54 Nov 12, 2008    More linux | permalink to this entry | ]