Yet more on that comma-inserting regexp, plus a pattern to filter unprintable characters (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Wed, 24 Jul 2013

Yet more on that comma-inserting regexp, plus a pattern to filter unprintable characters

One more brief followup on that comma inserting sed pattern and its followup:

$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015

In the second article, I'd mentioned that the hardest part of the exercise was figuring out where we needed backslashes. Devdas (f3ew) asked on Twitter whether I would still need all the backslash escapes even if I put the pattern in a file -- in other worse, are the backslashes merely to get the shell to pass special characters unchanged?

A good question, and I suspected the need for some of the backslashes would disappear. So I tried this:

$ echo ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   
$ echo 20130607215015 | sed -f /tmp/commas

And it didn't work. No commas were inserted.

The problem, it turns out, is that my shell, zsh, changed both instances of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does

$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   

But that only applies to echo: zsh doesn't do the \b -> ^H substitution in the original command, where you pass the string directly as a sed argument.

Okay, with that straightened out, what about Devdas' question?

Surprisingly, it turns out that all the backslashes are still needed. None of them go away when you echo > file, so they weren't there just to get special characters past the shell; and if you edit the file and try removing some of the backslashes, you'll see that the pattern no longer works. I had thought at least some of them, like the ones before the \{ \}, were extraneous, but even those are still needed.

Filtering unprintable characters

As long as I'm writing about regular expressions, I learned a nice little tidbit last week. I'm getting an increasing flood of Asian-language spams which my mail ISP doesn't filter out (they use spamassassin, which is pretty useless for this sort of filtering). I wanted a simple pattern I could pass to egrep (via procmail) that would filter out anything with a run of more than 4 unprintable characters in a row. [^[:print:]]{4,} should do it, but it wasn't working.

The problem, it turns out, is the definition of what's printable. Apparently when the default system character set is UTF-8, just about everything is considered printable! So the trick is that you need to set LC_ALL to something more restrictive, like C (which basically means ASCII) to before :print: becomes useful for language-based filtering. (Thanks to Mikachu for spotting the problem).

So in a terminal, you can do something like

LC_ALL=C egrep -v '[^[:print:]]' filename

In procmail it was a little harder; I couldn't figure out any way to change LC_ALL from a procmail recipe; the only solution I came up with was to add this to ~/.procmailrc:

export LC_ALL=C

It does work, though, and has cut the spam load by quite a bit.

Tags: , , , ,
[ 19:35 Jul 24, 2013    More linux/cmdline | permalink to this entry | ]

Comments via Disqus:

blog comments powered by Disqus