Yet more on that comma-inserting regexp, plus a pattern to filter unprintable characters
One more brief followup on that comma inserting sed pattern and its followup:
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' 20,130,607,215,015
In the second article, I'd mentioned that the hardest part of the exercise was figuring out where we needed backslashes. Devdas (f3ew) asked on Twitter whether I would still need all the backslash escapes even if I put the pattern in a file -- in other worse, are the backslashes merely to get the shell to pass special characters unchanged?
A good question, and I suspected the need for some of the backslashes would disappear. So I tried this:
$ echo ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas $ echo 20130607215015 | sed -f /tmp/commas
And it didn't work. No commas were inserted.
The problem, it turns out, is that my shell, zsh, changed both instances of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does
$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas
But that only applies to echo: zsh doesn't do the \b -> ^H substitution in the original command, where you pass the string directly as a sed argument.
Okay, with that straightened out, what about Devdas' question?
Surprisingly, it turns out that all the backslashes are still needed.
None of them go away when you echo > file
, so they
weren't there just to get special characters past the shell; and if
you edit the file and try removing some of the backslashes, you'll
see that the pattern no longer works. I had thought at least some of them,
like the ones before the \{ \}, were extraneous, but even those are
still needed.
Filtering unprintable characters
As long as I'm writing about regular expressions, I learned a nice
little tidbit last week. I'm getting an increasing
flood of Asian-language spams which my mail ISP doesn't filter out (they
use spamassassin, which is pretty useless for this sort of filtering).
I wanted a simple pattern I could pass to egrep (via procmail) that
would filter out anything with a run of more than 4 unprintable characters
in a row. [^[:print:]]{4,}
should do it, but it wasn't working.
The problem, it turns out, is the definition of what's printable. Apparently when the default system character set is UTF-8, just about everything is considered printable! So the trick is that you need to set LC_ALL to something more restrictive, like C (which basically means ASCII) to before :print: becomes useful for language-based filtering. (Thanks to Mikachu for spotting the problem).
So in a terminal, you can do something like
LC_ALL=C egrep -v '[^[:print:]]' filename
In procmail it was a little harder; I couldn't figure out any way to change LC_ALL from a procmail recipe; the only solution I came up with was to add this to ~/.procmailrc:
export LC_ALL=C
It does work, though, and has cut the spam load by quite a bit.
[ 19:35 Jul 24, 2013 More linux/cmdline | permalink to this entry | ]