Shallow Thoughts : tags : sed
Akkana's Musings on Open Source Computing and Technology, Science, and Nature.
Wed, 24 Jul 2013
One more brief followup on that comma inserting sed pattern and its
followup:
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015
In the second article, I'd mentioned that the hardest part of the exercise
was figuring out where we needed backslashes.
Devdas (f3ew) asked on Twitter
whether I would still need all the backslash escapes even
if I put the pattern in a file -- in other words, are the backslashes
merely to get the shell to pass special characters unchanged?
A good question, and I suspected the need for some of the backslashes
would disappear. So I tried this:
$ echo ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas
$ echo 20130607215015 | sed -f /tmp/commas
And it didn't work. No commas were inserted.
The problem, it turns out, is that my shell, zsh, changed both instances
of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does
$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas
But that only applies to echo: zsh doesn't do the \b -> ^H substitution
in the original command, where you pass the string directly as a sed argument.
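If you'd rather not depend on how a particular shell's echo treats backslashes, printf with a %s format is one way around it, since it writes its argument through untouched. A minimal sketch, assuming GNU sed (for \b and \+); the function name and the temporary file are made up for the illustration:

```shell
# Hypothetical helper: store the sed script in a temp file, then run it with -f.
commas_from_file() {
    script=$(mktemp)
    # %s copies the argument literally, so \b is not turned into a backspace.
    printf '%s\n' ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' > "$script"
    printf '%s\n' "$1" | sed -f "$script"
    rm -f "$script"
}

commas_from_file 20130607215015
```

With GNU sed this should print 20,130,607,215,015, the same as passing the pattern directly on the command line.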
Okay, with that straightened out, what about Devdas' question?
Surprisingly, it turns out that all the backslashes are still needed.
None of them go away when you echo > file, so they
weren't there just to get special characters past the shell; and if
you edit the file and try removing some of the backslashes, you'll
see that the pattern no longer works. I had thought at least some of them,
like the ones before the \{ \}, were extraneous, but even those are
still needed.
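One way to convince yourself that the backslashes belong to sed's own regular expression syntax rather than to the shell: GNU sed's -E option (also spelled -r) switches to extended regular expressions, where +, { } and ( ) are special without a backslash. Here's a sketch of the same command in that syntax, assuming GNU sed; \b and the \1 \2 references keep their backslashes either way, and the function name is just for the example:

```shell
# Same comma-inserting loop, written with extended regular expressions (-E).
commas_ere() {
    sed -E ':a;s/\b([0-9]+)([0-9]{3})\b/\1,\2/;ta'
}

echo 20130607215015 | commas_ere
```

That should give the same 20,130,607,215,015 as the backslash-heavy version.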
Filtering unprintable characters
As long as I'm writing about regular expressions, I learned a nice
little tidbit last week. I'm getting an increasing
flood of Asian-language spams which my mail ISP doesn't filter out (they
use spamassassin, which is pretty useless for this sort of filtering).
I wanted a simple pattern I could pass to egrep (via procmail) that
would filter out anything with a run of 4 or more unprintable characters
in a row. [^[:print:]]{4,} should do it, but it wasn't working.
The problem, it turns out, is the definition of what's printable.
Apparently when the default system character set is UTF-8, just about
everything is considered printable! So the trick is that you need to
set LC_ALL to something more restrictive, like C (which basically means
ASCII) before [:print:] becomes useful for language-based filtering.
(Thanks to Mikachu for spotting the problem).
So in a terminal, you can do something like
LC_ALL=C egrep -v '[^[:print:]]' filename
In procmail it was a little harder; I couldn't figure out any way to
change LC_ALL from a procmail recipe; the only solution I came up
with was to add this to ~/.procmailrc:
export LC_ALL=C
It does work, though, and has cut the spam load by quite a bit.
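To see the locale effect in isolation, here's a small sketch that feeds grep one ASCII line and one line of UTF-8 encoded Chinese text (written as octal bytes so the example itself stays ASCII). It uses grep -E, the current spelling of egrep, and keeps the {4,} repetition from the pattern above. The function name is invented for the example, and the -a flag is my addition so that grep always treats the input as text.

```shell
# Hypothetical filter: drop any line containing 4 or more consecutive
# bytes that the C locale considers unprintable.
drop_unprintable_runs() {
    LC_ALL=C grep -a -Ev '[^[:print:]]{4,}'
}

# The second line is four Chinese characters (12 bytes, all >= 0x80) in UTF-8.
printf 'hello\n\344\275\240\345\245\275\344\270\226\347\225\214\n' | drop_unprintable_runs
```

Only the hello line should survive. Note that tabs and other control characters also count as unprintable here, so a line with four tabs in a row would be dropped too.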
Tags: zsh, regexp, sed, cmdline, grep
[ 19:35 Jul 24, 2013 ]
Tue, 09 Jul 2013
A few days ago I wrote about a nifty sed script to insert commas into
numbers that I dissected with the help of Dana Jansens.
Once we'd figured it out, though, Dana thought this wasn't really the best
solution. For instance, what if you have a file that has some numbers
in it, but also has some digits mixed up with letters? Do you really
want to insert commas into every string of digits? What if you have
some license plates, like abc1234? Maybe it would be better to
restrict the change to digits that stand by themselves and
are obviously meant to be numbers. How much harder would that be?
More regexp fun! We kicked it around a bit, and came up with a solution:
$ echo abc20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
abc20,130,607,215,015
$ echo abc20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
abc20130607215015
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015
Breaking that down: \b is any word boundary -- you could
also use \< to indicate that it's the start of a word, much like
\> was the end of a word.
\([0-9]\+\) is any string of one or more digits, taken as
a group. The \( \) part marks it as a group so we'll be
able to use it later.
\([0-9]\{3\}\) is a string of exactly three digits: again,
we're using \( \) to mark it as our second numbered group.
\b is another word boundary (we could use \>),
to indicate that the group of three digits must come at the end
of a word, with only whitespace or punctuation following it.
/\1,\2/: once we've matched the pattern -- a word break,
one or more digits, three digits and another word break -- we'll
replace it with this. \1 stands for the first group we found -- that
was the string of one or more digits. \2 stands for the second group,
the final trio of digits. And there's a comma in between.
We use the same :a; ;ta trick as in the first example
to loop around until there are no more triplets to match.
The hardest part of this was figuring out what needed to be escaped
with backslashes. The one that really surprised me was the \+.
Although * works in sed the same way it does in other
programs, matching zero or more repetitions of the preceding pattern,
sed uses \+ rather than + for one or more
repetitions. It took us some fiddling to find all the places we needed
backslashes.
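To see the difference on mixed text, here's a quick sketch that runs the stricter pattern over a line containing both a license-plate style string and a standalone number. It assumes GNU sed (\b and \+ are GNU extensions), and the function name is only for illustration.

```shell
# Insert commas only into digit strings that form a whole word.
commas_whole_numbers() {
    sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
}

echo 'plate abc1234 paid 1234567' | commas_whole_numbers
```

The plate should come through untouched while the amount becomes 1,234,567.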
Tags: regexp, sed, cmdline
[ 21:16 Jul 09, 2013 ]
Sun, 07 Jul 2013
Carla Schroder's recent article,
More Great Linux Awk, Sed, and Bash Tips and Tricks ,
had a nifty sed command I hadn't seen before to take a long number and
insert commas appropriately:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
Or, if you don't have a numbers.txt file, you can do something like
echo 20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
(I dropped the -i since that's for doing in-place edits of a file).
Nice! But why does it work?
It would be easy enough to insert commas after every third digit,
but that doesn't work unless the number of digits is a multiple of three.
In other words, you don't want 20130607215015 to become
201,306,072,150,15 (note how the last group only has two digits);
it has to count in threes from the right if you want to end up
with 20,130,607,215,015.
Carla's article didn't explain it, and neither did any of the other
sites I found that mentioned this trick.
So, with some help from regexp wizard Dana Jansens (of
OpenBox fame), I've broken it down
into more easily understood bits.
Labels and loops
The first thing to understand is that this is actually several sed commands.
I was familiar with sed's basic substitute command, s/from/to/.
But what's the rest of it? The semicolons separate the commands, so
the whole sed script is:
:a
s/\B[0-9]\{3\}\>/,&/
ta
What this does is set up a label called a. It tries to do the
substitute command, and if the substitute succeeds (if something
was changed), then ta tells it to loop back around to
label a, the beginning of the script.
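The same three commands can also be handed to sed one at a time with separate -e options, which some people find easier to read than the semicolon form. A sketch, assuming GNU sed for \B and \>; add_commas is just a name picked for the example:

```shell
# Three sed commands, one per -e:
#   :a                      define a label named a
#   s/\B[0-9]\{3\}\>/,&/    try to insert one comma
#   ta                      if the substitute changed something, jump back to a
add_commas() {
    sed -e ':a' \
        -e 's/\B[0-9]\{3\}\>/,&/' \
        -e 'ta'
}

echo 20130607215015 | add_commas
```

This should behave exactly like the one-liner and print 20,130,607,215,015.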
So let's look at that substitute command.
The substitute
Sed's s/from/to/ (like the equivalent command in vim and many
other programs) looks for the first instance of the from pattern
and replaces it with the to pattern. So we're searching for
\B[0-9]\{3\}\>
and replacing it with
,&
Clear as mud, right? Well, the to pattern is easy: & stands for
whatever the from pattern just matched, so this just
sticks a comma in front of ... something.
The from pattern, \B[0-9]\{3\}\>, is a bit more
challenging. Let's break down the various groups:
\B
    Matches anything that is not a word boundary.
[0-9]
    Matches any digit.
\{3\}
    Matches three repetitions of whatever precedes it (in this case, a digit).
\>
    Matches a word boundary at the end of a word. This was the hardest part
    to figure out, because no sed documentation anywhere bothers to mention
    this pattern. But Dana knew it as a vim pattern, and it turns out it
    does the same thing in sed even though the docs don't say so.
Okay, put them together, and the whole pattern matches any three digits
that are not preceded by a word boundary but which are
at the end of a word (i.e. they're followed by a word boundary).
Cool! So in our test number, 20130607215015, this matches the last
three digits, 015. It doesn't match any of the other digits because
they're not followed by a word end boundary.
So the substitute will insert a comma before the last three digits.
Let's test that:
$ echo 20130607215015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607215,015
Sure enough!
How the loop works
So the substitution pattern just adds the last comma.
Once the comma is inserted, the ta tells sed to go back
to the beginning (label :a) and do it again.
The second time, the comma that was just inserted is now a word
boundary, so the pattern matches the three digits before the comma,
215, and inserts another comma before them. Let's make sure:
$ echo 20130607215,015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607,215,015
So that's how the pattern manages to match triplets from right to left.
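If you want to watch every pass rather than just the first two, here's a rough sketch that does by hand what the :a ... ta loop does inside sed: it applies the single substitute over and over and prints each intermediate result until nothing changes. It assumes GNU sed, and the function name is invented for the example.

```shell
# Hypothetical helper: print the string after each single substitution,
# stopping once the substitute no longer changes anything.
show_comma_steps() {
    current=$1
    while :; do
        next=$(printf '%s\n' "$current" | sed 's/\B[0-9]\{3\}\>/,&/')
        [ "$next" = "$current" ] && break
        printf '%s\n' "$next"
        current=$next
    done
}

show_comma_steps 20130607215015
```

For the test number this should print four lines, starting with 20130607215,015 and ending with 20,130,607,215,015, each one with a new comma further to the left.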
Dana later commented that this wasn't really the best solution -- what
if the string of digits is attached to other characters and isn't
really a number? I'll cover that in a separate article in a few days.
Update: Here's the smarter pattern,
Sed: insert commas into numbers, but in a smarter way.
Tags: regexp, sed, cmdline
[ 14:14 Jul 07, 2013 ]
Sun, 18 Dec 2011
A friend had a fun problem: she had some XML files she needed to
import into GNUcash, but the program that produced them left names
in all-caps and she wanted them more readable. So she'd have a file
like this:
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20111125000000[-5:EST]
<TRNAMT>-22.71
<FITID>****
<NAME>SOME COMPANY
<MEMO>SOME COMPANY ANY TOWN CA 11-25-11 330346
</STMTTRN>
and wanted to change the NAME and MEMO lines to read
Some Company and Any Town. However, the tags, like <NAME>,
all had to remain upper case, and presumably so did strings like DEBIT.
How do you change just the NAME and MEMO lines from upper case to title case?
The obvious candidate to do string substitutes is sed.
But there are several components to the problem.
Addresses
First, how do you ensure the replacement only happens on lines with
NAME and MEMO?
sed lets you specify address ranges for just that purpose.
If you say sed 's/xxx/yyy/' sed will change all xxx's
to yyy; but if you say sed '/NAME/s/xxx/yyy/'
then sed will only do that substitution on lines containing NAME.
But we need this to happen on lines that contain either NAME or MEMO.
How do you do that? With \|, like this:
sed '/\(NAME\|MEMO\)/s/xxx/yyy/'
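Here's a small sketch of the address in action, using a throwaway substitution so it's easy to see which lines were touched. It assumes GNU sed, since \| in a basic regular expression is a GNU extension; the function name and the COMPANY to CO. replacement are made up for the demonstration.

```shell
# Hypothetical helper: shorten COMPANY to CO., but only on lines
# that contain NAME or MEMO.
shorten_company() {
    sed '/\(NAME\|MEMO\)/s/COMPANY/CO./'
}

printf '<TRNTYPE>COMPANY\n<NAME>SOME COMPANY\n' | shorten_company
```

The TRNTYPE line should pass through unchanged, while the NAME line becomes <NAME>SOME CO.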
Converting to title case
Next, how do you convert upper case to lower case?
There's a sed command for that: \L. Run sed 's/.*/\L&/'
and type some upper and lower case
characters, and they'll all be converted to lower-case.
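If you'd rather not type at sed interactively, the same thing works in a pipeline. A tiny sketch, assuming GNU sed (as far as I know \L is a GNU extension, so other seds may not accept it); the function name is just for the example:

```shell
# Lowercase the whole line: & is the matched text, \L lowercases what follows.
to_lower() {
    sed 's/.*/\L&/'
}

echo 'Hello WORLD' | to_lower
```

That should print hello world.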
But here we want title case -- we want most of each word converted
to lowercase, but the first letter should stay uppercase.
That means we need to detect a word and figure out which is the
first letter.
In the strings we're considering, a word is a set of letters A through Z
with one of the following characteristics:
- It's preceded by a space
- It's preceded by a close-angle-bracket, >
So the pattern /[ >][A-Z]*/ will match anything we consider a word
that might need conversion.
But we need to separate the first letter and the rest of the word,
so we can treat them separately. sed's \( \) operators will let us do that.
The pattern \([ >][A-Z]\) finds the first letter of a word (including
the space or > preceding it), and saves that as its first matched
pattern, \1.
Then \([A-Z]*\) right after it will save the rest of the word as \2.
So, taking our \L case converter, we can convert to title case like this:
sed 's/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'
Starting to look long and scary, right? But it's not so bad if you build
it up gradually from components. I added a g on the end to tell sed
this is a global replace: do the operation on every word it finds in
the line, otherwise it will only make the substitution once, on the
first word it sees, then quit.
Putting it together
So we know how to seek out specific lines, and how to convert to
title case. Put the two together, and you get the final command:
sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'
I ran it on the test input, and it worked just fine.
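For anyone who wants to try it, here's a sketch that runs the final command over a few lines of the sample record. It assumes GNU sed, because both \| and \L are GNU extensions; the function name is invented for the example.

```shell
# Title-case the words on NAME and MEMO lines, leaving the tags alone.
title_case_names() {
    sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'
}

printf '%s\n' \
    '<TRNTYPE>DEBIT' \
    '<NAME>SOME COMPANY' \
    '<MEMO>SOME COMPANY ANY TOWN CA 11-25-11 330346' | title_case_names
```

The DEBIT line should stay as it is, and the other two should come out as <NAME>Some Company and <MEMO>Some Company Any Town Ca 11-25-11 330346. Note that the state abbreviation gets title-cased along with everything else, so CA turns into Ca.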
For more information on sed, a good place to start is the
sed regular expressions manual.
Tags: regexp, cmdline, sed
[ 14:13 Dec 18, 2011 ]