Inserting commas into numbers with sed
Carla Schroder's recent article, More Great Linux Awk, Sed, and Bash Tips and Tricks, had a nifty sed command I hadn't seen before to take a long number and insert commas appropriately:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
Or, if you don't have a numbers.txt file, you can do something like
echo 20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
(I dropped the -i since that's for doing in-place edits of a file.)
Nice! But why does it work? It would be easy enough to insert a comma after every third digit, counting from the left, but that doesn't work unless the number of digits is a multiple of three. In other words, you don't want 20130607215015 to become 201,306,072,150,15 (note how the last group only has two digits); it has to count in threes from the right if you want to end up with 20,130,607,215,015.
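To see the failure concretely, here is a minimal sketch of the naive left-to-right approach. The function name naive_commas is made up for illustration; it only assumes a sed that understands \{3\} intervals.

```shell
# Naive approach: put a comma after every run of three digits,
# scanning from the left (hypothetical helper, for illustration only).
naive_commas() {
    printf '%s\n' "$1" | sed 's/[0-9]\{3\}/&,/g'
}

naive_commas 20130607215015   # prints 201,306,072,150,15
```

The groups are counted from the wrong end, so the leftover digits land on the right instead of the left.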
Carla's article didn't explain it, and neither did any of the other sites I found that mentioned this trick.
So, with some help from regexp wizard Dana Jansens (of OpenBox fame), I've broken it down into more easily understood bits.
Labels and loops
The first thing to understand is that this is actually several sed commands. I was familiar with sed's basic substitute command, s/from/to/. But what's the rest of it? The semicolons separate the commands, so the whole sed script is:
:a
s/\B[0-9]\{3\}\>/,&/
ta
What this does is set up a label called a. It tries to do the substitute command, and if the substitute succeeds (if something was changed), then ta tells it to loop back around to label a, the beginning of the script. If nothing was changed, sed falls through to the end of the script and prints the line.
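The label-and-branch loop is easier to see on a toy input. This is a small sketch, not part of the original trick: the helper names once and looped are hypothetical, and the script is written with one -e per command, which is equivalent to the semicolon form.

```shell
# One substitution, no loop: only the first "a" is replaced.
once() {
    printf '%s\n' "$1" | sed 's/a/b/'
}

# Same substitution inside a :a ... ta loop: sed keeps jumping back
# to label a until the substitution no longer changes anything.
looped() {
    printf '%s\n' "$1" | sed -e ':a' -e 's/a/b/' -e 'ta'
}

once aaaa     # prints baaa
looped aaaa   # prints bbbb
```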
So let's look at that substitute command.
The substitute
Sed's s/from/to/ (like the equivalent command in vim and many other programs) looks for the first instance of the from pattern and replaces it with the to pattern. So we're searching for
\B[0-9]\{3\}\>
and replacing it with
,&
Clear as mud, right? Well, the to pattern is easy: & stands for whatever the from pattern just matched, so this just sticks a comma in front of ... something.
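A quick way to convince yourself what & does is to wrap it in brackets. This is just a sketch; bracket_match is a made-up helper name.

```shell
# & in the replacement is the text the pattern matched,
# so this wraps the first run of digits in brackets.
bracket_match() {
    printf '%s\n' "$1" | sed 's/[0-9][0-9]*/[&]/'
}

bracket_match 'order 66 shipped'   # prints order [66] shipped
```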
The from pattern, \B[0-9]\{3\}\>, is a bit more challenging. Let's break down the various groups:
- \B
- Matches any position that is not a word boundary. It matches a spot between characters, not a character itself.
- [0-9]
- Matches any digit.
- \{3\}
- Matches three repetitions of whatever precedes it (in this case, a digit).
- \>
- Matches a word boundary at the end of a word. This was the hardest part to figure out, because none of the sed documentation I found mentions this pattern. But Dana knew it as a vim pattern, and it turns out it does the same thing in sed even though those docs don't say so.
Okay, put them together, and the whole pattern matches any three digits that are not preceded by a word boundary but which are at the end of a word (i.e. they're followed by a word boundary).
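The pieces can be checked one input at a time. This sketch assumes GNU sed, since \B and \> are GNU extensions; mark_match is a hypothetical helper that brackets whatever the pattern matches and otherwise leaves the line unchanged.

```shell
# Bracket the first match of the article's pattern, if there is one.
mark_match() {
    printf '%s\n' "$1" | sed 's/\B[0-9]\{3\}\>/[&]/'
}

mark_match 1234      # prints 1[234]: "234" follows a digit and ends the word
mark_match 123       # prints 123: \B fails at the start of the word
mark_match '12 345'  # prints 12 345: "345" starts a word, so \B fails
mark_match 12345x    # prints 12345x: the word ends in x, so \> fails
```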
Cool! So in our test number, 20130607215015, this matches the last three digits, 015. It doesn't match any of the other digits because they're not followed by a word end boundary.
So the substitute will insert a comma before the last three digits. Let's test that:
$ echo 20130607215015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607215,015
Sure enough!
How the loop works
So the substitution pattern just adds the last comma. Once the comma is inserted, the ta tells sed to go back to the beginning (label :a) and do it again.
The second time, the comma that was just inserted is now a word boundary, so the pattern matches the three digits before the comma, 215, and inserts another comma before them. Let's make sure:
$ echo 20130607215,015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607,215,015
So that's how the pattern manages to match triplets from right to left.
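To watch the right-to-left progression, the loop can be unrolled in the shell: run the single substitution repeatedly and print each intermediate result. This is only a sketch of what the :a ... ta loop does inside sed, assuming GNU sed; trace_commas is a hypothetical name.

```shell
# Apply the one-shot substitution over and over, printing every step.
trace_commas() {
    n=$1
    while :; do
        next=$(printf '%s\n' "$n" | sed 's/\B[0-9]\{3\}\>/,&/')
        # Stop when the substitution no longer changes anything,
        # which is the point where ta would not branch.
        [ "$next" = "$n" ] && break
        n=$next
        printf '%s\n' "$n"
    done
}

trace_commas 20130607215015
# 20130607215,015
# 20130607,215,015
# 20130,607,215,015
# 20,130,607,215,015
```

The loop stops after the fourth comma because the remaining 20 has only two digits, so the pattern no longer matches.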
Dana later commented that this wasn't really the best solution -- what if the string of digits is attached to other characters and isn't really a number? I'll cover that in a separate article in a few days. Update: Here's the smarter pattern, Sed: insert commas into numbers, but in a smarter way.
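The limitation Dana pointed out is easy to reproduce. This is a sketch, again assuming GNU sed, with add_commas as a hypothetical wrapper around the command from Carla's article and part12345 as a made-up input:

```shell
# The original command, wrapped in a function for convenience.
add_commas() {
    printf '%s\n' "$1" | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
}

add_commas 20130607215015   # prints 20,130,607,215,015
add_commas part12345        # prints part12,345: digits glued to letters still get a comma
```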
[ 14:14 Jul 07, 2013 ]