Inserting commas into numbers with sed (Shallow Thoughts)

Akkana's Musings on Open Source Computing, Science, and Nature.

Sun, 07 Jul 2013

Inserting commas into numbers with sed

Carla Schroder's recent article, More Great Linux Awk, Sed, and Bash Tips and Tricks, had a nifty sed command I hadn't seen before to take a long number and insert commas appropriately:

sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
Or, if you don't have a numbers.txt file, you can do something like
echo 20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
(I dropped the -i since that's for doing in-place edits of a file).

Nice! But why does it work? It would be easy enough to insert a comma after every third digit, but that doesn't work unless the number of digits is a multiple of three. In other words, you don't want 20130607215015 to become 201,306,072,150,15 (note how the last group only has two digits); it has to count in threes from the right if you want to end up with 20,130,607,215,015.
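To make the problem concrete, here's a quick sketch of the naive left-to-right approach. The function name naive_commas is just something I made up for illustration; the pattern is plain sed and simply sticks a comma after each group of three digits.

```shell
# Naive approach: put a comma after every group of three digits,
# counting from the left. (naive_commas is a made-up name.)
naive_commas() {
    sed 's/[0-9]\{3\}/&,/g'
}

echo 20130607215015 | naive_commas
# prints: 201,306,072,150,15
```

The leftover two digits end up at the right-hand end, which is exactly the wrong place.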

Carla's article didn't explain it, and neither did any of the other sites I found that mentioned this trick.

So, with some help from regexp wizard Dana Jansens (of OpenBox fame), I've broken it down into more easily understood bits.

Labels and loops

The first thing to understand is that this is actually several sed commands. I was familiar with sed's basic substitute command, s/from/to/. But what's the rest of it? The semicolons separate the commands, so the whole sed script is:

:a
s/\B[0-9]\{3\}\>/,&/
ta

What this does is set up a label called a. It tries to do the substitute command, and if the substitute succeeds (if something was changed), then ta tells it to loop back around to label a, the beginning of the script.
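If the one-liner is hard to read, the same three commands can be handed to sed one at a time with separate -e options. This is only a sketch, assuming GNU sed (which is what I tested the one-liner with); commify is a hypothetical name I picked for the wrapper function.

```shell
# The same script as ':a;s/.../,&/;ta', one sed command per -e option.
# Assumes GNU sed; commify is a made-up helper name.
commify() {
    sed -e ':a' \
        -e 's/\B[0-9]\{3\}\>/,&/' \
        -e 'ta'
}

echo 20130607215015 | commify
# prints: 20,130,607,215,015
```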

So let's look at that substitute command.

The substitute

Sed's s/from/to/ (like the equivalent command in vim and many other programs) looks for the first instance of the from pattern on each line and replaces it with the to pattern. So we're searching for \B[0-9]\{3\}\> and replacing it with ,&

Clear as mud, right? Well, the to pattern is easy: & stands for whatever text the from pattern just matched, so this just sticks a comma in front of ... something.
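Here's a tiny sketch of & on its own, away from the comma problem. The function name bracket_match and the sample text are made up for illustration; the pattern matches the first run of digits and the replacement wraps that matched text in square brackets.

```shell
# & in the replacement is the text that the pattern matched.
# (bracket_match is a made-up name for this demo.)
bracket_match() {
    sed 's/[0-9][0-9]*/[&]/'
}

echo "order 66 shipped" | bracket_match
# prints: order [66] shipped
```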

The from pattern, \B[0-9]\{3\}\>, is a bit more challenging. Let's break down the various groups:

\B
Matches any position that is not a word boundary, such as the gap between two digits. It doesn't consume any characters itself.
[0-9]
Matches any digit.
\{3\}
Matches three repetitions of whatever precedes it (in this case, a digit).
\>
Matches a word boundary at the end of a word. This was the hardest part to figure out, because none of the sed documentation I found mentioned this pattern. But Dana knew it as a vim pattern, and it turns out it does the same thing in sed.

Okay, put them together, and the whole pattern matches any three digits that are not preceded by a word boundary but which are at the end of a word (i.e. they're followed by a word boundary).
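A few quick experiments make that description easier to believe. This is a sketch that assumes GNU sed, since \B and \> are GNU regex extensions; one_pass is a hypothetical name for a function that runs the substitute exactly once, with no loop.

```shell
# Run the substitute a single time (no label, no loop).
# Assumes GNU sed; one_pass is a made-up helper name.
one_pass() {
    sed 's/\B[0-9]\{3\}\>/,&/'
}

echo 1234 | one_pass
# prints: 1,234   (234 ends the word, and there's a digit before it)

echo 123 | one_pass
# prints: 123     (123 ends the word, but it's preceded by a word boundary)

echo "abc 1234 xyz" | one_pass
# prints: abc 1,234 xyz   (the space after 1234 ends the word)
```

The 123 case is the important one: \B is what keeps sed from putting a comma at the very front of a number.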

Cool! So in our test number, 20130607215015, this matches the last three digits, 015. It doesn't match any of the other digits because they're not followed by a word end boundary.

So the substitute will insert a comma before the last three digits. Let's test that:

$ echo 20130607215015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607215,015

Sure enough!

How the loop works

So the substitution pattern just adds the last comma. Once the comma is inserted, the ta tells sed to go back to the beginning (label :a) and do it again.

The second time, the comma that was just inserted isn't a word character, so there's now a word boundary just before it. That means the pattern matches the three digits before the comma, 215, and inserts another comma before them. Let's make sure:

$ echo 20130607215,015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607,215,015

So that's how the pattern manages to match triplets from right to left.
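To watch every pass rather than just the first two, the loop can be unrolled in the shell. This is a rough sketch, assuming GNU sed; trace_commas is a name I invented. It runs the single substitute over and over and prints each intermediate result, stopping when a pass changes nothing, which is the same condition that makes ta fall through.

```shell
# Imitate sed's :a ... ta loop in the shell, printing every step.
# Assumes GNU sed; trace_commas is a made-up helper name.
trace_commas() {
    current=$1
    while :; do
        next=$(printf '%s\n' "$current" | sed 's/\B[0-9]\{3\}\>/,&/')
        # No change means the substitute failed, so the loop is done.
        [ "$next" = "$current" ] && break
        echo "$next"
        current=$next
    done
}

trace_commas 20130607215015
# prints:
# 20130607215,015
# 20130607,215,015
# 20130,607,215,015
# 20,130,607,215,015
```

The last pass leaves 20 alone because there aren't three digits left in front of the first comma.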

Dana later commented that this wasn't really the best solution -- what if the string of digits is attached to other characters and isn't really a number? I'll cover that in a separate article in a few days. Update: Here's the smarter pattern, Sed: insert commas into numbers, but in a smarter way.
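Here's a small sketch of the kind of input Dana was worried about, assuming GNU sed. The function name commify and the sample strings are made up for illustration. Since letters and digits are all word characters, the pattern happily puts commas into digits that are glued onto a name.

```shell
# The original one-liner, wrapped in a made-up helper name.
# Assumes GNU sed.
commify() {
    sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
}

echo "file20130607.txt" | commify
# prints: file20,130,607.txt

echo "abc1234" | commify
# prints: abc1,234
```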

