Sed: insert commas into numbers, but in a smarter way (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Tue, 09 Jul 2013

Sed: insert commas into numbers, but in a smarter way

A few days ago I wrote about a nifty sed script to insert commas into numbers that I dissected with the help of Dana Jansens.

Once we'd figured it out, though, Dana thought this wasn't really the best solution. For instance, what if you have a file that has some numbers in it, but also has some digits mixed up with letters? Do you really want to insert commas into every string of digits? What if you have some license plates, like abc1234? Maybe it would be better to restrict the change to digits that stand by themselves and are obviously meant to be numbers. How much harder would that be?

More regexp fun! We kicked it around a bit, and came up with a solution:

$ echo abc20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
abc20,130,607,215,015
$ echo abc20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
abc20130607215015
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'   
20,130,607,215,015

Breaking that down: \b is any word boundary -- you could also use \< to indicate that it's the start of a word, much like \> was the end of a word.

\([0-9]\+\) is any string of one or more digits, taken as a group. The \( \) part marks it as a group so we'll be able to use it later.

\([0-9]\{3\}\) is a string of exactly three digits: again, we're using \( \) to mark it as our second numbered group.

\b is another word boundary (we could use \>), to indicate that the group of three digits must come at the end of a word, with only whitespace or punctuation following it.

/\1,\2/: once we've matched the pattern -- a word break, one or more digits, three digits and another word break -- we'll replace it with this. \1 matches the first group we found -- that was the string of one or more digits. \2 matches the second group, the final trio of digits. And there's a comma in between. We use the same :a; ;ta trick as in the first example to loop around until there are no more triplets to match.

The hardest part of this was figuring out what needed to be escaped with backslashes. The one that really surprised me was the \+. Although * works in sed the same way it does in other programs, matching zero or more repetitions of the preceding pattern, sed uses \+ rather than + for one or more repetitions. It took us some fiddling to find all the places we needed backslashes.

Tags: , ,
[ 21:16 Jul 09, 2013    More linux/cmdline | permalink to this entry | ]

Comments via Disqus:

blog comments powered by Disqus