Sed: insert commas into numbers, but in a smarter way (Shallow Thoughts)

Akkana's Musings on Open Source Computing, Science, and Nature.

Tue, 09 Jul 2013

Sed: insert commas into numbers, but in a smarter way

A few days ago I wrote about a nifty sed script to insert commas into numbers that I dissected with the help of Dana Jansens.

Once we'd figured it out, though, Dana thought this wasn't really the best solution. For instance, what if you have a file that has some numbers in it, but also has some digits mixed up with letters? Do you really want to insert commas into every string of digits? What if you have some license plates, like abc1234? Maybe it would be better to restrict the change to digits that stand by themselves and are obviously meant to be numbers. How much harder would that be?

More regexp fun! We kicked it around a bit, and came up with a solution:

$ echo abc20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
$ echo abc20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'   

Breaking that down: \b is any word boundary -- you could also use \< to indicate that it's the start of a word, much like \> was the end of a word.

\([0-9]\+\) is any string of one or more digits, taken as a group. The \( \) part marks it as a group so we'll be able to use it later.

\([0-9]\{3\}\) is a string of exactly three digits: again, we're using \( \) to mark it as our second numbered group.

\b is another word boundary (we could use \>), to indicate that the group of three digits must come at the end of a word, with only whitespace or punctuation following it.

/\1,\2/: once we've matched the pattern -- a word break, one or more digits, three digits and another word break -- we'll replace it with this. \1 matches the first group we found -- that was the string of one or more digits. \2 matches the second group, the final trio of digits. And there's a comma in between. We use the same :a; ;ta trick as in the first example to loop around until there are no more triplets to match.

The hardest part of this was figuring out what needed to be escaped with backslashes. The one that really surprised me was the \+. Although * works in sed the same way it does in other programs, matching zero or more repetitions of the preceding pattern, sed uses \+ rather than + for one or more repetitions. It took us some fiddling to find all the places we needed backslashes.

Tags: , ,
[ 20:16 Jul 09, 2013    More linux/cmdline | permalink to this entry | comments ]
(Commenting requires Javascript from and, and a cookie from
blog comments powered by Disqus

Syndicated on:
LinuxChix Live
Ubuntu Women
Women in Free Software
Graphics Planet
Ubuntu California
Planet Openbox
Planet LCA2009

Friends' Blogs:
Morris "Mojo" Jones
Jane Houston Jones
Dan Heller
Long Live the Village Green
Ups & Downs

Other Blogs of Interest:
Scott Adams
Dave Barry

Powered by PyBlosxom.