Shallow Thoughts : : cmdline

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 01 Oct 2016

Zsh magic: remove all raw photos that don't have a corresponding JPEG

Lately, when shooting photos with my DSLR, I've been shooting raw mode but with a JPEG copy as well. When I triage and label my photos (with pho and metapho), I use only the JPEG files, since they load faster and there's no need to index both. But that means that sometimes I delete a .jpg file while the huge .cr2 raw file is still on my disk.

I wanted some way of removing these orphaned raw files: in other words, for every .cr2 file that doesn't have a corresponding .jpg file, delete the .cr2.

That's an easy enough shell function to write: loop over *.cr2, change the .cr2 extension to .jpg, check whether that file exists, and if it doesn't, delete the .cr2.

But as I started to write the shell function, it occurred to me: this is just the sort of magic trick zsh tends to have built in.

So I hopped on over to #zsh and asked, and in just a few minutes, I had an answer:

rm *.cr2(e:'[[ ! -e ${REPLY%.cr2}.jpg ]]':)

Yikes! And it works! But how does it work? It's cheating to rely on people in IRC channels without trying to understand the answer so I can solve the next similar problem on my own.

Most of the answer is in the zshexpn man page, but it still took some reading and jumping around to put the pieces together.

First, we take all files matching the initial wildcard, *.cr2. Then we apply to each of them the expression in parentheses after the wildcard, which zsh calls a glob qualifier. (Bare glob qualifiers like this depend on the BARE_GLOB_QUAL option, which is on by default in zsh; EXTENDED_GLOB is only needed for the alternative (#q...) form.)

The variable $REPLY is set to the filename the wildcard expression matched; so it will be set to each .cr2 filename, e.g. img001.cr2.

The expression ${REPLY%.cr2} removes the .cr2 extension. Then we tack on a .jpg: ${REPLY%.cr2}.jpg. So now we have img001.jpg.

[[ ! -e ${REPLY%.cr2}.jpg ]] checks for the existence of that jpg filename, just like in a shell script.

So that explains the quoted shell expression. The final, and hardest part, is how to use that quoted expression. That's in section 14.8.7 Glob Qualifiers. (estring) executes string as shell code, and the filename will be included in the list if and only if the code returns a zero status.

The colons -- after the e and before the closing parenthesis -- are just separator characters. Whatever character immediately follows the e will be taken as the separator, and anything from there to the next instance of that separator (the second colon, in this case) is taken as the string to execute. Colons seem to be the character to use by convention, but you could use anything. This is also the part of the expression responsible for setting $REPLY to the filename being tested.

So why the quotes inside the colons? They're because some of the substitutions being done would be evaluated too early without them: "Note that expansions must be quoted in the string to prevent them from being expanded before globbing is done. string is then executed as shell code."

Whew! Complicated, but awfully handy. I know I'll have lots of other uses for that.
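For comparison, here is the explicit loop described at the start (take each .cr2, swap the extension, check for the .jpg) as a minimal Python sketch. The function name orphaned_raws and its parameters are my own placeholders, not part of pho or zsh, and it only lists the orphans rather than deleting anything.

```python
from pathlib import Path


def orphaned_raws(directory, raw_ext=".cr2", jpeg_ext=".jpg"):
    """Return raw files in `directory` that have no JPEG with the same stem."""
    orphans = []
    for raw in sorted(Path(directory).glob(f"*{raw_ext}")):
        # with_suffix swaps the extension, much like ${REPLY%.cr2}.jpg
        if not raw.with_suffix(jpeg_ext).exists():
            orphans.append(raw)
    return orphans


if __name__ == "__main__":
    # Only print the candidates; call raw.unlink() once the list looks right.
    for raw in orphaned_raws("."):
        print(raw)
```

Printing first and deleting later is the same caution as trying the glob with ls before handing it to rm.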

One additional note: section 14.8.5, Approximate Matching, in that manual page caught my eye. zsh can do fuzzy matches! I can't think offhand what I need that for ... but I'm sure an idea will come to me.

[ 15:28 Oct 01, 2016 ]

Fri, 04 Dec 2015

Distclean part 2: some useful zsh tricks

I wrote recently about a zsh shell function to run make distclean on a source tree even if something in autoconf is messed up. In order to save any arguments you've previously passed to configure, my function parsed the arguments from a file called config.log.

But it might be a bit more reliable to use config.status -- I'm guessing this is the file that make uses when it finds it needs to re-run the configuration. However, the syntax in that file is more complicated, and parsing it taught me some useful zsh tricks.

I can see the relevant line from config.status like this:

$ grep '^ac_cs_config' config.status
ac_cs_config="'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'"

--enable-foo --disable-bar are options I added purely for testing. I wanted to make sure my shell function would work with multiple arguments.

Ultimately, I want my shell function to re-run the configuration with --prefix=/usr/local/gimp-git --enable-foo --disable-bar. The goal is to end up with $args being a zsh array containing those three arguments. So I'll need to edit out those quotes and split the line into an array.

Sed tricks

The first thing to do is to get rid of that initial ac_cs_config= in the line from config.status. That's easy with sed:

$ grep '^ac_cs_config' config.status | sed -e 's/ac_cs_config=//'
"'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'"

But since we're using sed anyway, there's no need to use grep to get the line: we can do it all with sed. First try:

sed -n '/^ac_cs_config/s/ac_cs_config=//p' config.status

Search for the line that starts with ac_cs_config (^ matches the beginning of a line); then replace ac_cs_config= with nothing, and p print the resulting line. -n tells sed not to print anything except when told to with a p.

But it turns out that if you give a sed substitution a blank pattern, it uses the last pattern it was given. So a more compact version, using the search pattern ^ac_cs_config, is:

sed -n '/^ac_cs_config=/s///p' config.status

But there's also another way of doing it:

sed '/^ac_cs_config=/!d;s///' config.status

! after a search pattern matches every line that doesn't match the pattern. d deletes those lines. Then for lines that weren't deleted (the one line that does match), do the substitution. Since there's no -n, sed will print all lines that weren't deleted.

I find that version more difficult to read. But I'm including it because it's useful to know how to chain several commands in sed, and how to use ! to search for lines that don't match a pattern.
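If the sed address-plus-empty-substitution logic is hard to picture, here is a rough Python sketch of the same steps: skip lines that don't match, and on the ones that do, delete whatever the address matched. The function name extract_assignment is a placeholder of mine, and this only imitates what the sed commands above do.

```python
import re


def extract_assignment(text, name="ac_cs_config"):
    """Mimic sed -n '/^name=/s///p': return matching lines minus the prefix."""
    prefix = re.compile(r"^" + re.escape(name) + "=")
    values = []
    for line in text.splitlines():
        # Non-matching lines are skipped (sed's !d); on the rest, the
        # empty-pattern s/// removes exactly what the address matched.
        if prefix.search(line):
            values.append(prefix.sub("", line, count=1))
    return values
```

Reusing the compiled prefix for both the test and the substitution is the Python counterpart of sed reusing its last pattern.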

You can also use sed to eliminate the double quotes:

sed '/^ac_cs_config=/!d;s///;s/"//g' config.status
'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'
But it turns out that zsh has a better way of doing that.

Zsh parameter substitution

I'm still relatively new to zsh, but I got some great advice on #zsh. The first suggestion:

sed -n '/^ac_cs_config=/s///p' config.status | IFS= read -r; args=( ${(Q)${(z)${(Q)REPLY}}} ); print -rl - $args

I'll be using that final print -rl - $args for all these examples: it prints an array variable with one member per line. For the actual distclean function, of course, I'll be passing the variable to the configuration command, not printing it out.

First, let's look at the heart of that expression: args=( ${(Q)${(z)${(Q)REPLY}}} ), or more generally ${(Q)${(z)${(Q)x}}}. The zsh parameter substitution syntax is a bit arcane, but each of the parenthesized letters does some operation on the variable that follows.

The first (Q) strips off a level of quoting. So:

$ x='"Hello world"'; print $x; print ${(Q)x}
"Hello world"
Hello world

(z) splits an expression and stores it in an array. But to see that, we have to use print -l, so array members will be printed on separate lines.

$ x="a b c"; print -l $x; print "....."; print -l ${(z)x}
a b c
.....
a
b
c

Zsh is smart about quotes, so if you have quoted expressions it will group them correctly when assigning array members:

$ x="'a a' 'b b' 'c c'"; print -l $x; print "....."; print -l ${(z)x}
'a a' 'b b' 'c c'
.....
'a a'
'b b'
'c c'

So let's break down the larger expression: this is best read from right to left, inner expressions to outer.

${(Q) ${(z) ${(Q) x }}}
   |     |     |   \
   |     |     |    The original expression, 
   |     |     |   "'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'"
   |     |     \
   |     |      Strip off the double quotes:
   |     |      '--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'
   |     \
   |      Split into an array of three items
    Strip the single quotes from each array member,
    ( --prefix=/usr/local/gimp-git --enable-foo --disable-bar )

For more on zsh parameter substitutions, see the Zsh Guide, Chapter 5: Substitutions.
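The same unquote, split, unquote sequence can be sketched in Python with the standard shlex module. This is only an analogy: shlex follows POSIX shell quoting rules rather than zsh's, and split_config_args is a name I made up for illustration.

```python
import shlex


def split_config_args(value):
    """Unquote, split, and unquote again, roughly like ${(Q)${(z)${(Q)x}}}."""
    # Outer pass: the whole value is one double-quoted word, so this strips
    # the double quotes and leaves the single-quoted words inside untouched.
    (inner,) = shlex.split(value)
    # Inner pass: split on whitespace, honoring and removing the single quotes.
    return shlex.split(inner)


value = "\"'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'\""
print("\n".join(split_config_args(value)))
```

The tuple unpacking on the outer pass raises an error if the value turns out not to be a single quoted word, which seems preferable to silently passing the wrong arguments along.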

Passing the sed results to the parameter substitution

There's still a little left to wonder about in our expression, sed -n '/^ac_cs_config=/s///p' config.status | IFS= read -r; args=( ${(Q)${(z)${(Q)REPLY}}} ); print -rl - $args

The IFS= read -r seems to be a common idiom in zsh scripting. It takes a line of standard input and assigns it to the variable $REPLY (-r keeps read from treating backslashes specially). IFS is the input field separator: you can split variables into words by spaces, newlines, semicolons or any other character you want. IFS= sets it to nothing, which keeps read from trimming leading and trailing whitespace. But because read is putting the whole line into a single variable here, and the input expression -- "'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'" -- has no whitespace at either end, IFS makes no difference anyway.

So you can do the same thing with this simpler expression, to assign the quoted expression to the variable $x. I'll declare it a local variable: that makes no difference when testing it in the shell, but if I call it in a function, I won't have variables like $x and $args cluttering up my shell afterward.

local x=$(sed -n '/^ac_cs_config=/s///p' config.status); local args=( ${(Q)${(z)${(Q)x}}} ); print -rl - $args

That works in the version of zsh I'm running here, 5.1.1. But I've been warned that it's safer to quote the result of $(). Without quotes, if you ever run the function in an older zsh, $x might end up being set only to the first word of the expression. So now we have:

local x="$(sed -n '/^ac_cs_config=/s///p' config.status)"; local args=( ${(Q)${(z)${(Q)x}}} ); print -rl - $args

You don't even need to use a local variable. For added brevity (making the function even more difficult to read! -- but we're way past the point of easy readability), you could say:

args=( ${(Q)${(z)${(Q)"$(sed -n '/^ac_cs_config=/s///p' config.status)"}}} ); print -rl - $args
or even
print -rl - ${(Q)${(z)${(Q)"$(sed -n '/^ac_cs_config=/s///p' config.status)"}}}
... but that final version, since it doesn't assign to a variable at all, isn't useful for the function I'm writing.

[ 13:25 Dec 04, 2015 ]

Fri, 15 May 2015

Of file modes, umasks and fmasks, and mounting FAT devices

I have a bunch of devices that use VFAT filesystems: MP3 players, camera SD cards, SD cards in my Android tablet. I mount them through /etc/fstab, and the files always look executable, so when I ls -F them, they all have asterisks after their names. I don't generally execute files on these devices; I'd prefer the files to have a mode that doesn't make them look executable.

I'd like the files to be mode 644 (or 0644 in most programming languages, since it's an octal, or base 8, number). 644 in binary is 110 100 100, or as the Unix ls command puts it, rw-r--r--.

There's a directive, fmask, that you can put in fstab entries to control the mode of files when the device is mounted. (Here's Wikipedia's long umask article.) But how do you get from the mode you want the files to be, 644, to the mask?

The mask (which corresponds to the umask command) represents the bits you don't want to have set. So, for instance, if you don't want the world-execute bit (1) set, you'd put 1 in the mask. If you don't want the world-write bit (2) set, as you likely don't, put 2 in the mask. So that's already a clue that I'm going to want the rightmost digit to be 3: I don't want files mounted from my MP3 player to be either world writable or executable.

But I also don't want to have to puzzle out the details of all nine bits every time I set an fmask. Isn't there some way I can take the mode I want the files to be -- 644 -- and turn it into the mask I'd need to put in /etc/fstab or set as a umask?

Fortunately, there is. It seemed like it ought to be straightforward, but it took a little fiddling to get it into a one-line command I can type. I made it a shell function in my .zshrc:

# What's the complement of a number, e.g. the fmask in fstab to get
# a given file mode for vfat files? Sample usage: invertmask 755
invertmask() {
    python3 -c "print('0%o' % (~(0o777 & 0o$1) & 0o777))"
}

This takes whatever argument I give to it -- $1 -- and keeps only its three rightmost octal digits (the nine permission bits), (0o777 & 0o$1). It takes the bitwise NOT of that, ~. But the result of that is a negative number, and we only want the three rightmost octal digits of the result, (result) & 0o777, expressed as an octal number -- which we can do in python by printing it as %o. Whew!

Here's a shorter, cleaner looking function that does the same thing for any mode up to 777, though it's not as clear about what it's doing:

invertmask1() {
    python3 -c "print('0%o' % (0o777 - 0o$1))"
}

So now, for my MP3 player I can put this in /etc/fstab:

UUID=0000-009E /mp3 vfat user,noauto,exec,fmask=133,shortname=lower 0 0
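To double-check the arithmetic behind that fmask=133, here is a small standalone Python 3 sketch of the same calculation. The name invert_mask is a placeholder of mine, and unlike the shell functions above it always pads the result to three octal digits.

```python
def invert_mask(mode):
    """Return the fmask/umask (as octal text) that yields the given octal mode."""
    bits = int(str(mode), 8) & 0o777   # keep only the nine permission bits
    mask = ~bits & 0o777               # flip them, then trim back to nine bits
    return format(mask, "03o")         # three octal digits, zero-padded


print(invert_mask("644"))
print(invert_mask("755"))
```

For 644 the steps are: 644 octal is 110 100 100 in binary, flipping gives 001 011 011, and that is 133 in octal, the same answer as 777 - 644.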

[ 10:27 May 15, 2015 ]

Tue, 02 Sep 2014

Using strace to find configuration file locations

I was using strace to figure out how to set up a program, lftp, and a friend commented that he didn't know how to use it and would like to learn. I don't use strace often, but when I do, it's indispensable -- and it's easy to use. So here's a little tutorial.

My problem, in this case, was that I needed to find out what configuration file I needed to modify in order to set up an alias in lftp. The lftp man page tells you how to define an alias, but doesn't tell you how to save it for future sessions; apparently you have to edit the configuration file yourself.

But where? The man page suggested a couple of possible config file locations -- ~/.lftprc and ~/.config/lftp/rc -- but neither of those existed. I wanted to use the one that already existed. I had already set up bookmarks in lftp and it remembered them, so it must have a config file already, somewhere. I wanted to find that file and use it.

So the question was, what files does lftp read when it starts up? strace lets you snoop on a program and see what it's doing.

strace shows you all system calls being used by a program. What's a system call? Well, it's anything in section 2 of the Unix manual. You can get a complete list by typing: man 2 syscalls (you may have to install developer man pages first -- on Debian that's the manpages-dev package). But the important thing is that most file access calls -- open, read, chmod, rename, unlink (that's how you remove a file), and so on -- are system calls.

You can run a program under strace directly:

$ strace lftp sitename
Interrupt it with Ctrl-C when you've seen what you need to see.

Pruning the output

And of course, you'll see tons of crap you're not interested in, like rt_sigaction(SIGTTOU) and fcntl64(0, F_GETFL). So let's get rid of that first. The easiest way is to use grep. Let's say I want to know every file that lftp opens. I can do it like this:

$ strace lftp sitename |& grep open

I have to use |& instead of just | because strace prints its output on stderr instead of stdout.

That's pretty useful, but it's still too much. I really don't care to know about lftp opening a bazillion files in /usr/share/locale/en_US/LC_MESSAGES, or libraries like /usr/lib/i386-linux-gnu/

In this case, I'm looking for config files, so I really only want to know which files it opens in my home directory. Like this:

$ strace lftp sitename |& grep 'open.*/home/akkana'

In other words, show me just the lines that have the word "open" followed later by the string "/home/akkana".

Digression: grep pipelines

Now, you might think that you could use a simpler pipeline with two greps:

$ strace lftp sitename |& grep open | grep /home/akkana

But that doesn't work -- nothing prints out. Why? Because grep buffers its output when it's writing to a pipe rather than a terminal, so when you pipe grep | grep, the second grep won't see anything until the first one has collected a full block of output. (This comes up a lot with tail -f as well.) You can avoid that with

$ strace lftp sitename |& grep --line-buffered open | grep /home/akkana
but that's too much to type, if you ask me.

Back to that strace | grep

Okay, whichever way you grep for open and your home directory, it gives:

open("/home/akkana/.local/share/lftp/bookmarks", O_RDONLY|O_LARGEFILE) = 5
open("/home/akkana/.netrc", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/home/akkana/.local/share/lftp/rl_history", O_RDONLY|O_LARGEFILE) = 5
open("/home/akkana/.inputrc", O_RDONLY|O_LARGEFILE) = 5
Now we're getting somewhere! The file where it's getting its bookmarks is ~/.local/share/lftp/bookmarks -- and I probably can't use that to set my alias.

But wait, why doesn't it show lftp trying to open those other config files?

Using script to save the output

At this point, you might be sick of running those grep pipelines over and over. Most of the time, when I run strace, instead of piping it through grep I run it under script to save the whole output.

script is one of those poorly named, ungoogleable commands, but it's incredibly useful. It runs a subshell and saves everything that appears in that subshell, both what you type and all the output, in a file.

Start script, then run lftp inside it:

$ script /tmp/lftp.strace
Script started on Tue 26 Aug 2014 12:58:30 PM MDT
$ strace lftp sitename

After the flood of output stops, I type Ctrl-D or Ctrl-C to exit lftp, then another Ctrl-D to exit the subshell script is using. Now all the strace output is in /tmp/lftp.strace and I can grep in it, view it in an editor or anything I want.

So, what files is it looking for in my home directory and why don't they show up as open attempts?

$ grep /home/akkana /tmp/lftp.strace

Ah, there it is! A bunch of lines like this:

access("/home/akkana/.lftprc", R_OK)    = -1 ENOENT (No such file or directory)
stat64("/home/akkana/.lftp", 0xbff821a0) = -1 ENOENT (No such file or directory)
mkdir("/home/akkana/.config", 0755)     = -1 EEXIST (File exists)
mkdir("/home/akkana/.config/lftp", 0755) = -1 EEXIST (File exists)
access("/home/akkana/.config/lftp/rc", R_OK) = 0

So I should have looked for access and stat as well as open. Now I have the list of files it's looking for. And, curiously, it creates ~/.config/lftp if it doesn't exist already, even though it's not going to write anything there.

So I created ~/.config/lftp/rc and put my alias there. Worked fine. And I was able to edit my bookmark in ~/.local/share/lftp/bookmarks later when I had a need for that. All thanks to strace.

[ 13:06 Sep 02, 2014 ]

Sat, 28 Dec 2013

Finding filenames in a disorganized directory

I've been scanning a bunch of records with Audacity (using as a guide Carla Schroder's excellent Book of Audacity and a Behringer UCA222 USB audio interface -- audacity doesn't seem able to record properly from the built-in sound card on any laptop I own, while it works fine with the Behringer).

Audacity's user interface isn't great for assembly-line recording of lots of tracks one after the other, especially on a laptop with a trackpad that doesn't work very well, so I wasn't always as organized with directory names as I could have been, and I ended up with a mess. I was periodically backing up the recordings to my desktop, but as I shifted from everything-in-one-directory to an organized system, the two directories got out of sync.

To get them back in sync, I needed a way to answer this question: is every file inside directory A (maybe in some subdirectory of it) also somewhere under directory B? In other words, can I safely delete all of A knowing that anything in it is safely stored in B, even though the directory structures are completely different?

I was hoping for some clever find | xargs way to do it, but came up blank. So eventually I used a little zsh loop: one find to get the list of files to test, then for each of those, another find inside the target directory, then a check of whether that second find printed anything. (find exits with status 0 whether or not it matched any files, so testing its exit code won't tell you whether the file is there.) I'm assuming that if the songname.aup file is there, the songname_data directory is too.

for fil in $(find AAA/ -name '*.aup'); do
  fil=$(basename $fil)
  # find exits 0 even with no matches, so test its output instead
  if [[ -z $(find BBB -name $fil) ]]; then
    echo $fil is not in BBB
  fi
done
Worked fine. But is there an easier way?
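One possibly easier way, if stepping outside the shell is acceptable, is to collect the basenames under each tree into sets and subtract. This is a sketch under the same assumption as above (only the .aup names are compared); basenames and missing_from are names I made up.

```python
import os


def basenames(root, suffix=".aup"):
    """Collect the names of all files ending in `suffix` anywhere under `root`."""
    return {
        name
        for _dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
        if name.endswith(suffix)
    }


def missing_from(source, target, suffix=".aup"):
    """Names present somewhere under `source` but nowhere under `target`."""
    return sorted(basenames(source, suffix) - basenames(target, suffix))


if __name__ == "__main__":
    for name in missing_from("AAA", "BBB"):
        print(name, "is not in BBB")
```

Each tree is walked only once, instead of running a fresh find over BBB for every file in AAA, and filenames containing spaces need no special handling.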

[ 10:36 Dec 28, 2013 ]

Sat, 24 Aug 2013

A nifty shell redirection trick: process substitution

I love shell pipelines, and flatter myself that I'm pretty good at them. But a discussion last week on the Linuxchix Techtalk mailing list on finding added lines in a file turned up a terrific bash/zsh shell redirection trick I'd never seen before:

join -v 2 <(sort A.txt) <(sort B.txt)

I've used backquotes, and their cognate $(), plenty. For instance, you can do things like PS1=$(hostname): or PS1=`hostname`: to set your prompt to the current hostname: the shell runs the hostname command, takes its output, and substitutes that output in place of the backquoted or parenthesized expression.

But I'd never seen that <(...) trick before, and immediately saw how useful it was. Backquotes or $() let you replace arguments to a command with a program's output -- they're great for generating short strings for programs that take all their arguments on the command line. But they're no good for programs that need to read a file, or several files. <(...) lets you take the output of a command and pass it to a program as though it was the contents of a file. And if you can do it more than once in the same command -- as in Little Girl's example -- that could be tremendously useful.

Playing with it to see if it really did what it looked like it did, and what other useful things I could do with it, I tried this (and it worked just fine):

$ diff <(echo hello; echo there) <(echo hello; echo world)
2c2
< there
---
> world
It acts as though I had two files, which each have "hello" as their first line; but one has "there" as the second line, while the other has "world". And diff shows the difference. I don't think there's any way of doing anything like that with backquotes; you'd need to use temp files.
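To see roughly what the shell is doing behind the scenes, here is a Python sketch that hands a child process a filename which is really the read end of a pipe. It assumes a Linux-style system where /dev/fd/N names an open file descriptor; read_via_fd_path is a hypothetical helper, and the child is just a tiny Python one-liner that opens the path it is given and copies it to stdout.

```python
import os
import subprocess
import sys

# Child program: open the "file" named on its command line and copy it out.
CHILD = "import sys; sys.stdout.buffer.write(open(sys.argv[1], 'rb').read())"


def read_via_fd_path(data: bytes) -> bytes:
    """Feed `data` to a child through a pipe that the child opens by filename."""
    read_fd, write_fd = os.pipe()
    path = f"/dev/fd/{read_fd}"   # the kind of name <(...) expands to
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, path],
        pass_fds=(read_fd,),       # keep the read end open in the child
        stdout=subprocess.PIPE,
    )
    os.close(read_fd)              # parent no longer needs the read end
    os.write(write_fd, data)       # small payload, fits in the pipe buffer
    os.close(write_fd)             # closing signals end-of-file to the child
    output, _ = child.communicate()
    return output
```

The child never knows it is reading a pipe, which is the whole point: it just sees a path it can open, with no temp file left behind to clean up.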

Of course, I wanted to read more about it -- how have I gone all these years without knowing about this? -- and it looks like I'm not the only one who didn't know about it. In fact, none of the pages I found on shell pipeline tricks even mentioned it.

It turns out it's called "process substitution" and I found it documented in Chapter 23 of the Advanced Bash-Scripting Guide.

I tweeted it, and a friend who is a zsh master gave me some similar cool tricks. For instance, in zsh echo hi > >(cat) > >(cat -n) lets you pipe the output of a command to more than one other command.

That's zsh, but in bash (or zsh too, of course), you can use >() and tee to do the same thing: echo hi | tee >(cat) | cat -n

If you want a temp file to be created automatically, one you can both read and write, you can use =(foo), which is a zsh-only form.

Great stuff!

[ 19:23 Aug 24, 2013 ]

Wed, 24 Jul 2013

Yet more on that comma-inserting regexp, plus a pattern to filter unprintable characters

One more brief followup on that comma inserting sed pattern and its followup:

$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'

In the second article, I'd mentioned that the hardest part of the exercise was figuring out where we needed backslashes. Devdas (f3ew) asked on Twitter whether I would still need all the backslash escapes even if I put the pattern in a file -- in other words, are the backslashes merely to get the shell to pass special characters unchanged?

A good question, and I suspected the need for some of the backslashes would disappear. So I tried this:

$ echo ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   
$ echo 20130607215015 | sed -f /tmp/commas

And it didn't work. No commas were inserted.

The problem, it turns out, is that my shell, zsh, changed both instances of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does

$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   

But that only applies to echo: zsh doesn't do the \b -> ^H substitution in the original command, where you pass the string directly as a sed argument.
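Python string literals happen to have the same trap, which makes for a quick way to see the difference between the two forms. This is just an analogy to zsh's echo versus echo -E, not anything sed or zsh does itself.

```python
cooked = "\b"   # escape interpreted: one character, ASCII backspace (0x08)
raw = r"\b"     # raw string: two characters, a backslash and the letter b

print(len(cooked), hex(ord(cooked)))
print(len(raw), raw)
```

A regex engine only sees a word-boundary marker when the backslash and the b both survive, which is what echo -E (or a raw string) guarantees.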

Okay, with that straightened out, what about Devdas' question?

Surprisingly, it turns out that all the backslashes are still needed. None of them go away when you echo > file, so they weren't there just to get special characters past the shell; and if you edit the file and try removing some of the backslashes, you'll see that the pattern no longer works. I had thought at least some of them, like the ones before the \{ \}, were extraneous, but even those are still needed.

Filtering unprintable characters

As long as I'm writing about regular expressions, I learned a nice little tidbit last week. I'm getting an increasing flood of Asian-language spams which my mail ISP doesn't filter out (they use spamassassin, which is pretty useless for this sort of filtering). I wanted a simple pattern I could pass to egrep (via procmail) that would filter out anything with a run of 4 or more unprintable characters in a row. [^[:print:]]{4,} should do it, but it wasn't working.

The problem, it turns out, is the definition of what's printable. Apparently when the default system character set is UTF-8, just about everything is considered printable! So the trick is that you need to set LC_ALL to something more restrictive, like C (which basically means ASCII), before [:print:] becomes useful for language-based filtering. (Thanks to Mikachu for spotting the problem).

So in a terminal, you can do something like

LC_ALL=C egrep -v '[^[:print:]]' filename

In procmail it was a little harder; I couldn't figure out any way to change LC_ALL from a procmail recipe; the only solution I came up with was to add this to ~/.procmailrc:

export LC_ALL=C

It does work, though, and has cut the spam load by quite a bit.
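The same test can be written without depending on the locale at all by working on raw bytes. Here is a small Python sketch that treats anything outside the range space through tilde as unprintable, which is my reading of what [:print:] means in the C locale; the names UNPRINTABLE_RUN and looks_non_ascii are placeholders.

```python
import re

# Four or more consecutive bytes outside printable ASCII (space through tilde),
# the byte-level equivalent of [^[:print:]]{4,} under LC_ALL=C.
UNPRINTABLE_RUN = re.compile(rb"[^\x20-\x7e]{4,}")


def looks_non_ascii(line: bytes) -> bool:
    """True if the line contains a run of 4+ bytes that aren't printable ASCII."""
    return UNPRINTABLE_RUN.search(line) is not None
```

Since UTF-8 encodes each CJK character as three bytes, even two such characters in a row trip the four-byte threshold, while an occasional accented letter in otherwise ASCII text does not.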

[ 19:35 Jul 24, 2013 ]

Tue, 09 Jul 2013

Sed: insert commas into numbers, but in a smarter way

A few days ago I wrote about a nifty sed script to insert commas into numbers that I dissected with the help of Dana Jansens.

Once we'd figured it out, though, Dana thought this wasn't really the best solution. For instance, what if you have a file that has some numbers in it, but also has some digits mixed up with letters? Do you really want to insert commas into every string of digits? What if you have some license plates, like abc1234? Maybe it would be better to restrict the change to digits that stand by themselves and are obviously meant to be numbers. How much harder would that be?

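More regexp fun! We kicked it around a bit, and came up with a solution. Here's the earlier \B ... \> pattern on a license-plate-like string, followed by the new pattern on the same string and on a plain number:

$ echo abc20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
abc20,130,607,215,015
$ echo abc20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
abc20130607215015
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015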

Breaking that down: \b is any word boundary -- you could also use \< to indicate that it's the start of a word, much like \> was the end of a word.

\([0-9]\+\) is any string of one or more digits, taken as a group. The \( \) part marks it as a group so we'll be able to use it later.

\([0-9]\{3\}\) is a string of exactly three digits: again, we're using \( \) to mark it as our second numbered group.

\b is another word boundary (we could use \>), to indicate that the group of three digits must come at the end of a word, with only whitespace or punctuation following it.

/\1,\2/: once we've matched the pattern -- a word break, one or more digits, three digits and another word break -- we'll replace it with this. \1 stands for the first group we matched -- that was the string of one or more digits. \2 stands for the second group, the final trio of digits. And there's a comma in between. We use the same :a; ;ta trick as in the first example to loop around until there are no more triplets to match.

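The hardest part of this was figuring out what needed to be escaped with backslashes. The one that really surprised me was the \+. Although * works in sed the same way it does in other programs, matching zero or more repetitions of the preceding pattern, sed uses \+ rather than + for one or more repetitions. That's because sed uses POSIX basic regular expressions by default, where characters like ( ) { } and + only become special when backslashed; GNU sed's -E (or -r) option switches to extended regular expressions, where they work unescaped. It took us some fiddling to find all the places we needed backslashes.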
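For anyone who finds the sed version hard to read, here is the same idea as a Python sketch. Python's re module uses the unescaped forms of ( ) { } and +, so the pattern has far fewer backslashes; add_commas and GROUP_RE are names of my own choosing.

```python
import re

# Same shape as the sed pattern: word break, one or more digits,
# exactly three digits, word break.
GROUP_RE = re.compile(r"\b([0-9]+)([0-9]{3})\b")


def add_commas(text):
    """Insert thousands separators into standalone runs of digits."""
    while True:
        # count=1 mirrors sed's s command without the g flag; the while loop
        # plays the role of the :a ... ta branch.
        text, replaced = GROUP_RE.subn(r"\1,\2", text, count=1)
        if not replaced:
            return text
```

Each pass peels one more group of three digits off the right-hand end of a number, and the loop stops as soon as a pass changes nothing.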

[ 21:16 Jul 09, 2013 ]