Shallow Thoughts

Akkana's Musings on Open Source, Science, and Nature.

Thu, 20 Dec 2007

Smart Wrapping with Greedy and Non-Greedy Regular Expressions

I had a chance to spend a day at the AGU conference last week. The American Geophysical Union is a fabulous conference -- something like 14,000 different talks over the course of the week, on anything related to earth or planetary sciences -- geology, solar system astronomy, atmospheric science, geophysics, geochemistry, you name it.

I have no idea how regular attendees manage the information overload of deciding which talks to attend. I wasn't sure how I would, either, but I started by going through the schedule for the day I'd be there, picking out a (way too long) list of potentially interesting talks, and saving them as lines in a file.

Now I had a file full of lines like:

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights Into the Cratering Process From Geophysics and Geochemistry II
Fine, except that I couldn't print out something like that -- printers stop at 80 columns. I could pass it through a program like "fold" to wrap the long lines, but then it would be hard to scan through quickly to find the talk titles and room numbers. What I really wanted was to wrap it so that the above line turned into something like:
1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics
                          and Geochemistry II
But how to do that? I stared at it for a while, trying to figure out whether there was a clever vim substitute that could handle it. I asked on a couple of IRC channels, just in case there was some amazing Linux smart-wrap utility I'd never heard of. I was on the verge of concluding that the answer was no, and that I'd have to write a python script to do the wrapping I wanted, when Mikael emitted a burst of line noise:
%s/\(.\{72\}\)\(.*\)/\1^M^I^I^I\2/

Only it wasn't line noise. Seems Mikael just happened to have been reading about some of the finer points of vim regular expressions earlier that day, and he knew exactly the trick I needed -- that .\{72\}, which matches lines that are at least 72 characters long. And amazingly, that expression did something very close to what I wanted.

Or at least the first step of it. It inserts the first line break, turning my line into

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics and Geochemistry II
but I still needed to wrap the second and subsequent lines.

But that was an easier problem -- just do essentially the same thing again, but limit it to only lines starting with a tab. After some tweaking, I arrived at exactly what I wanted:

%s/^\(.\{,65\}\) \(.*\)/\1^M^I^I^I\2/

%g/^^I^I^I.\{58\}/s/^\(.\{,55\}\) \(.*\)/\1^M^I^I^I\2/
I had to run the second line two or three times to wrap the very long lines.

Devdas helpfully translated the second one into English: "You have 3 tabs, followed by 58 characters, out of which you match the first 55 and put that bit in $1, and the capture the remaining in $2, and rewrite to $1 newline tab tab tab $2."

Here's a more detailed breakdown:

Line one:
% Do this over the whole file
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,65\}_ Look for up to 65 characters with a space at the end
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M Insert saved pattern #1 (the first 65 lines ending with a space) followed by a newline
^I^I^I\2 On the second line, insert three tabs then saved pattern #2
/ End replacement pattern

Line two:
%g/ Over the whole file, only operate on lines with this pattern
^^I^I^I Lines starting with three tabs
.\{58\}/ After the tabs, only match lines that still have at least 58 characters (this guards against wrapping already wrapped lines when it's run repeatedly)
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,55\} Up to 55 characters
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M The first pattern (up to 55 chars) is one line
^I^I^I\2 Three tabs then the second pattern
/ End replacement pattern

Greedy and non-greedy brace matches

The real key is those curly-brace expressions, \{,65\} and \{58\} -- that's how you control how many characters vim will match and whether or not the match is "greedy". Here's how they work (thanks to Mikael for explaining).

The basic expression is {M,N} -- it means between M and N matches of whatever precedes it. (Vim requires that the first brace be escaped -- \{}. Escaping the second brace is optional.) So .{M,N} can match anything between M and N characters but "prefer" N, i.e. try to match as many as possible up to N. To make it "non-greedy" (match as few as possible, "preferring" M), use .{-M,N}

You can leave out M, N, or both; M defaults to 0 while N defaults to infinity. So {} is short for {0,∞} and is equivalent to *, while {-} means {-0,∞}, like a non-greedy version of *.

Given the string: one, two, three, four, five
,.\{}, matches , two, three, four,
,.\{-}, matches , two,
,.\{5,}, matches , two, three, four,
,.\{-5,}, matches , two, three,
,.\{,2}, matches nothing
,.\{,7}, matches , two,
,.\{5,7}, matches , three,

Of course, this syntax is purely for vim; regular expressions are unfortunately different in sed, perl and every other program. Here's a fun table of regexp terms in various programs.

Tags: , ,
[ 11:44 Dec 20, 2007    More linux/editors | permalink to this entry ]