Shallow Thoughts

Akkana's Musings on Open Source, Science, and Nature.

Sun, 12 Oct 2008

More fun with regexps: Adding "[no output]" in shell logs

Someone on LinuxChix' techtalk list asked whether she could get tcsh to print "[no output]" after any command that doesn't produce output, so that when she makes logs to help her co-workers, they will seem clearer.

I don't know of a way to do that in any shell (the shell would have to capture the output of every command; emacs' shell-mode does that but I don't think any real shells do) but it seemed like it ought to be straightforward enough to do as a regular expression substitute in vi. You're looking for lines where a line beginning with a prompt is followed immediately by another line beginning with a prompt; the goal is to insert a new line consisting only of "[no output]" between the two lines.

It turned out to be pretty easy in vim. Here it is:

:%s/\(^% .*$\n\)\(% \)/\1[no results]\r\2/

Explanation:

:
starts a command
%
do the following command on every line of this file
s/
start a global substitute command
\(
start a "capture group" -- you'll see what it does soon
^
match only patterns starting at the beginning of a line
%
look for a % followed by a space (your prompt)
.*
after the prompt, match any other characters until...
$
the end of the line, after which...
\n
there should be a newline character
\)
end the capture group after the newline character
\(
start a second capture group
%
look for another prompt. In other words, this whole
expression will only match when a line starting with a prompt
is followed immediately by another line starting with a prompt.
\)
end the second capture group
/
We're finally done with the mattern to match!
Now we'll start the replacement pattern.
\1
Insert the full content of the first capture group
(this is also called a "backreference" if you want
to google for a more detailed explanation).
So insert the whole first command up to the newline
after it.
[no results]
After the newline, insert your desired string.
\r
insert a carriage return here (I thought this should be
\n for a newline, but that made vim insert a null instead)
\2
insert the second capture group (that's just the second prompt)
/
end of the substitute pattern

Of course, if you have a different prompt, substitute it for "% ". If you have a complicated prompt that includes time of day or something, you'll have to use a slightly more complicated match pattern to match it.

Tags: , , ,
[ 13:34 Oct 12, 2008    More linux/editors | permalink to this entry ]

Sun, 31 Aug 2008

Useful shell pipeline: Who checks in how much?

I wanted to get a list of who'd been contributing the most in a particular open source project. Most projects of any size have a ChangeLog file, in which check-ins have entries like this:
2008-08-26  Jane Hacker  <hacker@domain.org>

        * src/app/print.c: make sure the Portrait and Landscape
        * buttons update according to the current setting.

I wanted to take each entry, save the name of the developer checking in, then eventually count the number of times each name occurs (the number of times that developer checked in) and print them in order from most check-ins to least.

Getting the names is easy: for check-ins in the last 9 years, I just want the lines that start with "200". (Of course, if I wanted earlier check-ins I could make the match more general.)

grep "^200" ChangeLog

But now I want to trim the line so it includes only the contributor's name. A bit of sed geekery can do that: the date is a fixed format (four characters, a dash, two, dash, two, then two spaces, so "^....-..-.. " matches that pattern.

But I want to remove the email address part too (sometimes people use different email addresses when they check in). So I want a sed pattern that will match something at the front (to discard), something in the middle (keep that part) and something at the end (discard).

Here's how to do that in sed:

grep "^200" ChangeLog | sed 's/^....-..-..  \(.*\)<.*$/\1/'
In English, that says: "For each line in the ChangeLog that starts with 200, find a pattern at the beginning consisting of any four characters, a dash, two characters, dash, two characters, dash, and two spaces; then immediately after that, save all characters up to a < symbol; then throw away the < and any characters that follow until the end of the line."

That works pretty well! But it's not quite right: it includes the two spaces after the name as part of the name. In sed, \s matches any space character (like space or tab). So you'd think this should work:

grep "^200" ChangeLog | sed 's/^....-..-..  \(.*\)\s+<.*$/\1/'
\s+ means it will require that at least one and maybe more space characters immediately before the < are also discarded. But it doesn't work. It turns out the reason is that the \(.*\) expression is "greedier" than the \s+: so the saved name expression grabs the first space, leaving only the second to the \s+.

The way around that is to make the name expression specify that it can't end with a space. \S is the term for "anything that's not a space character"; so the expression becomes

grep "^200" ChangeLog | sed 's/^....-..-..  \(.*\S\)\s\+<.*$/\1/'
(the + turned out to need a backslash before it).

We have the list of names! Add a | sort on the end to sort them alphabetically -- that will make sure you get all the "Jane Hacker" lines listed together. But how to count them? The Unix program most frequently invoked after sort is uniq, which gets rid of all the repeated lines. On a hunch, I checked out the man page, man uniq, and found the -c option: "prefix lines by the number of occurrences". Perfect! Then just sort them by the number, from largest to smallest:

grep "^200" ChangeLog | sed 's/^....-..-..  \(.*\S\)\s+<.*$/\1/' | sort | uniq -c | sort -rn
And we're done!

Now, this isn't perfect since it doesn't catch "Checking in patch contributed by susan@otherhost.com" attributions -- but those aren't in a standard format in most projects, so they have to be handled by hand.

Disclaimer: Of course, number of check-ins is not a good measure of how important or productive someone is. You can check in a lot of one-line fixes, or you can write an important new module and submit it for someone else to merge in. The point here wasn't to rank developers, but just to get an idea who was checking into the tree and how often.

Well, that ... and an excuse to play with nifty Linux shell pipelines.

Tags: , , ,
[ 11:12 Aug 31, 2008    More linux | permalink to this entry ]

Thu, 20 Dec 2007

Smart Wrapping with Greedy and Non-Greedy Regular Expressions

I had a chance to spend a day at the AGU conference last week. The American Geophysical Union is a fabulous conference -- something like 14,000 different talks over the course of the week, on anything related to earth or planetary sciences -- geology, solar system astronomy, atmospheric science, geophysics, geochemistry, you name it.

I have no idea how regular attendees manage the information overload of deciding which talks to attend. I wasn't sure how I would, either, but I started by going through the schedule for the day I'd be there, picking out a (way too long) list of potentially interesting talks, and saving them as lines in a file.

Now I had a file full of lines like:

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights Into the Cratering Process From Geophysics and Geochemistry II
Fine, except that I couldn't print out something like that -- printers stop at 80 columns. I could pass it through a program like "fold" to wrap the long lines, but then it would be hard to scan through quickly to find the talk titles and room numbers. What I really wanted was to wrap it so that the above line turned into something like:
1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics
                          and Geochemistry II
But how to do that? I stared at it for a while, trying to figure out whether there was a clever vim substitute that could handle it. I asked on a couple of IRC channels, just in case there was some amazing Linux smart-wrap utility I'd never heard of. I was on the verge of concluding that the answer was no, and that I'd have to write a python script to do the wrapping I wanted, when Mikael emitted a burst of line noise:
%s/\(.\{72\}\)\(.*\)/\1^M^I^I^I\2/

Only it wasn't line noise. Seems Mikael just happened to have been reading about some of the finer points of vim regular expressions earlier that day, and he knew exactly the trick I needed -- that .\{72\}, which matches lines that are at least 72 characters long. And amazingly, that expression did something very close to what I wanted.

Or at least the first step of it. It inserts the first line break, turning my line into

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics and Geochemistry II
but I still needed to wrap the second and subsequent lines.

But that was an easier problem -- just do essentially the same thing again, but limit it to only lines starting with a tab. After some tweaking, I arrived at exactly what I wanted:

%s/^\(.\{,65\}\) \(.*\)/\1^M^I^I^I\2/

%g/^^I^I^I.\{58\}/s/^\(.\{,55\}\) \(.*\)/\1^M^I^I^I\2/
I had to run the second line two or three times to wrap the very long lines.

Devdas helpfully translated the second one into English: "You have 3 tabs, followed by 58 characters, out of which you match the first 55 and put that bit in $1, and the capture the remaining in $2, and rewrite to $1 newline tab tab tab $2."

Here's a more detailed breakdown:

Line one:
% Do this over the whole file
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,65\}_ Look for up to 65 characters with a space at the end
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M Insert saved pattern #1 (the first 65 lines ending with a space) followed by a newline
^I^I^I\2 On the second line, insert three tabs then saved pattern #2
/ End replacement pattern

Line two:
%g/ Over the whole file, only operate on lines with this pattern
^^I^I^I Lines starting with three tabs
.\{58\}/ After the tabs, only match lines that still have at least 58 characters (this guards against wrapping already wrapped lines when it's run repeatedly)
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,55\} Up to 55 characters
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M The first pattern (up to 55 chars) is one line
^I^I^I\2 Three tabs then the second pattern
/ End replacement pattern

Greedy and non-greedy brace matches

The real key is those curly-brace expressions, \{,65\} and \{58\} -- that's how you control how many characters vim will match and whether or not the match is "greedy". Here's how they work (thanks to Mikael for explaining).

The basic expression is {M,N} -- it means between M and N matches of whatever precedes it. (Vim requires that the first brace be escaped -- \{}. Escaping the second brace is optional.) So .{M,N} can match anything between M and N characters but "prefer" N, i.e. try to match as many as possible up to N. To make it "non-greedy" (match as few as possible, "preferring" M), use .{-M,N}

You can leave out M, N, or both; M defaults to 0 while N defaults to infinity. So {} is short for {0,∞} and is equivalent to *, while {-} means {-0,∞}, like a non-greedy version of *.

Given the string: one, two, three, four, five
,.\{}, matches , two, three, four,
,.\{-}, matches , two,
,.\{5,}, matches , two, three, four,
,.\{-5,}, matches , two, three,
,.\{,2}, matches nothing
,.\{,7}, matches , two,
,.\{5,7}, matches , three,

Of course, this syntax is purely for vim; regular expressions are unfortunately different in sed, perl and every other program. Here's a fun table of regexp terms in various programs.

Tags: , ,
[ 11:44 Dec 20, 2007    More linux/editors | permalink to this entry ]

Sun, 14 May 2006

Linkifying with Regular Expressions

I had a page of plaintext which included some URLs in it, like this:
Tour of the Hayward Fault
http://www.mcs.csuhayward.edu/~shirschf/tour-1.html

Technical Reports on Hayward Fault
http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm

I wanted to add links around each of the urls, so that I could make it part of a web page, more like this:

Tour of the Hayward Fault
http://www.mcs.csu hayward.edu/~shirschf/tour-1.html

Technical Reports on Hayward Fault
htt p://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm

Surely there must be a program to do this, I thought. But I couldn't find one that was part of a standard Linux distribution.

But you can do a fair job of linkifying just using a regular expression in an editor like vim or emacs, or by using sed or perl from the commandline. You just need to specify the input pattern you want to change, then how you want to change it.

Here's a recipe for linkifying with regular expressions.

Within vim:

:%s_\(https\=\|ftp\)://\S\+_<a href="&">&</a>_

If you're new to regular expressions, it might be helpful to see a detailed breakdown of why this works:

:
Tell vim you're about to type a command.
%
The following command should be applied everywhere in the file.
s_
Do a global substitute, and everything up to the next underscore will represent the pattern to match.
\(
This will be a list of several alternate patterns.
http
If you see an "http", that counts as a match.
s\=
Zero or one esses after the http will match: so http and https are okay, but httpsssss isn't.
\|
Here comes another alternate pattern that you might see instead of http or https.
ftp
URLs starting with ftp are okay too.
\)
We're done with the list of alternate patterns.
://
After the http, https or ftp there should always be a colon-slash-slash.
\S
After the ://, there must be a character which is not whitespace.
\+
There can be any number of these non-whitespace characters as long as there's at least one. Keep matching until you see a space.
_
Finally, the underscore that says this is the end of the pattern to match. Next (until the final underscore) will be the expression which will replace the pattern.
<a href="&">
An ampersand, &, in a substitute expression means "insert everything that was in the original pattern". So the whole url will be inserted between the quotation marks.
&</a>
Now, outside the <a href="..."> tag, insert the matched url again, and follow it with a </a> to close the tag.
_
The final underscore which says "this is the end of the replacement pattern". We're done!

Linkifying from the commandline using sed

Sed is a bit trickier: it doesn't understand \S for non-whitespace, nor = for "zero or one occurrence". But this expression does the trick:
sed -e 's_\(http\|https\|ftp\)://[^ \t]\+_<a href="&">&</a>_' <infile.txt >outfile.html

Addendum: George Riley tells me about VST for Vim 7, which looks like a nice package to linkify, htmlify, and various other useful things such as creating HTML presentations. I don't have Vim 7 yet, but once I do I'll definitely check out VST.

Tags: , , ,
[ 12:40 May 14, 2006    More linux/editors | permalink to this entry ]

Mon, 10 Oct 2005

How to Search Your Mozilla Cache

Ever want to look for something in your browser cache, but when you go there, it's just a mass of oddly named files and you can't figure out how to find anything?

(Sure, for whole pages you can use the History window, but what if you just want to find an image you saw this morning that isn't there any more?)

Here's a handy trick.

First, change directory to your cache directory (e.g. $HOME/.mozilla/firefox/blahblah/Cache).

Next, list the files of the type you're looking for, in the order in which they were last modified, and save that list to a file. Like this:

% file `ls -1t` | grep JPEG | sed 's/: .*//' > /tmp/foo

In English: ls -t lists in order of modification date, and -1 ensures that the files will be listed one per line. Pass that through grep for the right pattern (do a file * to see what sorts of patterns get spit out), then pass that through sed to get rid of everything but the filename. Save the result to a temporary file.

The temp file now contains the list of cache files of the type you want, ordered with the most recent first. You can now search through them to find what you want. For example, I viewed them with Pho:

pho `cat /tmp/foo`
For images, use whatever image viewer you normally use; if you're looking for text, you can use grep or whatever search you lke. Alternately, you could ls -lt `cat foo` to see what was modified when and cut down your search a bit further, or any other additional paring you need.

Of course, you don't have to use the temp file at all. I could have said simply:

pho `ls -1t` | grep JPEG | sed 's/: .*//'`
Making the temp file is merely for your convenience if you think you might need to do several types of searches before you find what you're looking for.

Tags: , , , , ,
[ 21:40 Oct 10, 2005    More tech/web | permalink to this entry ]