Useful shell pipeline: Who checks in how much?
I wanted to get a list of who'd been contributing the most in a particular open source project. Most projects of any size have a ChangeLog file, in which check-ins have entries like this:2008-08-26 Jane Hacker <hacker@domain.org> * src/app/print.c: make sure the Portrait and Landscape * buttons update according to the current setting.
I wanted to take each entry, save the name of the developer checking in, then eventually count the number of times each name occurs (the number of times that developer checked in) and print them in order from most check-ins to least.
Getting the names is easy: for check-ins in the last 9 years, I just want the lines that start with "200". (Of course, if I wanted earlier check-ins I could make the match more general.)
grep "^200" ChangeLog
But now I want to trim the line so it includes only the contributor's name. A bit of sed geekery can do that: the date is a fixed format (four characters, a dash, two, dash, two, then two spaces, so "^....-..-.. " matches that pattern.
But I want to remove the email address part too (sometimes people use different email addresses when they check in). So I want a sed pattern that will match something at the front (to discard), something in the middle (keep that part) and something at the end (discard).
Here's how to do that in sed:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)<.*$/\1/'In English, that says: "For each line in the ChangeLog that starts with 200, find a pattern at the beginning consisting of any four characters, a dash, two characters, dash, two characters, dash, and two spaces; then immediately after that, save all characters up to a < symbol; then throw away the < and any characters that follow until the end of the line."
That works pretty well! But it's not quite right: it includes the two spaces after the name as part of the name. In sed, \s matches any space character (like space or tab). So you'd think this should work:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)\s+<.*$/\1/'\s+ means it will require that at least one and maybe more space characters immediately before the < are also discarded. But it doesn't work. It turns out the reason is that the \(.*\) expression is "greedier" than the \s+: so the saved name expression grabs the first space, leaving only the second to the \s+.
The way around that is to make the name expression specify that it can't end with a space. \S is the term for "anything that's not a space character"; so the expression becomes
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\S\)\s\+<.*$/\1/'(the + turned out to need a backslash before it).
We have the list of names! Add a | sort
on the end to
sort them alphabetically -- that will make sure you get all the
"Jane Hacker" lines listed together. But how to count them?
The Unix program most frequently invoked after sort
is uniq
, which gets rid of all the repeated lines.
On a hunch, I checked out the man page, man uniq
,
and found the -c option: "prefix lines by the number of occurrences".
Perfect! Then just sort them by the number, from largest to
smallest:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\S\)\s+<.*$/\1/' | sort | uniq -c | sort -rnAnd we're done!
Now, this isn't perfect since it doesn't catch "Checking in patch contributed by susan@otherhost.com" attributions -- but those aren't in a standard format in most projects, so they have to be handled by hand.
Disclaimer: Of course, number of check-ins is not a good measure of how important or productive someone is. You can check in a lot of one-line fixes, or you can write an important new module and submit it for someone else to merge in. The point here wasn't to rank developers, but just to get an idea who was checking into the tree and how often.
Well, that ... and an excuse to play with nifty Linux shell pipelines.
[ 12:12 Aug 31, 2008 More linux | permalink to this entry | ]