Shallow Thoughts : tags : pipelines
Akkana's Musings on Open Source Computing and Technology, Science, and Nature.
Sat, 03 Sep 2011
Fairly often, I want a list of subdirectories inside a
particular directory. For instance, when posting blog entries,
I may need to decide whether an entry belongs under "linux"
or some sub-category, like "linux/cmdline" -- so I need to remind
myself what categories I have under linux.
But strangely, Linux offers no straightforward way to ask that question.
The ls command lists directories -- along with the files. There's no
way to list just the directories. You can list the directories first,
with the --group-directories-first option. Or you can flag the
directories specially: ls -F appends a slash to each directory name,
so instead of linux you'd see linux/. But you still have to pick the
directories out of a long list of files. You can do that with grep,
of course:
ls -1F ~/web/blog/linux | grep /
That's a one, not an ell: it tells ls to list files one per line.
So now you get a list of directories, one per line, with a slash
appended to each one. Not perfect, but it's a start.
Or you can use the find program, which has an option, -type d, that
lists only directories. Perfect, right?
find ~/web/blog/linux -maxdepth 1 -type d
Except that lists everything with full pathnames:
/home/akkana/web/blog/linux, /home/akkana/web/blog/linux/editors,
/home/akkana/web/blog/linux/cmdline and so forth. Way too much noise
to read quickly.
What I'd really like is to have just a list of directory names --
no slashes, no newlines. How do we get from ls or find output to that?
We can either start with find and strip off all the path information,
in a loop with basename or with a sed command; or start with ls -F,
pick only the lines with slashes, then strip off those slashes.
The latter sounds easier.
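For the record, the find version would look something like this -- just
a quick sketch, using sed to strip everything up through the last slash
(-mindepth 1 keeps find from listing the top directory itself):
find ~/web/blog/linux -mindepth 1 -maxdepth 1 -type d | sed 's_.*/__'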
So let's go back to that ls -1F ~/web/blog/linux | grep /
command. To strip off the slashes, you can use sed's s (substitute)
command. Normally the syntax is sed 's/oldpat/newpat/'. But since
slashes are the pattern we're substituting, it's better to use
something else as the separator character. I'll use an underscore.
The old pattern, the one I want to replace, is / -- but I only want to
replace the last slash on the line, so I'll add a $ after it,
representing end-of-line. The new pattern I want instead of the slash
is -- nothing.
So my sed argument is 's_/$__'
and the command becomes:
ls -1F ~/web/blog/linux | grep / | sed 's_/$__'
That does what I want. If I don't want them listed one per line, I can
fudge that using backquotes to pass the output of the whole command to
the shell's echo command:
echo `ls -1F ~/web/blog/linux | grep / | sed 's_/$__'`
If you have a lot of directories to list and you want ls's nice
columnar format, that's a little harder.
You can ls the list of directories (the names inside the backquotes),
ls `your long command`
-- except that now that you've stripped off the path information,
ls won't know where to find the files. So you'd have to change
directory first:
cd ~/web/blog/linux; ls -d `ls -1F | grep / | sed 's_/$__'`
That's not so good, though, because now you've changed directories
from wherever you were before. To get around that, use parentheses
to run the commands inside a subshell:
(cd ~/web/blog/linux; ls -d `ls -1F | grep / | sed 's_/$__'`)
Now the cd only applies within the subshell, and when the command
finishes, your own shell will still be wherever you started.
Finally, I don't want to have to go through this discovery process
every time I want a list of directories. So I turned it into a couple
of shell functions, where $* represents all the arguments I pass to
the command, and $1 is just the first argument.
lsdirs() {
    (cd $1; /bin/ls -d `/bin/ls -1F | grep / | sed 's_/$__'`)
}

lsdirs2() {
    echo `/bin/ls -1F $* | grep / | sed 's_/$__'`
}
I specify /bin/ls because I have a function overriding ls in my .zshrc.
Most people won't need to, but it doesn't hurt.
Now I can type lsdirs ~/web/blog/linux
and get a nice
list of directories.
Update, shortly after posting:
In zsh (which I use), there's yet another way: */ matches only
directories. It appends a trailing slash to them, but *(/) matches
directories and omits the trailing slash. So you can say
echo ~/web/blog/linux/*(/:t)
:t strips the directory part of each match. To see other useful :
modifiers, type ls *(: and then hit TAB.
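You could even fold that into a one-line, zsh-only variant of the
lsdirs function above (just a sketch):
lsdirs() { echo $1/*(/:t); }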
Thanks to Mikachu for the zsh tips. Zsh can do anything, if you can
just figure out how ...
Tags: cmdline, shell, pipelines, linux
[ 11:22 Sep 03, 2011 | More linux/cmdline | permalink to this entry ]
Tue, 15 Mar 2011
It's another episode of "How to use Linux to figure out CarTalk puzzlers"!
This time you don't even need any programming.
Last week's puzzler was A Seven-Letter Vacation Curiosity.
Basically, one couple hiking
in Northern California and another couple carousing in Florida
both see something described by a seven-letter word containing
all five vowels -- but the two things they saw were very different.
What's the word?
That's an easy one to solve using basic Linux command-line skills --
assuming the word is in the standard dictionary. If it's some esoteric
word, all bets are off. But let's try it and see. It's a good beginning
exercise in regular expressions and how to use the command line.
There's a handy word list in /usr/share/dict/words, one word per line.
Depending on what packages you have installed, you may have bigger
dictionaries handy, but you can usually count on /usr/share/dict/words
being there on any Linux system. Some older Unix systems may have it in
/usr/dict/words instead.
We need a way to choose all seven letter words.
That's easy. In a regular expression, . (a dot) matches one letter.
So ....... (seven dots) matches any seven letters.
(There's a more direct way to do that: the expression .\{7\}
will also match 7 letters, and is really a better way. But personally,
I find it harder both to remember and to type than the seven dots.
Still, if you ever need to match 43 characters, or 114, it's good to know the
"right" syntax.)
Fine, but if you grep ....... /usr/share/dict/words
you get a list of words with seven or more letters. See why?
It's because grep prints any line where it finds a match -- and a
word with nine letters certainly contains seven letters within it.
The pattern you need to search for is '^.......$' -- the up-caret ^
matches the beginning of a line, and the dollar sign $ matches the end.
Put single quotes around the pattern so the shell won't try to interpret
the caret or dollar sign as special characters. (When in doubt, it's
always safest to put single quotes around grep patterns.)
So now we can view all seven-letter words:
grep '^.......$' /usr/share/dict/words
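(Using the .\{7\} syntax mentioned earlier, the equivalent command would be
grep '^.\{7\}$' /usr/share/dict/words
and it gives exactly the same list.)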
How do we choose only the ones that contain all the letters a e i o and u?
That's easy enough to build up using pipelines, using the pipe
character | to pipe the output of one grep into a different grep.
grep '^.......$' /usr/share/dict/words | grep a
sends that list of 7-letter words through another grep command to
make sure you only see words containing an a.
Now tack a grep for each of the other letters on the end, the same way:
grep '^.......$' /usr/share/dict/words | grep a | grep e | grep i | grep o | grep u
Voilà! I won't spoil the puzzler, but there are two words that
match, and one of them is obviously the answer.
The power of the Unix command line to the rescue!
Tags: cmdline, regexp, linux, shell, pipelines, puzzles
[ 11:00 Mar 15, 2011 | More linux/cmdline | permalink to this entry ]
Sun, 31 Aug 2008
I wanted to get a list of who'd been contributing the most in a
particular open source project. Most projects of any size have a
ChangeLog file, in which check-ins have entries like this:
2008-08-26 Jane Hacker <hacker@domain.org>
        * src/app/print.c: make sure the Portrait and Landscape
        buttons update according to the current setting.
I wanted to take each entry, save the name of the developer checking
in, then eventually count the number of times each name occurs (the
number of times that developer checked in) and print them in order
from most check-ins to least.
Getting the names is easy: for check-ins in the last 9 years, I just
want the lines that start with "200". (Of course, if I wanted earlier
check-ins I could make the match more general.)
grep "^200" ChangeLog
But now I want to trim the line so it includes only the
contributor's name. A bit of sed geekery can do that: the date is a
fixed format (four characters, a dash, two, dash, two, then two
spaces), so "^....-..-.. " matches that pattern.
But I want to remove the email address part too
(sometimes people use different email addresses
when they check in). So I want a sed pattern that will match
something at the front (to discard), something in the middle (keep that part)
and something at the end (discard).
Here's how to do that in sed:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)<.*$/\1/'
In English, that says: "For each line in the ChangeLog that starts
with 200, find a pattern at the beginning consisting of any four
characters, a dash, two characters, a dash, two characters, and
two spaces; then immediately after that, save all characters up to
a < symbol; then throw away the < and any characters that follow
until the end of the line."
That works pretty well! But it's not quite right: it includes the
two spaces after the name as part of the name. In sed, \s matches
any space character (like space or tab).
So you'd think this should work:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)\s+<.*$/\1/'
\s+ means it will require that at least one and maybe more space
characters immediately before the < are also discarded.
But it doesn't work, for two reasons. First, in sed's regular
expressions + isn't special unless you escape it, so \s+ looks for a
space followed by a literal plus sign; you have to write \s\+. Second,
the \(.*\) expression is "greedier" than the \s\+: the saved name
expression grabs the first space, leaving only the second to the \s\+.
The way around that is to make the name expression specify that it
can't end with a space. \S is the term for "anything that's not a
space character"; so the expression becomes
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\S\)\s\+<.*$/\1/'
(note the backslash before the +).
We have the list of names! Add a | sort
on the end to
sort them alphabetically -- that will make sure you get all the
"Jane Hacker" lines listed together. But how to count them?
The Unix program most frequently invoked after sort is uniq, which
gets rid of all the repeated lines. On a hunch, I checked out the man
page, man uniq, and found the -c option: "prefix lines by the number
of occurrences".
Perfect! Then just sort them by the number, from largest to
smallest:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\S\)\s+<.*$/\1/' | sort | uniq -c | sort -rn
And we're done!
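On a hypothetical ChangeLog (names and counts made up), the output
looks something like this, busiest contributors first:
    142 Jane Hacker
     37 J. Random Developer
      9 Ann Onymous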
Now, this isn't perfect since it doesn't catch "Checking in patch
contributed by susan@otherhost.com" attributions -- but those aren't in
a standard format in most projects, so they have to be handled by hand.
Disclaimer: Of course, number of check-ins is not a good measure of
how important or productive someone is. You can check in a lot of
one-line fixes, or you can write an important new module and submit
it for someone else to merge in. The point here wasn't to rank
developers, but just to get an idea who was checking into the tree
and how often.
Well, that ... and an excuse to play with nifty Linux shell pipelines.
Tags: shell, CLI, linux, pipelines, regexp
[ 12:12 Aug 31, 2008 | More linux | permalink to this entry ]
Wed, 28 Feb 2007
I was talking about desktop backgrounds -- wallpaper -- with some
friends the other day, and it occurred to me that it might be fun
to have my system choose a random backdrop for me each morning.
Finding backgrounds is no problem: I have plenty of images
stored in ~/Backgrounds -- mostly photos I've taken over the
years, with a smattering of downloads from sites like the
APOD.
So all I needed was a way to select one file at random from the
directory.
This is Unix, so there's definitely a commandline way to do it, right?
Well, surprisingly, I couldn't find an easy way that didn't involve
any scripting. Some shells have a random number generator built in
($RANDOM in bash) but you still have to do some math on the result.
Of course, I could have googled, since I'm sure other people have
written random-wallpaper scripts ... but what's the fun in that?
If it has to be a script, I might as well write my own.
Rather than write a random wallpaper script, I wanted something that
could be more generally useful: pick one random line from standard
input and print it. Then I could pass it the output of ls -1
$HOME/Backgrounds, and at the same time I'd have a script that
I could also use for other purposes, such as choosing a random
quotation, or choosing a "flash card" question when studying for
an exam.
The obvious approach is to read all of standard input into an array,
count the lines, then pick a random number between one and $num_lines
and print that array element. It took no time to whip that up in
Python and it worked fine. But it's not very efficient -- what if
you're choosing a line from a 10Mb file?
Then Sara Falamaki (thanks, Sara!) pointed me to a
page
with a neat Perl algorithm. It's Perl so it's not easy to read,
but the algorithm is cute. You read through the input line by line,
keeping track of the line number. For each line, the chance that
this line should be the one printed at the end is the reciprocal of
the line number: in other words, there's one chance out of
$line_number that this line is the one to print.
So if there's only one line, of course you print that line;
when you get to the second line, there's one chance out of two that
you should switch; on the third, one chance out of three, and so on.
A neat idea, and it doesn't require storing the whole file in memory.
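Here's a rough sketch of the same idea as a shell function, using
bash's $RANDOM -- just an illustration, not the randomline script itself:
pickline() {
    # keep the Nth line with probability 1/N
    local n=0 chosen= line
    while IFS= read -r line; do
        n=$((n + 1))
        if [ $((RANDOM % n)) -eq 0 ]; then
            chosen=$line
        fi
    done
    printf '%s\n' "$chosen"
}
Then ls -1 $HOME/Backgrounds | pickline prints one randomly chosen filename.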
In retrospect, I should have thought of it myself: this is basically
the same algorithm I used for averaging images in GIMP for
my silly Chix Stack Mars
project, and I later described the method in the image stacking
section of my GIMP book.
To average images by stacking them, you give the bottom layer 100%
opacity, the second layer 50% opacity, the third 33% opacity, and so
on up the stack. Each layer makes an equal contribution to the final
result, so what you see is the average of all layers.
The randomline script, which you can inspect
here,
worked fine, so I hooked it up to accomplish the original
problem: setting a randomly chosen desktop background each day.
Since I use a lightweight window manager (fvwm) rather than gnome or
kde, and I start X manually rather than using gdm, I put this in my
.xinitrc:
(xsetbg -fullscreen -border black `find $HOME/Backgrounds -name "*.*" | randomline`) &
Update: I've switched to using hsetroot, which is a little more
robust than xsetbg. My new command is:
hsetroot -center `find -L $HOME/Backgrounds -name "*.*" | randomline`
So, an overlong article about a relatively trivial but nonetheless
nifty algorithm. And now I have a new desktop background each day.
Today it's something prosaic: mud cracks from Death Valley.
Who knows what I'll see tomorrow?
Update, years later:
I've written a script for the whole job,
randombg,
because on my laptop I want to choose from a different set of
backgrounds depending on whether I'm plugged in to an external monitor
or using the lower resolution laptop display.
But meanwhile, I've just been pointed to the shuf command,
which does pretty much what my randomline script did.
So you don't actually need any scripts, just
hsetroot -fill `find ~/Images/Backgrounds/1680x1050/ -name '*.jpg' | shuf -n 1`
Tags: programming, pipelines, shell
[ 14:02 Feb 28, 2007 | More programming | permalink to this entry ]
Fri, 29 Dec 2006
A friend called me for help with a sysadmin problem they were having
at work. The problem: find all files bigger than one gigabyte, print
all the filenames, add up all the sizes and print the total.
And for some reason (not explained to me) they needed to do this
all in one command line.
This is Unix, so of course it's possible somehow!
The obvious place to start is with the find command,
and man find showed how to find all the 1G+ files:
find / -size +1G
(Turns out that's a GNU find syntax, and BSD find, on OS X, doesn't
support it. I left it to my friend to check man find for the
OS X equivalent of -size +1G.)
But for a problem like this, it's pretty clear we'd need to get find
to execute a program that prints both the filename and the size.
Initially I used ls -ls, but Saz (who was helping on IRC)
pointed out that du on a file also does that, and looks a
bit cleaner. With find's unfortunate syntax, that becomes:
find / -size +1G -exec du "{}" \;
But now we needed awk, to collect and add up all the sizes
while printing just the filenames. A little googling (since I don't
use awk very often) and experimenting led to the final solution:
find / -size +1G -exec du "{}" \; | awk '{print $2; total += $1} END { print "Total is", total}'
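du prints the size first (in 1K blocks by default, at least with GNU
du) and the filename second, which is why the awk program adds up $1
and prints $2. A line of its output looks something like this (the
file is made up):
1048580  /home/someone/videos/big-vacation-movie.mov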
Ah, the joys of Unix shell pipelines!
Update: Ed Davies suggested an easier way to do the same thing.
Turns out du will handle it all by itself: du -hc `find . -size +1G`
Thanks, Ed!
Tags: linux, CLI, shell, backups, pipelines
[ 17:53 Dec 29, 2006 | More linux | permalink to this entry ]
Sun, 14 May 2006
I had a page of plaintext which included some URLs in it, like this:
Tour of the Hayward Fault
http://www.mcs.csuhayward.edu/~shirschf/tour-1.html
Technical Reports on Hayward Fault
http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm
I wanted to add links around each of the urls, so that I could make
it part of a web page, more like this:
Tour of the Hayward Fault
<a href="http://www.mcs.csuhayward.edu/~shirschf/tour-1.html">http://www.mcs.csuhayward.edu/~shirschf/tour-1.html</a>
Technical Reports on Hayward Fault
<a href="http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm">http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm</a>
Surely there must be a program to do this, I thought. But I couldn't
find one that was part of a standard Linux distribution.
But you can do a fair job of linkifying just using a regular
expression in an editor like vim or emacs, or by using sed or perl from
the commandline. You just need to specify the input pattern you want
to change, then how you want to change it.
Here's a recipe for linkifying with regular expressions.
Within vim:
:%s_\(https\=\|ftp\)://\S\+_<a href="&">&</a>_
If you're new to regular expressions, it might be helpful to see a
detailed breakdown of why this works:
- : -- Tell vim you're about to type a command.
- % -- The following command should be applied everywhere in the file.
- s_ -- Do a global substitute, and everything up to the next underscore
will represent the pattern to match.
- \( -- This will be a list of several alternate patterns.
- http -- If you see an "http", that counts as a match.
- s\= -- Zero or one esses after the http will match: so http and https
are okay, but httpsssss isn't.
- \| -- Here comes another alternate pattern that you might see instead
of http or https.
- ftp -- URLs starting with ftp are okay too.
- \) -- We're done with the list of alternate patterns.
- :// -- After the http, https or ftp there should always be a
colon-slash-slash.
- \S -- After the ://, there must be a character which is not whitespace.
- \+ -- There can be any number of these non-whitespace characters as long
as there's at least one. Keep matching until you see a space.
- _ -- Finally, the underscore that says this is the end of the pattern
to match. Next (until the final underscore) will be the expression
which will replace the pattern.
- <a href="&"> -- An ampersand, &, in a substitute expression means "insert
everything that was in the original pattern". So the whole url will
be inserted between the quotation marks.
- &</a> -- Now, outside the <a href="..."> tag, insert the matched url
again, and follow it with a </a> to close the tag.
- _ -- The final underscore which says "this is the end of the
replacement pattern". We're done!
Linkifying from the commandline using sed
Sed is a bit trickier: it doesn't understand \S for
non-whitespace, nor \= for "zero or one occurrence".
But this expression does the trick:
sed -e 's_\(http\|https\|ftp\)://[^ \t]\+_<a href="&">&</a>_' <infile.txt >outfile.html
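To see it in action on a single made-up line of input:
echo 'Details at http://example.com/faults.html for now' | sed -e 's_\(http\|https\|ftp\)://[^ \t]\+_<a href="&">&</a>_'
prints
Details at <a href="http://example.com/faults.html">http://example.com/faults.html</a> for now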
Addendum: George
Riley tells me about
VST for Vim 7,
which looks like a nice package to linkify, htmlify, and various
other useful things such as creating HTML presentations.
I don't have Vim 7 yet, but once I do I'll definitely check out VST.
Tags: linux, editors, pipelines, regexp, shell, CLI
[ 13:40 May 14, 2006 | More linux/editors | permalink to this entry ]
Mon, 10 Oct 2005
Ever want to look for something in your browser cache, but when you
go there, it's just a mass of oddly named files and you can't figure
out how to find anything?
(Sure, for whole pages you can use the History window, but what if
you just want to find an image you saw this morning
that isn't there any more?)
Here's a handy trick.
First, change directory to your cache directory (e.g.
$HOME/.mozilla/firefox/blahblah/Cache).
Next, list the files of the type you're looking for, in the order in
which they were last modified, and save that list to a file. Like this:
% file `ls -1t` | grep JPEG | sed 's/: .*//' > /tmp/foo
In English:
ls -t lists in order of modification date, and -1 ensures
that the files will be listed one per line. Pass that through
grep for the right pattern (do a file * to see what sorts of
patterns get spit out), then pass that through sed to get rid of
everything but the filename. Save the result to a temporary file.
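If you're wondering what those lines look like before the sed: for a
JPEG, file prints something like this (the cache filename is made up),
8A2B3C4Dd01: JPEG image data, JFIF standard 1.01
so the sed just chops off everything from the colon onward, leaving
the bare filename.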
The temp file now contains the list of cache files of the type you
want, ordered with the most recent first. You can now search through
them to find what you want. For example, I viewed them with Pho:
pho `cat /tmp/foo`
For images, use whatever image viewer you normally use; if you're
looking for text, you can use grep or whatever search you like.
Alternately, you could
ls -lt `cat /tmp/foo` to see what was
modified when and cut down your search a bit further, or any
other additional paring you need.
Of course, you don't have to use the temp file at all. I could
have said simply:
pho `file \`ls -1t\` | grep JPEG | sed 's/: .*//'`
Making the temp file is merely for your convenience if you think you
might need to do several types of searches before you find what
you're looking for.
Tags: tech, web, mozilla, firefox, pipelines, CLI, shell, regexp
[ 22:40 Oct 10, 2005 | More tech/web | permalink to this entry ]
Wed, 19 Jan 2005
I've been surprised by the recent explosion in Windows desktop search
tools. Why does everyone think this is such a big deal that every
internet company has to jump onto the bandwagon and produce one,
or be left behind?
I finally realized the answer this morning. These people don't have
grep! They don't have any other way of searching out patterns in
files.
I use grep dozens of times every day: for quickly looking up a phone
number in a text file, for looking in my Sent mailbox for that url I
mailed to my mom last week, for checking whether I have any saved
email regarding setting up CUPS, for figuring out where in mozilla
urlbar clicks are being handled.
Every so often, some Windows or Mac person is opining about how
difficult commandlines are and how glad they are not to have to use
them, and I ask them something like, "What if you wanted to search
back through your mail folders to find the link to the cassini probe
images -- e.g. lines that have both http:// and cassini
in them?" I always get a blank look, like it would never occur to
them that such a search would ever be possible.
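With grep and a pipe it's a one-liner -- assuming the mail is in a
plain-text mbox folder (the path here is made up):
grep 'http://' ~/Mail/sent | grep -i cassini
Every line printed contains both strings.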
Of course, expert users have ways of doing such searches (probably
using command-line add-ons such as cygwin); and Mac OS X has the
full FreeBSD commandline built in. And more recent Windows
versions (Win2k and XP) now include a way to search for content
in files (so in the Cassini example, you could search for
http:// or cassini, but probably not both at once.)
But the vast majority of Windows and Mac users have no way to do
such a search, the sort of thing that Linux commandline users
do casually dozens of times per day. Until now.
Now I see why desktop search is such a big deal.
But rather than installing web-based advertising-driven apps with a
host of potential privacy and security implications ...
wouldn't it be easier just to install grep?
Tags: tech, pipelines, CLI, shell
[ 12:45 Jan 19, 2005 | More tech | permalink to this entry ]