Shallow Thoughts : : cmdline

Akkana's Musings on Open Source Computing, Science, and Nature.

Sun, 18 Dec 2011

Convert patterns in only some lines to title case

A friend had a fun problem: she had some XML files she needed to import into GNUcash, but the program that produced them left names in all-caps and she wanted them more readable. So she'd have a file like this:

<STMTTRN>
   <TRNTYPE>DEBIT
   <DTPOSTED>20111125000000[-5:EST]
   <TRNAMT>-22.71
   <FITID>****

   <NAME>SOME    COMPANY
   <MEMO>SOME COMPANY    ANY TOWN   CA 11-25-11 330346
</STMTTRN>
and wanted to change the NAME and MEMO lines to read Some Company and Any Town. However, the tags, like <NAME>, all had to remain upper case, and presumably so did strings like DEBIT. How do you change just the NAME and MEMO lines from upper case to title case?

The obvious candidate to do string substitutes is sed. But there are several components to the problem.

Addresses

First, how do you ensure the replacement only happens on lines with NAME and MEMO?

sed lets you specify addresses for just that purpose. If you say sed 's/xxx/yyy/', sed will change the first xxx on every line to yyy (add a g flag to change every one); but if you say sed '/NAME/s/xxx/yyy/' then sed will only do that substitution on lines containing NAME.

But we need this to happen on lines that contain either NAME or MEMO. How do you do that? With \|, like this: sed '/\(NAME\|MEMO\)/s/xxx/yyy/'
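
Here's a quick way to convince yourself the address works, sketched with made-up input lines. The function name only_name_memo is just for illustration, and the \| alternation assumes GNU sed (the default on Linux).

```shell
# Substitute only on lines that mention NAME or MEMO.
# Assumes GNU sed: \| inside a basic regular expression is a GNU extension.
only_name_memo() {
  sed '/\(NAME\|MEMO\)/s/xxx/yyy/'
}

# Three made-up lines: the first two should change, the third should not.
printf '%s\n' '<NAME>xxx' '<MEMO>xxx' '<TRNTYPE>xxx' | only_name_memo
```

Only the NAME and MEMO lines come back with yyy; the TRNTYPE line passes through untouched.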

Converting to title case

Next, how do you convert upper case to lower case? GNU sed has an escape for that, which you use in the replacement part of a substitution: \L. Run sed 's/.*/\L&/' and type some upper and lower case characters, and they'll all be converted to lower case.

But here we want title case -- we want most of each word converted to lowercase, but the first letter should stay uppercase. That means we need to detect a word and figure out which is the first letter.

In the strings we're considering, a word is a set of letters A through Z with one of the following characteristics:

  1. It's preceded by a space
  2. It's preceded by a close-angle-bracket, >

So the pattern /[ >][A-Z]*/ will match anything we consider a word that might need conversion.

But we need to separate the first letter and the rest of the word, so we can treat them separately. sed's \( \) operators will let us do that. The pattern \([ >][A-Z]\) finds the first letter of a word (including the space or > preceding it), and saves that as its first matched pattern, \1. Then \([A-Z]*\) right after it will save the rest of the word as \2.

So, taking our \L case converter, we can convert to title case like this: sed 's/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'

Starting to look long and scary, right? But it's not so bad if you build it up gradually from components. I added a g on the end to tell sed this is a global replace: do the operation on every word it finds in the line, otherwise it will only make the substitution once, on the first word it sees, then quit.

Putting it together

So we know how to seek out specific lines, and how to convert to title case. Put the two together, and you get the final command:

sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'

I ran it on the test input, and it worked just fine.
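
For anyone who wants to try it without an XML file handy, here's a small sketch that feeds a few of the sample lines through the command. The wrapper name titlecase_names is made up for the example, and both \L and \| assume GNU sed.

```shell
# Title-case the words on NAME and MEMO lines, leave every other line alone.
# Assumes GNU sed (\L and \| are GNU extensions).
titlecase_names() {
  sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'
}

printf '%s\n' \
  '   <TRNTYPE>DEBIT' \
  '   <NAME>SOME    COMPANY' \
  '   <MEMO>SOME COMPANY    ANY TOWN   CA 11-25-11 330346' \
  | titlecase_names
# Prints:
#    <TRNTYPE>DEBIT
#    <NAME>Some    Company
#    <MEMO>Some Company    Any Town   Ca 11-25-11 330346
```

Note that the state code comes out as Ca: the pattern has no way to tell a two-letter abbreviation from a short word.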

For more information on sed, a good place to start is the sed regular expressions manual.

[ 13:13 Dec 18, 2011 | linux/cmdline ]

Sat, 03 Sep 2011

List only directories

Fairly often, I want a list of subdirectories inside a particular directory. For instance, when posting blog entries, I may need to decide whether an entry belongs under "linux" or some sub-category, like "linux/cmdline" -- so I need to remind myself what categories I have under linux.

But strangely, Linux offers no straightforward way to ask that question. The ls command lists directories -- along with the files. There's no way to list just the directories. You can list the directories first, with the --group-directories-first option. Or you can flag the directories specially: ls -F appends a slash to each directory name, so instead of linux you'd see linux/. But you still have to pick the directories out of a long list of files. You can do that with grep, of course:

ls -1F ~/web/blog/linux | grep /
That's a one, not an ell: it tells ls to list files one per line. So now you get a list of directories, one per line, with a slash appended to each one. Not perfect, but it's a start.

Or you can use the find program, which has an option -type d that lists only directories. Perfect, right?

find ~/web/blog/linux -maxdepth 1 -type d

Except that lists everything with full pathnames: /home/akkana/web/blog/linux, /home/akkana/web/blog/linux/editors, /home/akkana/web/blog/linux/cmdline and so forth. Way too much noise to read quickly.

What I'd really like is to have just a list of directory names -- no slashes, no newlines. How do we get from ls or find output to that? We can start with find and strip off all the path information, either in a loop with basename or with a sed command; or we can start with ls -F, pick only the lines with slashes, then strip off those slashes. The latter sounds easier.

So let's go back to that ls -1F ~/web/blog/linux | grep / command. To strip off the slashes, you can use sed's s (substitute) command. Normally the syntax is sed 's/oldpat/newpat/'. But since slashes are the pattern we're substituting, it's better to use something else as the separator character. I'll use an underscore.

The old pattern, the one I want to replace, is / -- but I only want to replace the last slash on the line, so I'll add a $ after it, representing end-of-line. The new pattern I want instead of the slash is -- nothing.

So my sed argument is 's_/$__' and the command becomes:

ls -1F ~/web/blog/linux | grep / | sed 's_/$__'

That does what I want. If I don't want them listed one per line, I can fudge that using backquotes to pass the output of the whole command to the shell's echo command:

echo `ls -1F ~/web/blog/linux | grep / | sed 's_/$__'`

If you have a lot of directories to list and you want ls's nice columnar format, that's a little harder. You can ls the list of directories (the names inside the backquotes), ls `your long command` -- except that now that you've stripped off the path information, ls won't know where to find the files. So you'd have to change directory first:

cd ~/web/blog/linux; ls -d `ls -1F | grep / | sed 's_/$__'`

That's not so good, though, because now you've changed directories from wherever you were before. To get around that, use parentheses to run the commands inside a subshell:

(cd ~/web/blog/linux; ls -d `ls -1F | grep / | sed 's_/$__'`)

Now the cd only applies within the subshell, and when the command finishes, your own shell will still be wherever you started.

Finally, I don't want to have to go through this discovery process every time I want a list of directories. So I turned it into a couple of shell functions, where "$@" represents all the arguments I pass to the command, and $1 is just the first argument.

lsdirs() {
  # Quoted so paths with spaces work; with no argument, list the current directory.
  # The && keeps ls from running in the wrong place if the cd fails.
  (cd "${1:-.}" && /bin/ls -d `/bin/ls -1F | grep / | sed 's_/$__'`)
}

lsdirs2() {
  echo `/bin/ls -1F "$@" | grep / | sed 's_/$__'`
}
I specify /bin/ls because I have a function overriding ls in my .zshrc. Most people won't need to, but it doesn't hurt.

Now I can type lsdirs ~/web/blog/linux and get a nice list of directories.
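
For comparison, the find route mentioned earlier (start with find, then strip the path with basename) might look something like the sketch below. The name lsdirs_find is made up for the example, and -mindepth is added so the starting directory itself isn't listed. Unlike the ls version, find also reports hidden directories.

```shell
# List the immediate subdirectories of a directory, names only, one per line.
# -mindepth 1 skips the starting directory itself; basename strips the path.
lsdirs_find() {
  find "${1:-.}" -mindepth 1 -maxdepth 1 -type d -exec basename {} \; | sort
}

# Try it on a throwaway directory with two subdirectories and one file.
demo=$(mktemp -d)
mkdir "$demo/cmdline" "$demo/editors"
touch "$demo/notes.txt"
lsdirs_find "$demo"
rm -rf "$demo"
```

Wrapping the call in echo with backquotes, as above, would put the names on one line.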

Update, shortly after posting: In zsh (which I use), there's yet another way: */ matches only directories. It appends a trailing slash to them, but *(/) matches directories and omits the trailing slash. So you can say

echo ~/web/blog/linux/*(/:t)
:t strips the directory part of each match. To see other useful : modifiers, type ls *(: then hit TAB.

Thanks to Mikachu for the zsh tips. Zsh can do anything, if you can just figure out how ...

[ 10:22 Sep 03, 2011 | linux/cmdline ]

Tue, 15 Mar 2011

Using grep to solve another Cartalk puzzler

It's another episode of "How to use Linux to figure out CarTalk puzzlers"! This time you don't even need any programming.

Last week's puzzler was A Seven-Letter Vacation Curiosity. Basically, one couple hiking in Northern California and another couple carousing in Florida both see something described by a seven-letter word containing all five vowels -- but the two things they saw were very different. What's the word?

That's an easy one to solve using basic Linux command-line skills -- assuming the word is in the standard dictionary. If it's some esoteric word, all bets are off. But let's try it and see. It's a good beginning exercise in regular expressions and how to use the command line.

There's a handy word list in /usr/share/dict/words, one word per line. Depending on what packages you have installed, you may have bigger dictionaries handy, but you can usually count on /usr/share/dict/words being there on any Linux system. Some older Unix systems may have it in /usr/dict/words instead.

We need a way to choose all seven letter words. That's easy. In a regular expression, . (a dot) matches any single character, letters included. So ....... (seven dots) matches any seven characters.

(There's a more direct way to do that: the expression .\{7\} will also match 7 letters, and is really a better way. But personally, I find it harder both to remember and to type than the seven dots. Still, if you ever need to match 43 characters, or 114, it's good to know the "right" syntax.)

Fine, but if you grep ....... /usr/share/dict/words you get a list of words with seven or more letters. See why? It's because grep prints any line where it finds a match -- and a word with nine letters certainly contains seven letters within it.

The pattern you need to search for is '^.......$' -- the up-caret ^ matches the beginning of a line, and the dollar sign $ matches the end. Put single quotes around the pattern so the shell won't try to interpret the caret or dollar sign as special characters. (When in doubt, it's always safest to put single quotes around grep patterns.)

So now we can view all seven-letter words: grep '^.......$' /usr/share/dict/words
How do we choose only the ones that contain all the letters a e i o and u?

That's easy enough to build up using pipelines, using the pipe character | to pipe the output of one grep into a different grep. grep '^.......$' /usr/share/dict/words | grep a sends that list of 7-letter words through another grep command to make sure you only see words containing an a.

Now tack a grep for each of the other letters on the end, the same way:
grep '^.......$' /usr/share/dict/words | grep a | grep e | grep i | grep o | grep u

Voilà! I won't spoil the puzzler, but there are two words that match, and one of them is obviously the answer.

The power of the Unix command line to the rescue!
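
If you'd like to check the pipeline without giving away the answer, here's a sketch that runs it on a tiny hand-picked word list instead of the dictionary. The function name all_vowels_7 and the four test words are made up for the example; it uses the .\{7\} form in place of seven dots.

```shell
# Keep only seven-letter words that contain every one of a, e, i, o and u.
# Reads words on standard input, one per line.
all_vowels_7() {
  grep '^.\{7\}$' | grep a | grep e | grep i | grep o | grep u
}

# education has all five vowels but nine letters; request and amnesia
# have seven letters but are missing vowels; only eulogia passes.
printf '%s\n' education request amnesia eulogia | all_vowels_7
```

To run it against the real word list, use all_vowels_7 < /usr/share/dict/words.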

[ 10:00 Mar 15, 2011 | linux/cmdline ]

Wed, 29 Sep 2010

"Who am I?" Maybe nobody!

We hit an interesting problem at work recently. A coworker made a deb package which, during installation, needed to figure out the ID of the user running it, so it could make files writable by that user. Of course, while a package is being installed it's run by root, so the trick is to find out who you were before you used sudo or su to become root.

He was using the command who am i -- reasonable, since it's been a staple since the early days of Unix. For those not familiar with the command, /usr/bin/who, if given two arguments, regardless of what those arguments are, will print information about the current logged-in user. It also offers a -m option to do the same thing. So who am i, who a b, and who -m should all print a line like:

$ who am i
akkana   pts/1        2010-09-29 09:33 (:0.0)

Except they don't. For me, they printed nothing at all -- which broke my colleague's install script.

A quick poll among friends on IRC showed that who am i worked for some people, failed for others, with no obvious logic to it.

It's the terminal

It took some digging to find out what was going on, but the difference turned out to be the terminal being used. The who program -- with or without -m -- gets its info from /var/run/utmp, a file that maintains a record of who's logged in to the system. And it turns out some terminals create a utmp entry, while others don't. So:
Program          Creates utmp entry?
gnome-terminal   yes
konsole          yes
xterm            no
xfterm4          yes
terminator       no
rxvt             no
roxterm          yes

I use xterm myself. Xterm is documented (in its man page) to modify the utmp entry, and it has a command-line flag, +ut, plus two X resources, ptyHandshake and utmpInhibit. None of the three work: setting

XTerm*ptyHandshake: true
XTerm*utmpInhibit: false
then running xterm +ut still doesn't show up in who. I guess that's a bug in xterm (or Ubuntu's version of xterm).

How do you get the real user?

Okay, so who am i clearly isn't a reliable way of getting the user ID. What can you use instead?

Several people suggested the id program. It has a -r option which supposedly prints the real UID. Unfortunately, what it really does is print:

$ id -r
id: cannot print only names or real IDs in default format
The man page doesn't offer any suggestions how to use a format other than default, so we're kinda stuck there.

Update: people keep suggesting id -ru to me. Evidently I wasn't very clear in this article: the goal is to get the real id of the login user. In other words, if you're logged in as mary and using sudo, you want mary, not root.

Alas, adding -u to id's flags gets only the effective user id: -u wins over -r. This is very easy to test: sudo id -ru prints 0, as does id -ru inside su.

But elly on Freenode had a great suggestion:

stat -c '%U' `readlink /proc/self/fd/0`
What does this do?

/proc/self is a symlink to /proc/pid, a directory where you can find out all sorts of information about a process.

One of the things you can find out about a process is open file descriptors: in particular, standard input, output and error. So /proc/self/fd/0 corresponds to standard input of the current process -- which in the example above is readlink.

What is readlink? Well, /proc/self/fd/0, in the normal case, is actually a symlink to the terminal controlling the process. readlink prints the file to which that link points -- for instance, /dev/pts/1. That's the terminal being used.

Now that we know the name of the terminal, all we need to do is find out who owns it. (This is the information who am i would have gotten from utmp, had there been a utmp entry.) ls -l /dev/pts/1 will show you that it's you, even if you run it as sudo ls -l /dev/pts/1. You could take that and strip off fields to get the username, but stat, as elly suggested, is a much better way of doing that.

Put it all together, and stat -c '%U' `readlink /proc/self/fd/0` gets standard input for the current process, follows the link to get the controlling terminal, then finds out who owns that terminal.

That's you!

A similar but slightly shorter solution suggested by Mikachu: stat -c %u `tty`
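
In an install script you'd probably want to guard against the case where standard input isn't a terminal at all. Here's a minimal sketch of that, using the tty variant with %U to get a name rather than a number. The function name login_user is made up for the example, and stat -c assumes GNU coreutils.

```shell
# Print the owner of the terminal on standard input, or fail with a message.
# tty exits non-zero when stdin is not a terminal (a pipe, a file, cron...).
login_user() {
  tty_path=$(tty) || {
    echo "login_user: standard input is not a terminal" >&2
    return 1
  }
  stat -c '%U' "$tty_path"
}

login_user || echo "(no terminal to ask about)"
```

Run from an interactive shell, even under sudo, this should print the name of the user who owns the terminal; run from a pipe or a cron job, it reports the problem instead of printing nothing.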

[ 16:39 Sep 29, 2010 | linux/cmdline ]

Fri, 18 Jun 2010

Use "date" to show time abroad

While I was in Europe, Dave stumbled on a handy alias on his Mac to check the time where I was: date -v +10 (+10 is the offset from the current time). But when he tried to translate this to Linux, he found that the -v flag from FreeBSD's date program wasn't available on the GNU date on Linux.

But I suggested he could do the same thing with the TZ environment variable. It's not documented well anywhere I could find, but if you set TZ to the name of a time zone, date will print out the time for that zone rather than your current one.

So, for bash:

$ TZ=Europe/Paris date  # time in Paris
$ TZ=GB date            # time in Great Britain
$ TZ=GMT-02 date        # time two timezones east of GMT
or for csh:
% ( setenv TZ Europe/Paris; date)
% ( setenv TZ GB; date)
% ( setenv TZ GMT-02; date)

That's all very well. But when I tried

% ( setenv TZ UK; date)
% ( setenv TZ FR; date)
they gave the wrong time, even though Wikipedia's list of time zones seemed to indicate that those abbreviations were okay.

The trick seems to be that setting TZ only works for names that exist in /usr/share/zoneinfo/, or maybe in /usr/share/zoneinfo/posix/. If you give a name that isn't there, like UK or FR or America/San_Francisco, it won't give you an error, it'll just print GMT as if that was what you had asked for.

So this trick is useful for printing times abroad -- but if you want to be safe, either stick to syntaxes like GMT-2, or make a script that checks whether your abbreviation exists in the directory before calling date, and warns you rather than just printing the wrong time.
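
A checking script along those lines could be as small as the sketch below. The function name tzdate is made up, and it assumes the zone files live in /usr/share/zoneinfo. It only knows about named zones, so offset syntaxes like GMT-02 should still go straight to date.

```shell
# Print the time in a named zone, but only if the zone file really exists.
# Assumes zone files are under /usr/share/zoneinfo; -f also rejects an
# empty name or a bare directory such as "Europe".
tzdate() {
  zone=$1
  if [ -f "/usr/share/zoneinfo/$zone" ]; then
    TZ=$zone date
  else
    echo "tzdate: no zone named '$zone' in /usr/share/zoneinfo" >&2
    return 1
  fi
}

tzdate Europe/Paris || echo "(zone data not installed here?)"
tzdate UK || echo "(refused instead of silently printing GMT)"
```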

[ 13:04 Jun 18, 2010 | linux/cmdline ]

Fri, 27 Nov 2009

Tip: Bash remembering history across sessions

Two separate friends just had this problem, one of them a fairly experienced Linux user:

You're in bash, history works, but it's not remembered across sessions. Why?

Maybe the size of the history file somehow got set to zero?

$ echo $HISTFILESIZE
500
Nope -- that's not it.

Maybe it's using the wrong file. In bash you can set $HISTFILE to point to different places; for instance, you can use that to maintain different histories per window, or per machine.

$ echo $HISTFILE
/home/username/.bash_history
Nope, that's not it either.

The problem, for both people, turned out to be really simple:

$ ls -l $HISTFILE
-rw------- 1 root root 92 2007-08-20 14:03 /home/username/.bash_history

I'm not sure how it happens, but sometimes the .bash_history file becomes owned by root, and then as a normal user you can't update your history any more.

So a simple

$ rm $HISTFILE
and you're all set -- history across sessions should start working again.
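
If this bites you more than once, the check can be wrapped up in a few lines. This is only a sketch: check_histfile is a made-up name, and it falls back to ~/.bash_history when $HISTFILE isn't set (as it won't be inside a script).

```shell
# Report whether the history file can be written by the current user.
# Pass a path, or let it fall back to $HISTFILE, then ~/.bash_history.
check_histfile() {
  histfile=${1:-${HISTFILE:-$HOME/.bash_history}}
  if [ ! -e "$histfile" ]; then
    echo "missing: $histfile"
  elif [ -w "$histfile" ]; then
    echo "writable: $histfile"
  else
    # Show the owner and permissions so the cause is obvious.
    echo "not writable: $histfile" >&2
    ls -l "$histfile" >&2
    return 1
  fi
}
```

Calling check_histfile with no arguments from an interactive bash session checks whatever file that session is actually using.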

[ 13:42 Nov 27, 2009 | linux/cmdline ]
