Shallow Thoughts : : linux

Akkana's Musings on Open Source Computing, Science, and Nature.

Sat, 28 Dec 2013

Finding filenames in a disorganized directory

I've been scanning a bunch of records with Audacity (using as a guide Carla Schroder's excellent Book of Audacity and a Behringer UCA222 USB audio interface -- audacity doesn't seem able to record properly from the built-in sound card on any laptop I own, while it works fine with the Behringer.

Audacity's user interface isn't great for assembly-line recording of lots of tracks one after the other, especially on a laptop with a trackpad that doesn't work very well, so I wasn't always as organized with directory names as I could have been, and I ended up with a mess. I was periodically backing up the recordings to my desktop, but as I shifted from everything-in-one-directory to an organized system, the two directories got out of sync.

To get them back in sync, I needed a way to answer this question: is every file inside directory A (maybe in some subdirectory of it) also somewhere under subdirectory B? In other words, can I safely delete all of A knowing that anything in it is safely stored in B, even though the directory structures are completely different?

I was hoping for some clever find | xargs way to do it, but came up blank. So eventually I used a little zsh loop: one find to get the list of files to test, then for each of those, another find inside the target directory, then test the exit code of find to see if it found the file. (I'm assuming that if the songname.aup file is there, the songname_data directory is too.)

for fil in $(find AAA/ -name '*.aup'); do
  fil=$(basename $fil)
  find BBB -name $fil >/dev/null
  if [[ $? != 0 ]]; then
    echo $fil is not in BBB
  fi
done

Worked fine. But is there an easier way?

Tags: , , ,
[ 09:36 Dec 28, 2013    More linux/cmdline | permalink to this entry | comments ]

Wed, 13 Nov 2013

Does scrolling output make a program slower? Followup.

Last week I wrote about some tests I'd made to answer the question Does scrolling output make a program slower? My test showed that when running a program that generates lots of output, like an rsync -av, the rsync process will slow way down as it waits for all that output to scroll across whatever terminal client you're using. Hiding the terminal helps a lot if it's an xterm or a Linux console, but doesn't help much with gnome-terminal.

A couple of people asked in the comments about the actual source of the slowdown. Is the original process -- the rsync, or my test script, that's actually producing all that output -- actually blocking waiting for the terminal? Or is it just that the CPU is so busy doing all that font rendering that it has no time to devote to the original program, and that's why it's so much slower?

I found pingu on IRC (thanks to JanC) and the group had a very interesting discussion, during which I ran a series of additional tests.

In the end, I'm convinced that CPU allocation to the original process is not the issue, and that output is indeed blocked waiting for the terminal to display the output. Here's why.

First, I installed a couple of performance meters and looked at the CPU load while rendering. With conky, CPU use went up equally (about 35-40%) on both CPU cores while the test was running. But that didn't tell me anything about which processes were getting all that CPU.

htop was more useful. It showed X first among CPU users, xterm second, and my test script third. However, the test script never got more than 10% of the total CPU during the test; X and xterm took up nearly all the remaining CPU.

Even with the xterm hidden, X and xterm were the top two CPU users. But this time the script, at number 3, got around 30% of the CPU rather than 10%. That still doesn't seem like it could account for the huge difference in speed (the test ran about 7 times faster with xterm hidden); but it's interesting to know that even a hidden xterm will take up that much CPU.

It was also suggested that I try running it to /dev/null, something I definitely should have thought to try before. The test took .55 seconds with its output redirected to /dev/null, and .57 seconds redirected to a file on disk (of course, the kernel would have been buffering, so there was no disk wait involved). For comparison, the test had taken 56 seconds with xterm visible and scrolling, and 8 seconds with xterm hidden.

I also spent a lot of time experimenting with sleeping for various amounts of time between printed lines. With time.sleep(.0001) and xterm visible, the test took 104.71 seconds. With xterm shaded and the same sleep, it took 98.36 seconds, only 6 seconds faster. Redirected to /dev/null but with a .0001 sleep, it took 97.44 sec.

I think this argues for the blocking theory rather than the CPU-bound one: the argument being that the sleep gives the program a chance to wait for the output rather than blocking the whole time. If you figure it's CPU bound, I'm not sure how you'd explain the result.

But a .0001 second sleep probably isn't very accurate anyway -- we were all skeptical that Linux can manage sleep times that small. So I made another set of tests, with a .001 second sleep every 10 lines of output. The results: 65.05 with xterm visible; 63.36 with xterm hidden; 57.12 to /dev/null. That's with a total of 50 seconds of sleeping included (my test prints 500000 lines). So with all that CPU still going toward font rendering, the visible-xterm case still only took 7 seconds longer than the /dev/null case. I think this argues even more strongly that the original test, without the sleep, is blocking, not CPU bound.

But then I realized what the ultimate test should be. What happens when I run the test over an ssh connection, with xterm and X running on my local machine but the actual script running on the remote machine?

The remote machine I used for the ssh tests was a little slower than the machine I used to run the other tests, but that probably doesn't make much difference to the results.

The results? 60.29 sec printing over ssh (LAN) to a visible xterm; 7.24 sec doing the same thing with xterm hidden. Fairly similar to what I'd seen before when the test, xterm and X were all running on the same machine.

Interestingly, the ssh process during the test took 7% of my CPU, almost as much as the python script was getting before, just to transfer all the output lines so xterm could display them.

So I'm convinced now that the performance bottleneck has nothing to do with the process being CPU bound and having all its CPU sucked away by rendering the output, and that the bottleneck is in the process being blocked in writing its output while waiting for the terminal to catch up.

I'd be interested it hear further comments -- are there other interpretations of the results besides mine? I'm also happy to run further tests.

Tags: , , ,
[ 16:19 Nov 13, 2013    More linux | permalink to this entry | comments ]

Fri, 08 Nov 2013

Does scrolling output make a program slower?

While watching my rsync -av messages scroll by during a big backup, I wondered, as I often have, whether that -v (verbose) flag was slowing my backup down.

In other words: when you run a program that prints lots of output, so there's so much output the terminal can't display it all in real-time -- like an rsync -v on lots of small files -- does the program wait ("block") while the terminal catches up?

And if the program does block, can you speed up your backup by hiding the terminal, either by switching to another desktop, or by iconifying or shading the terminal window so it's not visible? Is there any difference among the different ways of hiding the terminal, like switching desktops, iconifying and shading?

Since I've never seen a discussion of that, I decided to test it myself. I wrote a very simple Python program:

import time

start = time.time()

for i in xrange(500000):
    print "Now we have printed", i, "relatively long lines to stdout."

print time.time() - start, "seconds to print", i, "lines."

I ran it under various combinations of visible and invisible terminal. The results were striking. These are rounded to the nearest tenth of a second, in most cases the average of several runs:

Terminal type Seconds
xterm, visible 56.0
xterm, other desktop 8.0
xterm, shaded 8.5
xterm, iconified 8.0
Linux framebuffer, visible 179.1
Linux framebuffer, hidden 3.7
gnome-terminal, visible 56.9
gnome-terminal, other desktop 56.7
gnome-terminal, iconified 56.7
gnome-terminal, shaded 43.8

Discussion:

First, the answer to the original question is clear. If I'm displaying output in an xterm, then hiding it in any way will make a huge difference in how long the program takes to complete.

On the other hand, if you use gnome-terminal instead of xterm, hiding your terminal window won't make much difference. Gnome-terminal is nearly as fast as xterm when it's displaying; but it apparently lacks xterm's smarts about not doing that work when it's hidden. If you use gnome-terminal, you don't get much benefit out of hiding it.

I was surprised how slow the Linux console was (I'm using the framebuffer in the Debian 3.2.0-4-686-pae on Intel graphics). But it's easy to see where that time is going when you watch the output: in xterm, you see lots of blank space as xterm skips drawing lines trying to keep up with the program's output. The framebuffer doesn't do that: it prints and scrolls every line, no matter how far behind it gets.

But equally interesting is how much faster the framebuffer is when it's not visible. (I typed Ctrl-alt-F2, logged in, ran the program, then typed Ctrl-alt-F7 to go back to X while the program ran.) Obviously xterm is doing some background processing that the framebuffer console doesn't need to do. The absolute time difference, less than four seconds, is too small to worry about, but it's interesting anyway.

I would have liked to try it my test a base Linux console, with no framebuffer, but figuring out how to get a distro kernel out of framebuffer mode was a bigger project than I wanted to tackle that afternoon.

I should mention that I wasn't super-scientific about these tests. I avoided doing any heavy work on the machine while the tests were running, but I was still doing light editing (like this article), reading mail and running xchat. The times for multiple runs were quite consistent, so I don't think my light system activity affected the results much.

So there you have it. If you're running an output-intensive program like rsync -av and you care how fast it runs, use either xterm or the console, and leave it hidden most of the time.

Tags: , , ,
[ 14:17 Nov 08, 2013    More linux | permalink to this entry | comments ]

Sat, 24 Aug 2013

A nifty shell redirection trick: process substitution

I love shell pipelines, and flatter myself that I'm pretty good at them. But a discussion last week on the Linuxchix Techtalk mailing list on finding added lines in a file turned up a terrific bash/zsh shell redirection trick I'd never seen before:

join -v 2 <(sort A.txt) <(sort B.txt)

I've used backquotes, and their cognate $(), plenty. For instance, you can do things like PS1=$(hostname): or PS1=`hostname`: to set your prompt to the current hostname: the shell runs the hostname command, takes its output, and substitutes that output in place of the backquoted or parenthesized expression.

But I'd never seen that <(...) trick before, and immediately saw how useful it was. Backquotes or $() let you replace arguments to a command with a program's output -- they're great for generating short strings for programs that take all their arguments on the command line. But they're no good for programs that need to read a file, or several files. <(...) lets you take the output of a command and pass it to a program as though it was the contents of a file. And if you can do it more than once in the same command -- as in Little Girl's example -- that could be tremendously useful.

Playing with it to see if it really did what it looked like it did, and what other useful things I could do with it, I tried this (and it worked just fine):

$ diff <(echo hello; echo there) <(echo hello; echo world)
2c2
< there
---
> world
It acts as though I had two files, which each have "hello" as their first line; but one has "there" as the second line, while the other has "world". And diff shows the difference. I don't think there's any way of doing anything like that with backquotes; you'd need to use temp files.

Of course, I wanted to read more about it -- how have I gone all these years without knowing about this? -- and it looks like I'm not the only one who didn't know about it. In fact, none of the pages I found on shell pipeline tricks even mentioned it.

It turns out it's called "process substitution" and I found it documented in Chapter 23 of the Advanced Bash-Scripting Guide.

I tweeted it, and a friend who is a zsh master gave me some similar cool tricks. For instance, in zsh echo hi > >(cat) > >(cat -n) lets you pipe the output of a command to more than one other command.

That's zsh, but in bash (or zsh too, of course), you can use >() and tee to do the same thing: echo hi | tee >(cat) | cat -n

If you want a temp file to be created automatically, one you can both read and write, you can use =(foo) (zsh only?)

Great stuff! Some other pages that discuss some of these tricks:

Tags: , , ,
[ 18:23 Aug 24, 2013    More linux/cmdline | permalink to this entry | comments ]

Wed, 24 Jul 2013

Yet more on that comma-inserting regexp, plus a pattern to filter unprintable characters

One more brief followup on that comma inserting sed pattern and its followup:

$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015

In the second article, I'd mentioned that the hardest part of the exercise was figuring out where we needed backslashes. Devdas (f3ew) asked on Twitter whether I would still need all the backslash escapes even if I put the pattern in a file -- in other worse, are the backslashes merely to get the shell to pass special characters unchanged?

A good question, and I suspected the need for some of the backslashes would disappear. So I tried this:

$ echo ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   
$ echo 20130607215015 | sed -f /tmp/commas

And it didn't work. No commas were inserted.

The problem, it turns out, is that my shell, zsh, changed both instances of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does

$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   

But that only applies to echo: zsh doesn't do the \b -> ^H substitution in the original command, where you pass the string directly as a sed argument.

Okay, with that straightened out, what about Devdas' question?

Surprisingly, it turns out that all the backslashes are still needed. None of them go away when you echo > file, so they weren't there just to get special characters past the shell; and if you edit the file and try removing some of the backslashes, you'll see that the pattern no longer works. I had thought at least some of them, like the ones before the \{ \}, were extraneous, but even those are still needed.

Filtering unprintable characters

As long as I'm writing about regular expressions, I learned a nice little tidbit last week. I'm getting an increasing flood of Asian-language spams which my mail ISP doesn't filter out (they use spamassassin, which is pretty useless for this sort of filtering). I wanted a simple pattern I could pass to egrep (via procmail) that would filter out anything with a run of more than 4 unprintable characters in a row. [^[:print:]]{4,} should do it, but it wasn't working.

The problem, it turns out, is the definition of what's printable. Apparently when the default system character set is UTF-8, just about everything is considered printable! So the trick is that you need to set LC_ALL to something more restrictive, like C (which basically means ASCII) to before :print: becomes useful for language-based filtering. (Thanks to Mikachu for spotting the problem).

So in a terminal, you can do something like

LC_ALL=C egrep -v '[^[:print:]]' filename

In procmail it was a little harder; I couldn't figure out any way to change LC_ALL from a procmail recipe; the only solution I came up with was to add this to ~/.procmailrc:

export LC_ALL=C

It does work, though, and has cut the spam load by quite a bit.

Tags: , , , ,
[ 18:35 Jul 24, 2013    More linux/cmdline | permalink to this entry | comments ]

Tue, 09 Jul 2013

Sed: insert commas into numbers, but in a smarter way

A few days ago I wrote about a nifty sed script to insert commas into numbers that I dissected with the help of Dana Jansens.

Once we'd figured it out, though, Dana thought this wasn't really the best solution. For instance, what if you have a file that has some numbers in it, but also has some digits mixed up with letters? Do you really want to insert commas into every string of digits? What if you have some license plates, like abc1234? Maybe it would be better to restrict the change to digits that stand by themselves and are obviously meant to be numbers. How much harder would that be?

More regexp fun! We kicked it around a bit, and came up with a solution:

$ echo abc20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
abc20,130,607,215,015
$ echo abc20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
abc20130607215015
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'   
20,130,607,215,015

Breaking that down: \b is any word boundary -- you could also use \< to indicate that it's the start of a word, much like \> was the end of a word.

\([0-9]\+\) is any string of one or more digits, taken as a group. The \( \) part marks it as a group so we'll be able to use it later.

\([0-9]\{3\}\) is a string of exactly three digits: again, we're using \( \) to mark it as our second numbered group.

\b is another word boundary (we could use \>), to indicate that the group of three digits must come at the end of a word, with only whitespace or punctuation following it.

/\1,\2/: once we've matched the pattern -- a word break, one or more digits, three digits and another word break -- we'll replace it with this. \1 matches the first group we found -- that was the string of one or more digits. \2 matches the second group, the final trio of digits. And there's a comma in between. We use the same :a; ;ta trick as in the first example to loop around until there are no more triplets to match.

The hardest part of this was figuring out what needed to be escaped with backslashes. The one that really surprised me was the \+. Although * works in sed the same way it does in other programs, matching zero or more repetitions of the preceding pattern, sed uses \+ rather than + for one or more repetitions. It took us some fiddling to find all the places we needed backslashes.

Tags: , ,
[ 20:16 Jul 09, 2013    More linux/cmdline | permalink to this entry | comments ]

Sun, 07 Jul 2013

Inserting commas into numbers with sed

Carla Schroder's recent article, More Great Linux Awk, Sed, and Bash Tips and Tricks , had a nifty sed command I hadn't seen before to take a long number and insert commas appropriately:

sed -i ':a;s/\B[0-9]\{3\}\gt;/,&/;ta' numbers.txt
. Or, if you don't have a numbers.txt file, you can do something like
echo 20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
(I dropped the -i since that's for doing in-place edits of a file).

Nice! But why does it work? It would be easy enough to insert commas after every third number, but that doesn't work unless the number of digits is a multiple of three. In other words, you don't want 20130607215015 to become 201,306,072,150,15 (note how the last group only has two digits); it has to count in threes from the right if you want to end up with 20,130,607,215,015.

Carla's article didn't explain it, and neither did any of the other sites I found that mentioned this trick.

So, with some help from regexp wizard Dana Jansens (of OpenBox fame), I've broken it down into more easily understood bits.

Labels and loops

The first thing to understand is that this is actually several sed commands. I was familiar with sed's basic substitute command, s/from/to/. But what's the rest of it? The semicolons separate the commands, so the whole sed script is:

:a
s/\B[0-9]\{3\}\>/,&/
ta

What this does is set up a label called a. It tries to do the substitute command, and if the substitute succeeds (if something was changed), then ta tells it to loop back around to label a, the beginning of the script.

So let's look at that substitute command.

The substitute

Sed's s/from/to/ (like the equivalent command in vim and many other programs) looks for the first instance of the from pattern and replaces it with the to pattern. So we're searching for \B[0-9]\{3\}\> and replacing it with ,&/

Clear as mud, right? Well, the to pattern is easy: & matches whatever we just substituted (from), so this just sticks a comma in front of ... something.

The from pattern, \B[0-9]\{3\}\>, is a bit more challenging. Let's break down the various groups:

\B
Matches anything that is not a word boundary.
[0-9]
Matches any digit.
\{3\}
Matches three repetitions of whatever precedes it (in this case, a digit).
\>
Matches a word boundary at the end of a word. This was the hardest part to figure out, because no sed documentation anywhere bothers to mention this pattern. But Dana knew it as a vim pattern, and it turns out it does the same thing in sed even though the docs don't say so.

Okay, put them together, and the whole pattern matches any three digits that are not preceded by a word boundary but which are at the end of a word (i.e. they're followed by a word boundary).

Cool! So in our test number, 20130607215015, this matches the last three digits, 015. It doesn't match any of the other digits because they're not followed by a word end boundary.

So the substitute will insert a comma before the last three numbers. Let's test that:

$ echo 20130607215015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607215,015

Sure enough!

How the loop works

So the substitution pattern just adds the last comma. Once the comma is inserted, the ta tells sed to go back to the beginning (label :a) and do it again.

The second time, the comma that was just inserted is now a word boundary, so the pattern matches the three digits before the comma, 215, and inserts another comma before them. Let's make sure:

$ echo 20130607215,015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607,215,015

So that's how the pattern manages to match triplets from right to left.

Dana later commented that this wasn't really the best solution -- what if the string of digits is attached to other characters and isn't really a number? I'll cover that in a separate article in a few days. Update: Here's the smarter pattern, Sed: insert commas into numbers, but in a smarter way.

Tags: , ,
[ 13:14 Jul 07, 2013    More linux/cmdline | permalink to this entry | comments ]

Sat, 15 Jun 2013

Autocompleting xchat channel log filenames in zsh

Sometimes zsh is a little too smart for its own good.

Something I do surprisingly often is to complete the filenames for my local channel logs in xchat. Xchat gives its logs crazy filenames like /home/akkana/.xchat2/xchatlogs/FreeNode-#ubuntu-us-ca.log. They're hard to autocomplete -- I have to type something like: ~/.xc<tab>xc<tab>l<tab>Fr<tab>\#ub<tab>us<tab> Even with autocompletion, that's a lot of typing!

Bug zsh makes it even worse: I have to put that backslash in front of the hash, \#, or else zsh will see it either as a comment (unless I unsetopt interactivecomments, in which case I can't paste functions from my zshrc when I'm testing them); or as an extended regular expression (unless I unsetopt extendedglob). I don't want to unset either of those options: I use both of them.

Tonight I was fiddling with something else related to extendedglob, and was moved to figure out another solution to the xchat completion problem. Why not get zsh's smart zle editor to insert most of that annoying, not easily autocompletable string for me?

The easy solution was to bind it to a function key. I picked F8 for testing, and figured out its escape sequence by typing echo , then Ctrl-V, then hitting F8. It turns out to insert <ESC>[20~. So I made a binding:

bindkey -s '\e[20~' '~/.xchat2/xchatlogs/ \\\#^B^B^B'

When I press F8, that inserts the following string:

~/.xchat2/xchatlogs/ \#
                    ↑ (cursor ends up here)
... moving the cursor back three characters, so it's right before the space. The space is there so I can autocomplete the server name by typing something like Fr<TAB> for FreeNode. Then I delete the space (Ctrl-D), go to the end of the line (Ctrl-E), and start typing my channel name, like ubu<TAB>us<TAB>. I don't have to worry about typing the rest of the path, or the escaped hash sign.

That's pretty cool. But I wished I could bind it to a character sequence, like maybe .xc, rather than using a function key. (I could use my Crikey program to do that at the X level, but that's cheating; I wanted to do it within zsh.) You can't just use bindkey -s '.xch' '~/.xchat2/xchatlogs/ \\\#^B^B^B' because it's recursive: as soon as zsh inserts the ~/.xc part, that expands too, and you end up with ~/~/.xchat2/xchatlogs/hat2/xchatlogs/ \# \#.

The solution, though it's a lot more lines, is to use the special variables LBUFFER and RBUFFER. LBUFFER is everything left of the cursor position, and RBUFFER everything right of it. So I define a function to set those, then set a zle "widget" to that function, then finally bindkey to that widget:

function autoxchat()
{
    LBUFFER+="~/.xchat2/xchatlogs/"
    RBUFFER=" \\#$RBUFFER"
}
zle -N autoxchat
bindkey ".xc" autoxchat

Pretty cool! The only down side: now that I've gone this far in zle bindings, I'm probably an addict and will waste a lot more time tweaking them.

Tags: , ,
[ 20:31 Jun 15, 2013    More linux/cmdline | permalink to this entry | comments ]

Syndicated on:
LinuxChix Live
Ubuntu Women
Women in Free Software
Graphics Planet
DevChix
Ubuntu California
Planet Openbox
Devchix
Planet LCA2009

Friends' Blogs:
Morris "Mojo" Jones
Jane Houston Jones
Dan Heller
Long Live the Village Green
Ups & Downs
DailyBBG

Other Blogs of Interest:
DevChix
Scott Adams
Dave Barry
BoingBoing

Powered by PyBlosxom.