Last week I wrote about some tests I'd made to answer the question Does scrolling output make a program slower? My test showed that when running a program that generates lots of output, like an rsync -av, the rsync process will slow way down as it waits for all that output to scroll across whatever terminal client you're using. Hiding the terminal helps a lot if it's an xterm or a Linux console, but doesn't help much with gnome-terminal.
A couple of people asked in the comments about the actual source of the slowdown. Is the original process -- the rsync, or my test script, that's actually producing all that output -- actually blocking waiting for the terminal? Or is it just that the CPU is so busy doing all that font rendering that it has no time to devote to the original program, and that's why it's so much slower?
I found pingu on IRC (thanks to JanC) and the group had a very interesting discussion, during which I ran a series of additional tests.
In the end, I'm convinced that CPU allocation to the original process is not the issue, and that output is indeed blocked waiting for the terminal to display the output. Here's why.
First, I installed a couple of performance meters and looked at the CPU load while rendering. With conky, CPU use went up equally (about 35-40%) on both CPU cores while the test was running. But that didn't tell me anything about which processes were getting all that CPU.
htop was more useful. It showed X first among CPU users, xterm second, and my test script third. However, the test script never got more than 10% of the total CPU during the test; X and xterm took up nearly all the remaining CPU.
Even with the xterm hidden, X and xterm were the top two CPU users. But this time the script, at number 3, got around 30% of the CPU rather than 10%. That still doesn't seem like it could account for the huge difference in speed (the test ran about 7 times faster with xterm hidden); but it's interesting to know that even a hidden xterm will take up that much CPU.
It was also suggested that I try running it to /dev/null, something I definitely should have thought to try before. The test took .55 seconds with its output redirected to /dev/null, and .57 seconds redirected to a file on disk (of course, the kernel would have been buffering, so there was no disk wait involved). For comparison, the test had taken 56 seconds with xterm visible and scrolling, and 8 seconds with xterm hidden.
I also spent a lot of time experimenting with sleeping for various amounts of time between printed lines. With time.sleep(.0001) and xterm visible, the test took 104.71 seconds. With xterm shaded and the same sleep, it took 98.36 seconds, only 6 seconds faster. Redirected to /dev/null but with a .0001 sleep, it took 97.44 sec.
I think this argues for the blocking theory rather than the CPU-bound one: the argument being that the sleep gives the program a chance to wait for the output rather than blocking the whole time. If you figure it's CPU bound, I'm not sure how you'd explain the result.
But a .0001 second sleep probably isn't very accurate anyway -- we were all skeptical that Linux can manage sleep times that small. So I made another set of tests, with a .001 second sleep every 10 lines of output. The results: 65.05 with xterm visible; 63.36 with xterm hidden; 57.12 to /dev/null. That's with a total of 50 seconds of sleeping included (my test prints 500000 lines). So with all that CPU still going toward font rendering, the visible-xterm case still only took 7 seconds longer than the /dev/null case. I think this argues even more strongly that the original test, without the sleep, is blocking, not CPU bound.
But then I realized what the ultimate test should be. What happens when I run the test over an ssh connection, with xterm and X running on my local machine but the actual script running on the remote machine?
The remote machine I used for the ssh tests was a little slower than the machine I used to run the other tests, but that probably doesn't make much difference to the results.
The results? 60.29 sec printing over ssh (LAN) to a visible xterm; 7.24 sec doing the same thing with xterm hidden. Fairly similar to what I'd seen before when the test, xterm and X were all running on the same machine.
Interestingly, the ssh process during the test took 7% of my CPU, almost as much as the python script was getting before, just to transfer all the output lines so xterm could display them.
So I'm convinced now that the performance bottleneck has nothing to do with the process being CPU bound and having all its CPU sucked away by rendering the output, and that the bottleneck is in the process being blocked in writing its output while waiting for the terminal to catch up.
I'd be interested it hear further comments -- are there other interpretations of the results besides mine? I'm also happy to run further tests.
[ 17:19 Nov 13, 2013 More linux | permalink to this entry | ]