Blog Post Length: Fun with Shell Pipelines, GNUplot and Matplotlib
Sometimes I tend to ramble on, and wonder if articles I'm writing are really too long for a blog post. I try to keep them under about 200 lines, but sometimes a really meaty topic demands more. It occurred to me to wonder how long a typical Shallow Thoughts post is.
A quick measure is line count, which I can get this way, starting in the directory where I keep the source files for all my past posts:
find . -name '*.blx' -exec wc -l '{}' \; | sort -h >/tmp/bloglen.dat
The find produces lines like:
79 ./linux/cmdline/random-command.blx
so if I sort -h (human-readable numbers), it will sort on the first column and give me a sorted list of all posts in order of size. The shortest posts, three of them, were only five lines; the longest was 346 lines.
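For anyone who would rather check the extremes in code than eyeball the sorted file, here is a minimal Python sketch. It assumes each line looks like the wc -l output above (a count, whitespace, then a path); the function name parse_wc_lines and the two "example" paths are made up for illustration.

```python
def parse_wc_lines(lines):
    """Turn 'COUNT PATH' lines (as printed by wc -l) into (count, path) tuples."""
    posts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so paths may contain spaces
        count, path = line.split(None, 1)
        posts.append((int(count), path))
    return posts


if __name__ == "__main__":
    # Hypothetical sample; the real input would be the lines of /tmp/bloglen.dat
    sample = [
        "79 ./linux/cmdline/random-command.blx",
        "5 ./example/short-post.blx",
        "346 ./example/long-post.blx",
    ]
    posts = sorted(parse_wc_lines(sample))  # tuples sort by count first
    print("shortest:", posts[0])
    print("longest:", posts[-1])
```

Sorting the tuples compares the integer counts first, which is the same ordering the sort step in the pipeline produces.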
But what's the distribution of lengths?
I can plot the sorted data easily with gnuplot:
gnuplot -p -e 'plot "/tmp/bloglen.dat"'
or, if I didn't want the temp file, I could have done that all with one command:
find . -name '*.blx' -exec wc -l '{}' \; | sort -h | gnuplot -p -e 'plot "/dev/stdin"'
That's kind of interesting. But I was really more interested in seeing a frequency distribution: do I have a lot more shorter posts, or longer ones? For that I do need the temp file.
I wasted some time trying to find a way in gnuplot to plot a frequency distribution. The best I found was
set style fill solid
plot '/tmp/bloglen.dat' u ($1):(1) t 'data' smooth frequency w boxes
pause mouse close
(put that in a file and then run gnuplot on that file).
But it's not actually right: the bar graph shows 1 for lots of blog post lengths that aren't represented in the data.
I finally gave up on gnuplot, having wasted enough time that I could easily have written a Python script instead. So I wrote one, which only took a few minutes:
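import matplotlib.pyplot as plt

# Read the line count (first column) from each line of the wc output
posts = []
with open('/tmp/bloglen.dat') as fp:
    for line in fp:
        posts.append(int(line.split()[0]))

# Use as many bins as the longest post has lines
plt.hist(posts, bins=max(posts))
plt.show()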
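One detail worth knowing: given an integer, bins= splits the range from the shortest to the longest post into that many equal-width bins, so the bin edges don't land exactly on whole line counts. If that matters, one option is to build the edges explicitly. This is only a sketch, not what the script above does; the helper names integer_bin_edges and length_counts are my own, and the sample lengths are made up.

```python
from collections import Counter


def integer_bin_edges(values):
    """Edges at half-integers so every whole line count gets its own bin."""
    lo, hi = min(values), max(values)
    # One edge just below lo, one just above hi, spaced exactly 1 apart
    return [n - 0.5 for n in range(lo, hi + 2)]


def length_counts(values):
    """How many posts have each exact length; absent lengths count as 0."""
    return Counter(values)


if __name__ == "__main__":
    posts = [5, 5, 5, 48, 50, 50, 79, 346]  # made-up lengths, not real data
    edges = integer_bin_edges(posts)
    print(len(edges) - 1, "bins from", edges[0], "to", edges[-1])
    print(length_counts(posts).most_common(2))
```

Passing bins=integer_bin_edges(posts) to plt.hist would then give one bar per exact line count. The Counter shows the same frequencies as plain numbers, with zero for any length that never occurs.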
Turns out I'm doing pretty well at keeping them under 200 lines. The vast majority of posts are fairly short, with a peak around 50 lines, and relatively few exceed 200. Only a couple of outliers get over 300.
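To put numbers on what the histogram shows, something like the following Python sketch would do. The function names, the 10-line bin width, and the sample data are all assumptions for illustration; the 200 and 300 thresholds are just the ones mentioned above.

```python
def summarize(lengths, limit=200, outlier=300):
    """Return (fraction of posts at or under limit, number of posts over outlier)."""
    under = sum(1 for n in lengths if n <= limit)
    over = sum(1 for n in lengths if n > outlier)
    return under / len(lengths), over


def peak_bin(lengths, width=10):
    """Start of the width-line bin that contains the most posts."""
    counts = {}
    for n in lengths:
        start = (n // width) * width  # e.g. 53 falls in the bin starting at 50
        counts[start] = counts.get(start, 0) + 1
    return max(counts, key=counts.get)


if __name__ == "__main__":
    # Made-up lengths; real ones would come from the first column of /tmp/bloglen.dat
    lengths = [5, 48, 50, 52, 55, 79, 120, 210, 346]
    fraction, outliers = summarize(lengths)
    print(f"{fraction:.0%} of posts are at most 200 lines")
    print(outliers, "post(s) over 300 lines")
    print("busiest bin starts at", peak_bin(lengths), "lines")
```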
I think I'm okay with that. Whether you, the readers, agree -- well, feel free to tell me!
For comparison, this post is 95 lines.
[ 21:28 Nov 15, 2019 ]