Blog Post Length: Fun with Shell Pipelines, GNUplot and Matplotlib (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Fri, 15 Nov 2019

Blog Post Length: Fun with Shell Pipelines, GNUplot and Matplotlib

Sometimes I tend to ramble on, and wonder if articles I'm writing are really too long for a blog post. I try to keep them under about 200 lines, but sometimes a really meaty topic demands more. It occurred to me to wonder how long a typical Shallow Thoughts post is.

A quick measure is line count, which I can get this way, starting in the directory where I keep the source files for all my past posts:

find . -name '*.blx' -exec wc -l '{}' \; | sort -h >/tmp/bloglen.dat

The find produces lines like:

79 ./linux/cmdline/random-command.blx

so if I sort -h (human-readable numbers), it will sort on the first column and give me a sorted list of all posts in order of size. The shortest posts, three of them, were only five lines; the longest was 346 lines.
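
The same measurement can be sketched in Python, for anyone who would rather skip the shell pipeline. This is only a rough equivalent, not what was used for this post: the function name post_lengths is made up for illustration, and it counts newline characters the way wc -l does, returning (count, path) pairs sorted shortest first.

```python
from pathlib import Path


def post_lengths(root, pattern="*.blx"):
    """Return (line_count, path) pairs for matching files under root.

    The list is sorted by line count, shortest first; ties are
    broken by path. post_lengths is a hypothetical helper name.
    """
    lengths = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        # Count newline bytes, which is what `wc -l` reports.
        with open(path, "rb") as fp:
            count = fp.read().count(b"\n")
        lengths.append((count, str(path)))
    lengths.sort()
    return lengths
```

Calling post_lengths('.') from the posts directory and printing each pair should give roughly the same list as the find and sort pipeline.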

But what's the distribution of lengths?

[Figure: Length of all blog posts, sorted]

I can plot the sorted data easily with gnuplot:

gnuplot -p -e 'plot "/tmp/bloglen.dat"'

or, if I didn't want the temp file, I could have done that all with one command:

find . -name '*.blx' -exec wc -l '{}' \; | sort -h | gnuplot -p -e 'plot "/dev/stdin"'

That's kind of interesting. But I was really more interested in seeing a frequency distribution: do I have a lot more shorter posts, or longer ones? For that I do need the temp file.

I wasted some time trying to find a way in gnuplot to plot a frequency distribution. The best I found was

set style fill solid
plot '/tmp/bloglen.dat' u ($1):(1) t 'data' smooth frequency w boxes
pause mouse close

(put that in a file and then run gnuplot on that file).

But it's not actually right: the bar graph shows 1 for lots of blog post lengths that aren't represented in the data.
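
One way to check which lengths really occur is to count them exactly. A minimal sketch, assuming the same two-column format as /tmp/bloglen.dat; the function name length_frequencies is hypothetical, and so are the sample lines other than the random-command.blx one:

```python
from collections import Counter


def length_frequencies(lines):
    """Map each post length to the number of posts with that length."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[int(fields[0])] += 1
    return counts


# Stand-in data in the "count path" format that wc -l produces.
sample = [
    "5 ./a.blx",  # hypothetical file
    "5 ./b.blx",  # hypothetical file
    "79 ./linux/cmdline/random-command.blx",
]
freq = length_frequencies(sample)
for length in sorted(freq):
    print(length, freq[length])
```

Only lengths that actually occur end up as keys, so a length with no posts never gets a bar of its own; looking one up in the Counter just returns 0.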

I finally gave up on gnuplot, having wasted enough time that I could easily have written a Python script instead. So I did, and it only took a few minutes:

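import matplotlib.pyplot as plt

# Read the line count (first column) from each line of the wc output.
posts = []
with open('/tmp/bloglen.dat') as fp:
    for line in fp:
        posts.append(int(line.split()[0]))

# Use as many bins as the longest post has lines,
# so each bin is roughly one line wide.
plt.hist(posts, bins=max(posts))

plt.show()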

[Figure: Length of all blog posts, frequency distribution]

Turns out I'm doing pretty well at keeping them under 200 lines. The vast majority of posts are fairly short, with a peak around 50 lines, and relatively few exceed 200. Only a couple of outliers get over 300.
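
To put numbers on that impression rather than reading it off the plot, something like the following could work. It is a sketch only: summarize is a hypothetical helper, and the 200-line limit is just the informal target mentioned at the top of the post.

```python
import statistics


def summarize(posts, limit=200):
    """Return (median length, fraction of posts shorter than limit)."""
    if not posts:
        raise ValueError("no post lengths given")
    under = sum(1 for n in posts if n < limit)
    return statistics.median(posts), under / len(posts)
```

Passing it the posts list built by the script above would return the median post length and the share of posts that stay under the limit.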

I think I'm okay with that. Whether you, the readers, agree -- well, feel free to tell me!

For comparison, this post is 95 lines.

[ 21:28 Nov 15, 2019 ]