Shallow Thoughts : tags : regexp
Akkana's Musings on Open Source Computing and Technology, Science, and Nature.
Fri, 04 Dec 2015
I wrote recently about a zsh shell function to
run
make distclean on a source tree even if something in autoconf
is messed up. In order to save any arguments you've previously
passed to configure or autogen.sh, my function parsed the arguments
from a file called config.log.
But it might be a bit more reliable to use config.status --
I'm guessing this is the file that make
uses when it finds it needs to re-run autogen.sh.
However, the syntax in that file is more complicated,
and parsing it taught me some useful zsh tricks.
I can see the relevant line from config.status like this:
$ grep '^ac_cs_config' config.status
ac_cs_config="'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'"
--enable-foo --disable-bar are options I added
purely for testing. I wanted to make sure my shell function would
work with multiple arguments.
Ultimately, I want my shell function to call
autogen.sh --prefix=/usr/local/gimp-git --enable-foo --disable-bar
The goal is to end up with $args being a zsh array containing those
three arguments. So I'll need to edit out those quotes and split the
line into an array.
Sed tricks
The first thing to do is to get rid of that initial ac_cs_config=
in the line from config.status. That's easy with sed:
$ grep '^ac_cs_config' config.status | sed -e 's/ac_cs_config=//'
"'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'"
But since we're using sed anyway, there's no need to use grep to
get the line: we can do it all with sed.
First try:
sed -n '/^ac_cs_config/s/ac_cs_config=//p' config.status
Search for the line that starts with ac_cs_config (^ matches
the beginning of a line);
then replace ac_cs_config= with nothing, and p
print the resulting line.
-n tells sed not to print anything except when told to with a p.
But it turns out that if you give a sed substitution a blank pattern,
it uses the last pattern it was given. So a more compact version,
using the search pattern ^ac_cs_config, is:
sed -n '/^ac_cs_config=/s///p' config.status
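To see the empty-pattern rule in isolation, here's a tiny example (any POSIX sed):

```shell
# s// with an empty pattern reuses the last pattern given -- here, /foo/:
printf 'foo bar\nbaz\n' | sed -n '/foo/s//FOO/p'
# -> FOO bar
```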
But there's also another way of doing it:
sed '/^ac_cs_config=/!d;s///' config.status
! after a search pattern matches every line that doesn't match
the pattern. d deletes those lines. Then for lines that weren't
deleted (the one line that does match), do the substitution.
Since there's no -n, sed will print all lines that weren't deleted.
I find that version more difficult to read. But I'm including it
because it's useful to know how to chain several commands in sed,
and how to use ! to search for lines that don't match a pattern.
You can also use sed to eliminate the double quotes:
sed '/^ac_cs_config=/!d;s///;s/"//g' config.status
'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'
But it turns out that zsh has a better way of doing that.
Zsh parameter substitution
I'm still relatively new to zsh, but I got some great advice on #zsh.
The first suggestion:
sed -n '/^ac_cs_config=/s///p' config.status | IFS= read -r; args=( ${(Q)${(z)${(Q)REPLY}}} ); print -rl - $args
I'll be using a final print -rl - $args
for all these examples:
it prints an array variable with one member per line.
For the actual distclean function, of course, I'll be passing
the variable to autogen.sh, not printing it out.
First, let's look at the heart of that expression:
args=( ${(Q)${(z)${(Q)REPLY}}} )
The heart of this is the expression ${(Q)${(z)${(Q)x}}}.
The zsh parameter substitution syntax is a bit arcane, but each of
the parenthesized letters does some operation on the variable that follows.
The first (Q)
strips off a level of quoting.
So:
$ x='"Hello world"'; print $x; print ${(Q)x}
"Hello world"
Hello world
(z)
splits an expression and stores it in an array.
But to see that, we have to use print -l
, so array members
will be printed on separate lines.
$ x="a b c"; print -l $x; print "....."; print -l ${(z)x}
a b c
.....
a
b
c
Zsh is smart about quotes, so if you have quoted expressions it will
group them correctly when assigning array members:
$ x="'a a' 'b b' 'c c'"; print -l $x; print "....."; print -l ${(z)x}
'a a' 'b b' 'c c'
.....
'a a'
'b b'
'c c'
So let's break down the larger expression: this is best read
from right to left, inner expressions to outer.
${(Q) ${(z) ${(Q) x }}}
| | | \
| | | The original expression,
| | | "'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'"
| | \
| | Strip off the double quotes:
| | '--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'
| \
| Split into an array of three items
\
Strip the single quotes from each array member,
( --prefix=/usr/local/gimp-git --enable-foo --disable-bar )
Neat!
For more on zsh parameter substitutions, see the
Zsh
Guide, Chapter 5: Substitutions.
Passing the sed results to the parameter substitution
There's still a little left to wonder about in our expression,
sed -n '/^ac_cs_config=/s///p' config.status | IFS= read -r; args=( ${(Q)${(z)${(Q)REPLY}}} ); print -rl - $args
The IFS= read -r
seems to be a common idiom in zsh scripting.
It takes standard input and assigns it to the variable $REPLY. IFS is
the input field separator: you can split variables into words by
spaces, newlines, semicolons or any other character you
want. IFS= sets it to nothing. But because the input expression --
"'--prefix=/usr/local/gimp-git' '--enable-foo' '--disable-bar'" --
has quotes around it, IFS is ignored anyway.
So you can do the same thing with this simpler expression, to
assign the quoted expression to the variable $x.
I'll declare it a local variable: that makes no difference
when testing it in the shell, but if I call it in a function, I won't
have variables like $x and $args cluttering up my shell afterward.
local x=$(sed -n '/^ac_cs_config=/s///p' config.status); local args=( ${(Q)${(z)${(Q)x}}} ); print -rl - $args
That works in the version of zsh I'm running here, 5.1.1. But I've
been warned that it's safer to quote the result of $(). Without
quotes, if you ever run the function in an older zsh, $x might end up
being set only to the first word of the expression. Second, it's a
good idea to put "local" in front of the variable; that way, $x won't
end up being set once you've returned from the function. So now we have:
local x="$(sed -n '/^ac_cs_config=/s///p' config.status)"; local args=( ${(Q)${(z)${(Q)x}}} ); print -rl - $args
You don't even need to use a local variable. For added brevity (making
the function even more difficult to read! -- but we're way past the
point of easy readability), you could say:
args=( ${(Q)${(z)${(Q)"$(sed -n '/^ac_cs_config=/s///p' config.status)"}}} ); print -rl - $args
or even
print -rl - ${(Q)${(z)${(Q)"$(sed -n '/^ac_cs_config=/s///p' config.status)"}}}
... but that final version, since it doesn't assign to a variable at all,
isn't useful for the function I'm writing.
Tags: zsh, shell, regexp, gimp, programming
[ 13:25 Dec 04, 2015 | More linux/cmdline | permalink to this entry ]
Tue, 20 Aug 2013
A few days ago I wrote about
making waypoint files for a list of house addresses in OsmAnd.
For waypoint files, you need latitude/longitude coordinates, and I was
getting those from a web page that used the online Google Maps API to
convert
an address into latitude and longitude coordinates.
It was pretty cool at first, but pasting every address into the
latitude/longitude web page and then pasting the resulting coordinates
into the address file got old, fast.
That's exactly the sort of repetitive task that computers are supposed
to handle for us.
The lat/lon page used Javascript and the Google Maps API,
and I already had a Google Maps API key (they have all sorts of fun
APIs for map geeks) ... but I really wanted something
that could run locally, reading and converting a local file.
And then I discovered the
Python googlemaps
package. Exactly what I needed! It's in the Python Package Index,
so I installed it with pip install googlemaps
.
That enabled me to change my
waymaker
Python script: if the first line of a
description wasn't a latitude and longitude, it would instead look for
something that might be an address.
Addresses in my data files might be one line or might be two,
but since they're all US addresses, I know they'll end with a
two-capital-letter state abbreviation and a 5-digit zip code:
2948 W Main St. Anytown, NM 12345.
You can find that with a regular expression:
match = re.search('.*[A-Z]{2}\s+\d{5}$', line)
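A quick check of that pattern against the sample address (a sketch in modern Python 3, using a raw string; the article's code is Python 2):

```python
import re

# The article's state-plus-zip pattern, written as a raw string:
addr_re = re.compile(r'.*[A-Z]{2}\s+\d{5}$')

print(bool(addr_re.search('2948 W Main St. Anytown, NM 12345')))  # True
print(bool(addr_re.search('35.8889 -106.3061')))                  # False
```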
But first I needed to check whether the first line of the entry was already
latitude/longitude coordinates, since I'd already converted some of
my files. That uses another regular expression. Python doesn't seem
to have a built-in way to search for generic numeric expressions
(containing digits, decimal points or +/- symbols) so I made one,
since I had to use it twice if I was searching for two numbers with
whitespace between them.
numeric = '[\+\-\d\.]'
match = re.search('^(%s+)\s+(%s+)$' % (numeric, numeric), line)
(For anyone who wants to quibble, I know the regular expression
isn't perfect.
For instance, it would match expressions like 23+48..6.1-64.5.
Not likely to be a problem in these files, so I didn't tune it further.)
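Here's the coordinate check in the same sketch form (Python 3 raw strings; the numeric character class is the article's):

```python
import re

# A "numeric" character: digit, decimal point, or sign.
numeric = r'[\+\-\d\.]'
# Two numeric runs separated by whitespace, anchored to the whole line.
coord_re = re.compile(r'^(%s+)\s+(%s+)$' % (numeric, numeric))

m = coord_re.search('35.8889 -106.3061')
print(m.group(1), m.group(2))  # -> 35.8889 -106.3061
```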
If the script doesn't find coordinates
but does find something that looks like an address, it feeds the
address into Google Maps and gets the resulting coordinates.
That code looks like this:
import googlemaps
from googlemaps import GoogleMaps

gmaps = GoogleMaps('YOUR GOOGLE MAPS API KEY HERE')
try:
    lat, lon = gmaps.address_to_latlng(addr)
except googlemaps.GoogleMapsError, e:
    # "import googlemaps" above is needed so the exception class resolves
    print "Oh, no! Couldn't geocode", addr
    print e
Overall, a nice simple solution made possible with python-googlemaps.
The full script is on github:
waymaker.
Tags: mapping, geolocation, geocaching, python, househunting, programming, regexp
[ 12:24 Aug 20, 2013 | More mapping | permalink to this entry ]
Wed, 24 Jul 2013
One more brief followup on that
comma
inserting sed pattern and its
followup:
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015
In the second article, I'd mentioned that the hardest part of the exercise
was figuring out where we needed backslashes.
Devdas (f3ew) asked on Twitter
whether I would still need all the backslash escapes even
if I put the pattern in a file -- in other words, are the backslashes
merely there to get the shell to pass special characters unchanged?
A good question, and I suspected the need for some of the backslashes
would disappear. So I tried this:
$ echo ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas
$ echo 20130607215015 | sed -f /tmp/commas
And it didn't work. No commas were inserted.
The problem, it turns out, is that my shell, zsh, changed both instances
of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does
$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas
But that only applies to echo: zsh doesn't do the \b -> ^H substitution
in the original command, where you pass the string directly as a sed argument.
Okay, with that straightened out, what about Devdas' question?
Surprisingly, it turns out that all the backslashes are still needed.
None of them go away when you echo > file
, so they
weren't there just to get special characters past the shell; and if
you edit the file and try removing some of the backslashes, you'll
see that the pattern no longer works. I had thought at least some of them,
like the ones before the \{ \}, were extraneous, but even those are
still needed.
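One way to shed most of those backslashes is to switch regexp dialects: GNU sed's -E flag selects extended regular expressions, where +, { } and ( ) are special without escaping. (\b still keeps its backslash, and this is a GNU/BSD extension rather than the pattern from the article.)

```shell
# The same comma-inserting loop in extended-regexp syntax:
echo 20130607215015 | sed -E ':a;s/\b([0-9]+)([0-9]{3})\b/\1,\2/;ta'
# -> 20,130,607,215,015
```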
Filtering unprintable characters
As long as I'm writing about regular expressions, I learned a nice
little tidbit last week. I'm getting an increasing
flood of Asian-language spams which my mail ISP doesn't filter out (they
use spamassassin, which is pretty useless for this sort of filtering).
I wanted a simple pattern I could pass to egrep (via procmail) that
would filter out anything with a run of more than 4 unprintable characters
in a row. [^[:print:]]{4,}
should do it, but it wasn't working.
The problem, it turns out, is the definition of what's printable.
Apparently when the default system character set is UTF-8, just about
everything is considered printable! So the trick is that you need to
set LC_ALL to something more restrictive, like C (which basically means
ASCII), before :print: becomes useful for language-based filtering.
(Thanks to Mikachu for spotting the problem).
So in a terminal, you can do something like
LC_ALL=C egrep -v '[^[:print:]]' filename
In procmail it was a little harder; I couldn't figure out any way to
change LC_ALL from a procmail recipe; the only solution I came up
with was to add this to ~/.procmailrc:
export LC_ALL=C
It does work, though, and has cut the spam load by quite a bit.
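You can see the locale effect directly in a terminal (assumes GNU grep; the Japanese string is just sample non-ASCII text):

```shell
# Under LC_ALL=C, every byte of the UTF-8 text is "unprintable",
# so the run-of-four-or-more pattern matches and the count is 1:
printf 'スパムスパム\n' | LC_ALL=C grep -Ec '[^[:print:]]{4,}'
# -> 1
```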
Tags: zsh, regexp, sed, cmdline, grep
[ 19:35 Jul 24, 2013 | More linux/cmdline | permalink to this entry ]
Tue, 09 Jul 2013
A few days ago I wrote about a nifty
sed
script to insert commas into numbers that I dissected with the
help of Dana Jansens.
Once we'd figured it out, though, Dana thought this wasn't really the best
solution. For instance, what if you have a file that has some numbers
in it, but also has some digits mixed up with letters? Do you really
want to insert commas into every string of digits? What if you have
some license plates, like abc1234? Maybe it would be better to
restrict the change to digits that stand by themselves and
are obviously meant to be numbers. How much harder would that be?
More regexp fun! We kicked it around a bit, and came up with a solution:
$ echo abc20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
abc20,130,607,215,015
$ echo abc20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
abc20130607215015
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015
Breaking that down:
- \b is any word boundary -- you could also use \< to indicate the start of a word, much as \> marks the end of a word.
- \([0-9]\+\) is any string of one or more digits, taken as a group. The \( \) part marks it as a group so we'll be able to use it later.
- \([0-9]\{3\}\) is a string of exactly three digits: again, we're using \( \) to mark it as our second numbered group.
- \b is another word boundary (we could use \>), indicating that the group of three digits must come at the end of a word, with only whitespace or punctuation following it.
- /\1,\2/ : once we've matched the pattern -- a word break, one or more digits, three digits and another word break -- we replace it with this. \1 is the first group we found, the string of one or more digits; \2 is the second group, the final trio of digits. And there's a comma in between.
We use the same :a; ;ta trick as in the first example
to loop around until there are no more triplets to match.
The hardest part of this was figuring out what needed to be escaped
with backslashes. The one that really surprised me was the \+.
Although *
works in sed the same way it does in other
programs, matching zero or more repetitions of the preceding pattern,
sed uses \+
rather than +
for one or more
repetitions. It took us some fiddling to find all the places we needed
backslashes.
Tags: regexp, sed, cmdline
[ 21:16 Jul 09, 2013 | More linux/cmdline | permalink to this entry ]
Sun, 07 Jul 2013
Carla Schroder's recent article,
More Great Linux Awk, Sed, and Bash Tips and Tricks,
had a nifty sed command I hadn't seen before to take a long number and
insert commas appropriately:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
Or, if you don't have a numbers.txt file, you can do something like
echo 20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
(I dropped the -i since that's for doing in-place edits of a file).
Nice! But why does it work?
It would be easy enough to insert commas after every third number,
but that doesn't work unless the number of digits is a multiple of three.
In other words, you don't want 20130607215015 to become
201,306,072,150,15 (note how the last group only has two digits);
it has to count in threes from the right if you want to end up
with 20,130,607,215,015.
Carla's article didn't explain it, and neither did any of the other
sites I found that mentioned this trick.
So, with some help from regexp wizard Dana Jansens (of
OpenBox fame), I've broken it down
into more easily understood bits.
Labels and loops
The first thing to understand is that this is actually several sed commands.
I was familiar with sed's basic substitute command, s/from/to/.
But what's the rest of it? The semicolons separate the commands, so
the whole sed script is:
:a
s/\B[0-9]\{3\}\>/,&/
ta
What this does is set up a label called a. It tries to do the
substitute command, and if the substitute succeeds (if something
was changed), then ta
tells it to loop back around to
label a, the beginning of the script.
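The :label / t branch combination is a general-purpose loop. Here's a minimal, unrelated example of the same trick, collapsing doubled letters until none are left:

```shell
# Each pass removes one doubled pair; t loops back while a substitution succeeded.
echo 'aaaabbb' | sed ':x;s/\(.\)\1/\1/;tx'
# -> ab
```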
So let's look at that substitute command.
The substitute
Sed's s/from/to/ (like the equivalent command in vim and many
other programs) looks for the first instance of the from pattern
and replaces it with the to pattern. So we're searching for
\B[0-9]\{3\}\>
and replacing it with
,&
Clear as mud, right? Well, the to pattern is easy: &
matches whatever we just substituted (from), so this just
sticks a comma in front of ... something.
The from pattern, \B[0-9]\{3\}\>
, is a bit more
challenging. Let's break down the various groups:
- \B -- Matches anything that is not a word boundary.
- [0-9] -- Matches any digit.
- \{3\} -- Matches three repetitions of whatever precedes it (in this case, a digit).
- \> -- Matches a word boundary at the end of a word. This was the hardest part to figure out, because no sed documentation anywhere bothers to mention this pattern. But Dana knew it as a vim pattern, and it turns out it does the same thing in sed even though the docs don't say so.
Okay, put them together, and the whole pattern matches any three digits
that are not preceded by a word boundary but which are
at the end of a word (i.e. they're followed by a word boundary).
Cool! So in our test number, 20130607215015, this matches the last
three digits, 015. It doesn't match any of the other digits because
they're not followed by a word end boundary.
So the substitute will insert a comma before the last three numbers.
Let's test that:
$ echo 20130607215015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607215,015
Sure enough!
How the loop works
So the substitution pattern just adds the last comma.
Once the comma is inserted, the ta
tells sed to go back
to the beginning (label :a) and do it again.
The second time, the comma that was just inserted is now a word
boundary, so the pattern matches the three digits before the comma,
215, and inserts another comma before them. Let's make sure:
$ echo 20130607215,015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607,215,015
So that's how the pattern manages to match triplets from right to left.
Dana later commented that this wasn't really the best solution -- what
if the string of digits is attached to other characters and isn't
really a number? I'll cover that in a separate article in a few days.
Update: Here's the smarter pattern,
Sed:
insert commas into numbers, but in a smarter way.
Tags: regexp, sed, cmdline
[ 14:14 Jul 07, 2013 | More linux/cmdline | permalink to this entry ]
Sat, 19 Jan 2013
I'm fiddling with a serial motor controller board, trying to get it
working with a Raspberry Pi. (It works nicely with an Arduino, but
one thing I'm learning is that everything hardware-related is far
easier with Arduino than with RPi.)
The excellent Arduino library helpfully
provided by Pololu
has a list of all the commands the board understands. Since it's Arduino,
they're in C++, and look something like this:
#define QIK_GET_FIRMWARE_VERSION 0x81
#define QIK_GET_ERROR_BYTE 0x82
#define QIK_GET_CONFIGURATION_PARAMETER 0x83
[ ... ]
#define QIK_CONFIG_DEVICE_ID 0
#define QIK_CONFIG_PWM_PARAMETER 1
and so on.
On the Raspberry Pi side, I'd prefer to use Python, so I need to get
them to look more like:
    QIK_GET_FIRMWARE_VERSION = 0x81
    QIK_GET_ERROR_BYTE = 0x82
    QIK_GET_CONFIGURATION_PARAMETER = 0x83
    [ ... ]
    QIK_CONFIG_DEVICE_ID = 0
    QIK_CONFIG_PWM_PARAMETER = 1
... and so on ... with an indent at the beginning of each line since
I want this to be part of a class.
There are 32 #defines, so of course, I didn't want to make all those
changes by hand. So I used vim. It took a little fiddling -- mostly
because I'd forgotten that vim doesn't offer + to mean "one or more
repetitions", so I had to use * instead. Here's the expression
I ended up with:
.,$s/\#define *\([A-Z0-9_]*\) *\(.*\)/    \1 = \2/
In English, you can read this as:
From the current line to the end of the file (.,$),
skip over "#define" and any spaces, then look for a pattern
consisting of only capital letters, digits and underscores
([A-Z0-9_]). Save that as expression #1 (\( \)).
Skip over any spaces, then take the rest of the line (.*
),
and call it expression #2 (\( \)
).
Then replace all that with a new line consisting of 4 spaces,
expression 1, a spaced-out equals sign, and expression 2
(    \1 = \2).
Who knew that all you needed was a one-line regular expression to
translate C into Python?
(Okay, so maybe it's not quite that simple.
Too bad a regexp won't handle the logic inside the library as well,
and the pin assignments.)
Tags: regexp, linux, programming, python, editors, vim
[ 21:38 Jan 19, 2013 | More linux/editors | permalink to this entry ]
Sun, 18 Dec 2011
A friend had a fun problem: she had some XML files she needed to
import into GNUcash, but the program that produced them left names
in all-caps and she wanted them more readable. So she'd have a file
like this:
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20111125000000[-5:EST]
<TRNAMT>-22.71
<FITID>****
<NAME>SOME COMPANY
<MEMO>SOME COMPANY ANY TOWN CA 11-25-11 330346
</STMTTRN>
and wanted to change the NAME and MEMO lines to read
Some Company and Any Town. However, the tags, like <NAME>,
all had to remain upper case, and presumably so did strings like DEBIT.
How do you change just the NAME and MEMO lines from upper case to title case?
The obvious candidate to do string substitutes is sed.
But there are several components to the problem.
Addresses
First, how do you ensure the replacement only happens on lines with
NAME and MEMO?
sed lets you specify address ranges for just that purpose.
If you say sed 's/xxx/yyy/'
sed will change all xxx's
to yyy; but if you say sed '/NAME/s/xxx/yyy/'
then sed will only do that substitution on lines containing NAME.
But we need this to happen on lines that contain either NAME or MEMO.
How do you do that? With \|
, like this:
sed '/\(NAME\|MEMO\)/s/xxx/yyy/'
Converting to title case
Next, how do you convert upper case to lower case?
There's a
sed
command for that: \L. Run
sed 's/.*/\L&/'
and type some upper and lower case
characters, and they'll all be converted to lower-case.
But here we want title case -- we want most of each word converted
to lowercase, but the first letter should stay uppercase.
That means we need to detect a word and figure out which is the
first letter.
In the strings we're considering, a word is a set of letters A through Z
with one of the following characteristics:
- It's preceded by a space
- It's preceded by a close-angle-bracket, >
So the pattern /[ >][A-Z]*/ will match anything we consider a word
that might need conversion.
But we need to separate the first letter and the rest of the word,
so we can treat them separately. sed's \( \) operators will let us do that.
The pattern \([ >][A-Z]\) finds the first letter of a word (including
the space or > preceding it), and saves that as its first matched
pattern, \1.
Then \([A-Z]*\) right after it will save the rest of the word as \2.
So, taking our \L case converter, we can convert to title case like this:
sed 's/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'
Starting to look long and scary, right? But it's not so bad if you build
it up gradually from components. I added a g on the end to tell sed
this is a global replace: do the operation on every word it finds in
the line, otherwise it will only make the substitution once, on the
first word it sees, then quit.
Putting it together
So we know how to seek out specific lines, and how to convert to
title case. Put the two together, and you get the final command:
sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'
I ran it on the test input, and it worked just fine.
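For instance, on one of the sample lines (GNU sed, which provides \L and the \| alternation):

```shell
# NAME matches the address, so the title-case substitution runs;
# the tag itself is untouched because < precedes its letters.
printf '<NAME>SOME COMPANY\n' |
  sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'
# -> <NAME>Some Company
```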
For more information on sed, a good place to start is the
sed
regular expressions manual.
Tags: regexp, cmdline, sed
[ 14:13 Dec 18, 2011 | More linux/cmdline | permalink to this entry ]
Tue, 15 Mar 2011
It's another episode of "How to use Linux to figure out CarTalk puzzlers"!
This time you don't even need any programming.
Last week's puzzler was
A
Seven-Letter Vacation Curiosity. Basically, one couple hiking
in Northern California and another couple carousing in Florida
both see something described by a seven-letter word containing
all five vowels -- but the two things they saw were very different.
What's the word?
That's an easy one to solve using basic Linux command-line skills --
assuming the word is in the standard dictionary. If it's some esoteric
word, all bets are off. But let's try it and see. It's a good beginning
exercise in regular expressions and how to use the command line.
There's a handy word list in /usr/share/dict/words, one word per line.
Depending on what packages you have installed, you may have bigger
dictionaries handy, but you can usually count on /usr/share/dict/words
being there on any Linux system. Some older Unix systems may have it in
/usr/dict/words instead.
We need a way to choose all seven letter words.
That's easy. In a regular expression, . (a dot) matches one letter.
So ....... (seven dots) matches any seven letters.
(There's a more direct way to do that: the expression .\{7\}
will also match 7 letters, and is really a better way. But personally,
I find it harder both to remember and to type than the seven dots.
Still, if you ever need to match 43 characters, or 114, it's good to know the
"right" syntax.)
Fine, but if you grep ....... /usr/share/dict/words
you get a list of words with seven or more letters. See why?
It's because grep prints any line where it finds a match -- and a
word with nine letters certainly contains seven letters within it.
The pattern you need to search for is '^.......$' -- the up-caret ^
matches the beginning of a line, and the dollar sign $ matches the end.
Put single quotes around the pattern so the shell won't try to interpret
the caret or dollar sign as special characters. (When in doubt, it's
always safest to put single quotes around grep patterns.)
So now we can view all seven-letter words:
grep '^.......$' /usr/share/dict/words
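You can check the anchoring on a toy list without touching the dictionary file:

```shell
# Only the exactly-seven-letter word survives the anchored pattern:
printf 'banana\neducate\neducation\n' | grep '^.......$'
# -> educate
```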
How do we choose only the ones that contain all the letters a e i o and u?
That's easy enough to build up using pipelines, using the pipe
character | to pipe the output of one grep into a different grep.
grep '^.......$' /usr/share/dict/words | grep a
sends that list of 7-letter words through another grep command to
make sure you only see words containing an a.
Now tack a grep for each of the other letters on the end, the same way:
grep '^.......$' /usr/share/dict/words | grep a | grep e | grep i | grep o | grep u
Voilà! I won't spoil the puzzler, but there are two words that
match, and one of them is obviously the answer.
The power of the Unix command line to the rescue!
Tags: cmdline, regexp, linux, shell, pipelines, puzzles
[ 11:00 Mar 15, 2011 | More linux/cmdline | permalink to this entry ]
Mon, 29 Mar 2010
I maintain the websites for several clubs. No surprise there -- anyone
with even a moderate knowledge of HTML, or just a lack of fear of
it, invariably gets drafted into that job in any non-computer club.
In one club, the person in charge of scheduling sends out an elaborate
document every three months in various formats -- Word, RTF, Excel, it's
different each time. The only regularity is that it's always full of
crap that makes it hard to turn it into a nice simple HTML table.
This quarter, the formats were Word and RTF. I used unrtf to turn
the RTF version into HTML -- and found a horrifying page full of
lines like this:
<body><font size=3></font><font size=3><br>
</font><font size=3></font><b><font size=4></font></b><b><font size=4><table border=2>
</font></b><b><font size=4><tr><td><b><font size=4><font face="Arial">Club Schedule</font></font></b><b><font size=4></font></b><b><font size=4></font></b></td>
<font size=3></font><font size=3><td><font size=3><b><font face="Arial">April 13</font></b></font><font size=3></font><font size=3><br>
</font><font size=3></font><font size=3><b></b></font></td>
The actual page content here is just "Club Schedule" and "April 13";
the rest is junk, mostly doing nothing, mostly not even legal HTML,
that needs to be eliminated if I want
the page to load and display reasonably.
I didn't want to clean up that mess by hand! So I needed some regular
expressions to clean it up in an editor.
I tried emacs first, but emacs makes it hard to try an expression then
modify it a little when the first try doesn't work, so I switched to vim.
The key to this sort of cleanup is non-greedy regular expressions.
When you have a bad tag sandwiched in the middle of a line containing
other tags, you want to remove everything from the <font
through the next > -- but no farther, or else you'll delete
real content. If you have a line like
<td><font size=3>Hello<font> world</td>
you only want to delete through the <font>, not through the </td>.
In general, you make a regular expression non-greedy by adding a ?
after the wildcard -- e.g. <font.*?>. But that doesn't work
in vim. In vim, you have to use \{M,N} which matches
from M to N repetitions of whatever immediately precedes it.
You can also use the shortcut \{-} to mean the same thing
as *? (0 or more matches) in other programs.
Using that, I built up a series of regexp substitutes to clean up
that unrtf mess in vim:
:%s/<\/\{0,1}font.\{-}>//g
:%s/<b><\/b>//g
:%s/<\/b><b>//g
:%s/<\/i><i>//g
:%s/<td><\/td>/<td><br><\/td>/g
:%s/<\/\{0,1}span.\{-}>//g
:%s/<\/\{0,1}center>//g
That took care of 90% of the work, leaving me with hardly any cleanup
I needed to do by hand. I'll definitely keep that list around for
the next time I need to do this.
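In tools without vim's \{-}, the usual substitute for "non-greedy up to >" is a negated character class, which can't run past the closing bracket. A sed sketch of the font-tag cleanup:

```shell
# [^>]* stops at the first >, so real content after the tag survives:
printf '<td><font size=3>Hello<font> world</td>\n' |
  sed 's/<\/\{0,1\}font[^>]*>//g'
# -> <td>Hello world</td>
```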
Tags: regexp, html, editors, vim
[ 23:02 Mar 29, 2010 | More linux/editors | permalink to this entry ]
Sun, 12 Oct 2008
Someone on LinuxChix' techtalk list asked whether she could get
tcsh to print "[no output]" after any command that doesn't produce
output, so that when she makes logs to help her co-workers, they
will seem clearer.
I don't know of a way to do that in any shell (the shell would have
to capture the output of every command; emacs' shell-mode does that
but I don't think any real shells do) but it seemed like it ought
to be straightforward enough to do as a regular expression substitute
in vi. You're looking for lines where a line beginning with a prompt
is followed immediately by another line beginning with a prompt;
the goal is to insert a new line consisting only of "[no output]"
between the two lines.
It turned out to be pretty easy in vim. Here it is:
:%s/\(^% .*$\n\)\(% \)/\1[no results]\r\2/
Explanation:
- :
- starts a command
- %
- do the following command on every line of this file
- s/
- start a global substitute command
- \(
- start a "capture group" -- you'll see what it does soon
- ^
- match only patterns starting at the beginning of a line
- %
- look for a % followed by a space (your prompt)
- .*
- after the prompt, match any other characters until...
- $
- the end of the line, after which...
- \n
- there should be a newline character
- \)
- end the capture group after the newline character
- \(
- start a second capture group
- %
- look for another prompt. In other words, this whole
- expression will only match when a line starting with a prompt
- is followed immediately by another line starting with a prompt.
- \)
- end the second capture group
- /
- We're finally done with the pattern to match!
Now we'll start the replacement pattern.
- \1
- Insert the full content of the first capture group
(this is also called a "backreference" if you want
to google for a more detailed explanation).
So insert the whole first command up to the newline
after it.
- [no output]
- After the newline, insert your desired string.
- \r
- insert a carriage return here (I thought this should be
\n for a newline, but that made vim insert a null instead)
- \2
- insert the second capture group (that's just the second prompt)
- /
- end of the substitute pattern
Of course, if you have a different prompt, substitute it for "% ".
If you have a complicated prompt that includes time of day or
something, you'll have to use a slightly more complicated match
pattern to match it.
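If you'd rather fix up a captured log non-interactively, the same insertion is easy in awk. This is a sketch, assuming the prompt is literally "% "; the sample lines stand in for a real session log, and the END block additionally marks the final command if it produced nothing:

```shell
# Sample session log; in real use you'd read your captured log file.
printf '%s\n' '% ls' 'file1' '% pwd' '% date' |
awk 'prev ~ /^% / && /^% / { print "[no output]" }  # two prompt lines in a row
     { print; prev = $0 }                           # pass every line through
     END { if (prev ~ /^% /) print "[no output]" }' # last command had no output
# Prints "[no output]" between "% pwd" and "% date", and after "% date".
```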
Tags: regexp, shell, CLI, linux, editors
[ 14:34 Oct 12, 2008 | More linux/editors | permalink to this entry ]
Sun, 31 Aug 2008
I wanted to get a list of who'd been contributing the most in a
particular open source project. Most projects of any size have a
ChangeLog file, in which check-ins have entries like this:
2008-08-26 Jane Hacker <hacker@domain.org>
* src/app/print.c: make sure the Portrait and Landscape
* buttons update according to the current setting.
I wanted to take each entry, save the name of the developer checking
in, then eventually count the number of times each name occurs (the
number of times that developer checked in) and print them in order
from most check-ins to least.
Getting the names is easy: for check-ins in the last 9 years, I just
want the lines that start with "200". (Of course, if I wanted earlier
check-ins I could make the match more general.)
grep "^200" ChangeLog
But now I want to trim the line so it includes only the
contributor's name. A bit of sed geekery can do that: the date is a
fixed format (four characters, a dash, two, a dash, two more, then two
spaces), so "^....-..-.. " matches that pattern.
But I want to remove the email address part too
(sometimes people use different email addresses
when they check in). So I want a sed pattern that will match
something at the front (to discard), something in the middle (keep that part)
and something at the end (discard).
Here's how to do that in sed:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)<.*$/\1/'
In English, that says: "For each line in the ChangeLog that starts
with 200, find a pattern at the beginning consisting of any four
characters, a dash, two characters, a dash, two characters, and
two spaces; then immediately after that, save all characters up to
a < symbol; then throw away the < and any characters that follow
until the end of the line."
That works pretty well! But it's not quite right: it includes the
two spaces after the name as part of the name. In GNU sed, \s matches
any whitespace character (space or tab).
So you'd think this should work:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)\s+<.*$/\1/'
\s+ means it will require that at least one and maybe more space
characters immediately before the < are also discarded.
But it doesn't work. It turns out the reason is that the \(.*\)
expression is "greedier" than the \s+: so the saved name expression
grabs the first space, leaving only the second to the \s+.
The way around that is to make the name expression specify that it
can't end with a space. \S is the term for "anything that's not a
space character"; so the expression becomes
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\S\)\s\+<.*$/\1/'
(the + turned out to need a backslash before it).
We have the list of names! Add a | sort
on the end to
sort them alphabetically -- that will make sure you get all the
"Jane Hacker" lines listed together. But how to count them?
The Unix program most frequently invoked after sort is uniq,
which collapses adjacent repeated lines into one.
On a hunch, I checked out the man page, man uniq,
and found the -c option: "prefix lines by the number of occurrences".
Perfect! Then just sort them by the number, from largest to
smallest:
grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\S\)\s\+<.*$/\1/' | sort | uniq -c | sort -rn
And we're done!
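To see the whole pipeline in action, here's a toy ChangeLog (names and filenames made up; note the single space after the date, matching the pattern above) run through the full command:

```shell
# Build a tiny fake ChangeLog: three check-ins by two people,
# with Jane using two different email addresses.
cat > /tmp/ChangeLog.test <<'EOF'
2008-08-26 Jane Hacker <hacker@domain.org>
	* src/app/print.c: fix the buttons.
2008-08-20 Joe Coder <joe@example.com>
	* README: fix a typo.
2008-08-12 Jane Hacker <jane@otherhost.com>
	* configure.ac: bump the version.
EOF
grep "^200" /tmp/ChangeLog.test |
  sed 's/^....-..-.. \(.*\S\)\s\+<.*$/\1/' |
  sort | uniq -c | sort -rn
# ->  2 Jane Hacker
#     1 Joe Coder
```

Jane's two different addresses collapse into one name, which is exactly why the email part gets stripped before counting.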
Now, this isn't perfect since it doesn't catch "Checking in patch
contributed by susan@otherhost.com" attributions -- but those aren't in
a standard format in most projects, so they have to be handled by hand.
Disclaimer: Of course, number of check-ins is not a good measure of
how important or productive someone is. You can check in a lot of
one-line fixes, or you can write an important new module and submit
it for someone else to merge in. The point here wasn't to rank
developers, but just to get an idea who was checking into the tree
and how often.
Well, that ... and an excuse to play with nifty Linux shell pipelines.
Tags: shell, CLI, linux, pipelines, regexp
[ 12:12 Aug 31, 2008 | More linux | permalink to this entry ]
Thu, 20 Dec 2007
I had a chance to spend a day at the AGU conference last week. The
American Geophysical Union is a fabulous conference -- something like
14,000 different talks over the course of the week, on anything
related to earth or planetary sciences -- geology, solar system
astronomy, atmospheric science, geophysics, geochemistry, you name it.
I have no idea how regular attendees manage the information overload
of deciding which talks to attend. I wasn't sure how I would, either,
but I started by going through the schedule
for the day I'd be there, picking out a (way too long) list of
potentially interesting talks, and saving them as lines in a file.
Now I had a file full of lines like:
1020 U22A MS 303 Terrestrial Impact Cratering: New Insights Into the Cratering Process From Geophysics and Geochemistry II
Fine, except that I couldn't print out something like that -- printers
stop at 80 columns. I could pass it through a program like "fold" to
wrap the long lines, but then it would be hard to scan through quickly
to find the talk titles and room numbers. What I really wanted was to
wrap it so that the above line turned into something like:
1020 U22A MS 303 Terrestrial Impact Cratering: New Insights
Into the Cratering Process From Geophysics
and Geochemistry II
But how to do that? I stared at it for a while, trying to figure out
whether there was a clever vim substitute that could handle it.
I asked on a couple of IRC channels, just in case there was some
amazing Linux smart-wrap utility I'd never heard of.
I was on the verge of concluding that the answer was no, and that I'd
have to write a python script to do the wrapping I wanted, when
Mikael emitted a burst of line noise:
%s/\(.\{72\}\)\(.*\)/\1^M^I^I^I\2/
Only it wasn't line noise. Seems Mikael just happened to have been
reading about some of the finer points of vim regular expressions
earlier that day, and he knew exactly the trick I needed: that
.\{72\}, which matches exactly 72 characters, so the whole
pattern only matches lines at least 72 characters long. And
amazingly, that expression did something very close to what I wanted.
Or at least the first step of it. It inserts the first line break,
turning my line into
1020 U22A MS 303 Terrestrial Impact Cratering: New Insights
Into the Cratering Process From Geophysics and Geochemistry II
but I still needed to wrap the second and subsequent lines.
But that was an easier problem -- just do essentially the same thing
again, but limit it to only lines starting with a tab.
After some tweaking, I arrived at exactly what I wanted:
%s/^\(.\{,65\}\) \(.*\)/\1^M^I^I^I\2/
%g/^^I^I^I.\{58\}/s/^\(.\{,55\}\) \(.*\)/\1^M^I^I^I\2/
I had to run the second line two or three times to wrap the very long
lines.
Devdas helpfully translated the second one into English:
"You have 3 tabs, followed by 58 characters, out of
which you match the first 55 and put that bit in $1, and the capture
the remaining in $2, and rewrite to $1 newline tab tab tab $2."
Here's a more detailed breakdown:
Line one:
%          Do this over the whole file
s/         Begin global substitute
^          Start at the beginning of the line
\(         Remember the result of the next match
.\{,65\}   Look for up to 65 characters
\) \(      End of remembered pattern #1, skip a space, and start
           remembered pattern #2
.*\)       Pattern #2 includes everything to the end of the line
/          End of matched pattern; begin replacement pattern
\1^M       Insert saved pattern #1 (up to 65 characters ending just
           before a space) followed by a newline
^I^I^I\2   On the second line, insert three tabs, then saved pattern #2
/          End replacement pattern
Line two:
%g/        Over the whole file, only operate on lines with this pattern
^^I^I^I    Lines starting with three tabs
.\{58\}/   After the tabs, only match lines that still have at least
           58 characters (this guards against wrapping already-wrapped
           lines when it's run repeatedly)
s/         Begin global substitute
^          Start at the beginning of the line
\(         Remember the result of the next match
.\{,55\}   Up to 55 characters
\) \(      End of remembered pattern #1, skip a space, and start
           remembered pattern #2
.*\)       Pattern #2 includes everything to the end of the line
/          End of matched pattern; begin replacement pattern
\1^M       The first pattern (up to 55 characters) becomes one line
^I^I^I\2   Three tabs, then the second pattern
/          End replacement pattern
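For comparison, if you only need the wrapping and not the interactivity, GNU fold and sed can produce the same hanging indent. A sketch for a single long line only: for a whole file you'd loop, since the sed address here spares just the first line of the entire input:

```shell
line='1020 U22A MS 303 Terrestrial Impact Cratering: New Insights Into the Cratering Process From Geophysics and Geochemistry II'
printf '%s\n' "$line" |
  fold -s -w 65 |          # break at spaces, at most 65 columns per line
  sed '2,$s/^/\t\t\t/'     # indent every continuation line with three tabs
```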
Greedy and non-greedy brace matches
The real key is those curly-brace expressions, \{,65\}
and \{58\}
-- that's how you control how many characters
vim will match and whether or not the match is "greedy".
Here's how they work (thanks to Mikael for explaining).
The basic expression is {M,N} --
it means between M and N matches of whatever precedes it.
(Vim requires that the first brace be escaped -- \{}. Escaping the
second brace is optional.)
So .{M,N} can match anything between M and N characters,
but "prefer" N, i.e. try to match as many as possible up to N.
To make it "non-greedy" (match as few as possible, "preferring" M),
use .{-M,N}.
You can leave out M, N, or both; M defaults to 0 while N defaults to
infinity. So {} is short for {0,∞} and is
equivalent to *, while {-} means {-0,∞}, like a non-greedy
version of *.
Given the string: one, two, three, four, five

,.\{},     matches , two, three, four,
,.\{-},    matches , two,
,.\{5,},   matches , two, three, four,
,.\{-5,},  matches , two, three,
,.\{,2},   matches nothing
,.\{,7},   matches , two,
,.\{5,7},  matches , three,
Of course, this syntax is purely for vim; regular expressions are
unfortunately different in sed, perl and every other program.
Here's a fun table of regexp terms in various programs.
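For instance, PCRE writes non-greedy with a trailing ? instead of vim's \{-...\}. Assuming a GNU grep built with -P (PCRE) support, two of the rows in the table above can be checked from the shell:

```shell
s='one, two, three, four, five'
echo "$s" | grep -oP ',.{5,},'    # greedy, like vim's ,.\{5,},
# -> , two, three, four,
echo "$s" | grep -oP ',.{5,}?,'   # non-greedy, like vim's ,.\{-5,},
# -> , two, three,
```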
Tags: linux, editors, regexp
[ 12:44 Dec 20, 2007 | More linux/editors | permalink to this entry ]
Sun, 14 May 2006
I had a page of plaintext which included some URLs in it, like this:
Tour of the Hayward Fault
http://www.mcs.csuhayward.edu/~shirschf/tour-1.html
Technical Reports on Hayward Fault
http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm
I wanted to add links around each of the urls, so that I could make
it part of a web page, more like this:
Tour of the Hayward Fault
<a href="http://www.mcs.csuhayward.edu/~shirschf/tour-1.html">http://www.mcs.csuhayward.edu/~shirschf/tour-1.html</a>
Technical Reports on Hayward Fault
<a href="http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm">http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm</a>
Surely there must be a program to do this, I thought. But I couldn't
find one that was part of a standard Linux distribution.
But you can do a fair job of linkifying just using a regular
expression in an editor like vim or emacs, or by using sed or perl from
the commandline. You just need to specify the input pattern you want
to change, then how you want to change it.
Here's a recipe for linkifying with regular expressions.
Within vim:
:%s_\(https\=\|ftp\)://\S\+_<a href="&">&</a>_
If you're new to regular expressions, it might be helpful to see a
detailed breakdown of why this works:
- :
- Tell vim you're about to type a command.
- %
- The following command should be applied everywhere in the file.
- s_
- Do a global substitute, and everything up to the next underscore
will represent the pattern to match.
- \(
- This will be a list of several alternate patterns.
- http
- If you see an "http", that counts as a match.
- s\=
- Zero or one esses after the http will match: so http and https are
okay, but httpsssss isn't.
- \|
- Here comes another alternate pattern that you might see instead
of http or https.
- ftp
- URLs starting with ftp are okay too.
- \)
- We're done with the list of alternate patterns.
- ://
- After the http, https or ftp there should always be a colon-slash-slash.
- \S
- After the ://, there must be a character which is not whitespace.
- \+
- There can be any number of these non-whitespace characters as long
as there's at least one. Keep matching until you see a space.
- _
- Finally, the underscore that says this is the end of the pattern
to match. Next (until the final underscore) will be the expression
which will replace the pattern.
- <a href="&">
- An ampersand, &, in a substitute expression means "insert
everything that was in the original pattern". So the whole url will
be inserted between the quotation marks.
- &</a>
- Now, outside the <a href="..."> tag, insert the matched url
again, and follow it with a </a> to close the tag.
- _
- The final underscore which says "this is the end of the
replacement pattern". We're done!
Linkifying from the commandline using sed
Sed is a bit trickier: it doesn't understand \S for
non-whitespace, nor \= for "zero or one occurrence".
But this expression does the trick:
sed -e 's_\(http\|https\|ftp\)://[^ \t]\+_<a href="&">&</a>_' <infile.txt >outfile.html
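A quick check of that sed line on the sample text from the top of this entry (GNU sed assumed, for the \| alternation and for \t inside the bracket expression):

```shell
# Linkify the plaintext sample: non-URL lines pass through untouched,
# URL lines get wrapped in an <a href> tag.
printf '%s\n' 'Tour of the Hayward Fault' \
  'http://www.mcs.csuhayward.edu/~shirschf/tour-1.html' |
  sed -e 's_\(http\|https\|ftp\)://[^ \t]\+_<a href="&">&</a>_'
# Tour of the Hayward Fault
# <a href="http://www.mcs.csuhayward.edu/~shirschf/tour-1.html">http://www.mcs.csuhayward.edu/~shirschf/tour-1.html</a>
```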
Addendum: George
Riley tells me about
VST for Vim 7,
which looks like a nice package to linkify, htmlify, and various
other useful things such as creating HTML presentations.
I don't have Vim 7 yet, but once I do I'll definitely check out VST.
Tags: linux, editors, pipelines, regexp, shell, CLI
[ 13:40 May 14, 2006 | More linux/editors | permalink to this entry ]
Mon, 10 Oct 2005
Ever want to look for something in your browser cache, but when you
go there, it's just a mass of oddly named files and you can't figure
out how to find anything?
(Sure, for whole pages you can use the History window, but what if
you just want to find an image you saw this morning
that isn't there any more?)
Here's a handy trick.
First, change directory to your cache directory (e.g.
$HOME/.mozilla/firefox/blahblah/Cache).
Next, list the files of the type you're looking for, in the order in
which they were last modified, and save that list to a file. Like this:
% file `ls -1t` | grep JPEG | sed 's/: .*//' > /tmp/foo
In English:
ls -t lists in order of modification date, and -1 ensures
that the files will be listed one per line. Pass that through
grep for the right pattern (do a file * to see what sorts of
patterns get spit out), then pass that through sed to get rid of
everything but the filename. Save the result to a temporary file.
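Here's that list-building step in a throwaway directory, using a GIF instead of a JPEG (the six-byte GIF89a signature is enough for file(1) to identify it; the filenames are made up):

```shell
dir=$(mktemp -d) && cd "$dir"
printf 'GIF89a' > newest.gif       # minimal GIF signature
printf 'plain text\n' > notes.txt  # a file of some other type
file $(ls -1t) | grep GIF | sed 's/: .*//' > /tmp/foo
cat /tmp/foo
# -> newest.gif
```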
The temp file now contains the list of cache files of the type you
want, ordered with the most recent first. You can now search through
them to find what you want. For example, I viewed them with Pho:
pho `cat /tmp/foo`
For images, use whatever image viewer you normally use; if you're
looking for text, you can use grep or whatever search you like.
Alternately, you could ls -lt `cat /tmp/foo` to see what was
modified when and cut down your search a bit further, or do any
other additional paring you need.
Of course, you don't have to use the temp file at all. I could
have said simply:
pho $(file $(ls -1t) | grep JPEG | sed 's/: .*//')
Making the temp file is merely for your convenience if you think you
might need to do several types of searches before you find what
you're looking for.
Tags: tech, web, mozilla, firefox, pipelines, CLI, shell, regexp
[ 22:40 Oct 10, 2005 | More tech/web | permalink to this entry ]