Linkifying with Regular Expressions (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sun, 14 May 2006

Linkifying with Regular Expressions

I had a page of plaintext which included some URLs in it, like this:
Tour of the Hayward Fault
http://www.mcs.csuhayward.edu/~shirschf/tour-1.html

Technical Reports on Hayward Fault
http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm

I wanted to add links around each of the urls, so that I could make it part of a web page, more like this:

Tour of the Hayward Fault
http://www.mcs.csu hayward.edu/~shirschf/tour-1.html

Technical Reports on Hayward Fault
htt p://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm

Surely there must be a program to do this, I thought. But I couldn't find one that was part of a standard Linux distribution.

But you can do a fair job of linkifying just using a regular expression in an editor like vim or emacs, or by using sed or perl from the commandline. You just need to specify the input pattern you want to change, then how you want to change it.

Here's a recipe for linkifying with regular expressions.

Within vim:

:%s_\(https\=\|ftp\)://\S\+_<a href="&">&</a>_

If you're new to regular expressions, it might be helpful to see a detailed breakdown of why this works:

:
Tell vim you're about to type a command.
%
The following command should be applied everywhere in the file.
s_
Do a global substitute, and everything up to the next underscore will represent the pattern to match.
\(
This will be a list of several alternate patterns.
http
If you see an "http", that counts as a match.
s\=
Zero or one esses after the http will match: so http and https are okay, but httpsssss isn't.
\|
Here comes another alternate pattern that you might see instead of http or https.
ftp
URLs starting with ftp are okay too.
\)
We're done with the list of alternate patterns.
://
After the http, https or ftp there should always be a colon-slash-slash.
\S
After the ://, there must be a character which is not whitespace.
\+
There can be any number of these non-whitespace characters as long as there's at least one. Keep matching until you see a space.
_
Finally, the underscore that says this is the end of the pattern to match. Next (until the final underscore) will be the expression which will replace the pattern.
<a href="&">
An ampersand, &, in a substitute expression means "insert everything that was in the original pattern". So the whole url will be inserted between the quotation marks.
&</a>
Now, outside the <a href="..."> tag, insert the matched url again, and follow it with a </a> to close the tag.
_
The final underscore which says "this is the end of the replacement pattern". We're done!

Linkifying from the commandline using sed

Sed is a bit trickier: it doesn't understand \S for non-whitespace, nor = for "zero or one occurrence". But this expression does the trick:
sed -e 's_\(http\|https\|ftp\)://[^ \t]\+_<a href="&">&</a>_' <infile.txt >outfile.html

Addendum: George Riley tells me about VST for Vim 7, which looks like a nice package to linkify, htmlify, and various other useful things such as creating HTML presentations. I don't have Vim 7 yet, but once I do I'll definitely check out VST.

Tags: , , , , ,
[ 13:40 May 14, 2006    More linux/editors | permalink to this entry | ]

Comments via Disqus:

blog comments powered by Disqus