Non-greedy regular expressions to clean up crappy autogenerated HTML (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Mon, 29 Mar 2010

Non-greedy regular expressions to clean up crappy autogenerated HTML

I maintain the websites for several clubs. No surprise there -- anyone with even a moderate knowledge of HTML, or just a lack of fear of it, invariably gets drafted into that job in any non-computer club.

In one club, the person in charge of scheduling sends out an elaborate document every three months in various formats -- Word, RTF, Excel, it's different each time. The only regularity is that it's always full of crap that makes it hard to turn it into a nice simple HTML table.

This quarter, the formats were Word and RTF. I used unrtf to turn the RTF version into HTML -- and found a horrifying page full of lines like this:

<body><font size=3></font><font size=3><br>
</font><font size=3></font><b><font size=4></font></b><b><font size=4><table border=2>
</font></b><b><font size=4><tr><td><b><font size=4><font face="Arial">Club Schedule</font></font></b><b><font size=4></font></b><b><font size=4></font></b></td>
<font size=3></font><font size=3><td><font size=3><b><font face="Arial">April 13</font></b></font><font size=3></font><font size=3><br>
</font><font size=3></font><font size=3><b></b></font></td>
I've put the actual page content in bold; the rest is just junk, mostly doing nothing, mostly not even legal HTML, that needs to be eliminated if I want the page to load and display reasonably.

I didn't want to clean up that mess by hand! So I needed some regular expressions to clean it up in an editor. I tried emacs first, but emacs makes it hard to try an expression then modify it a little when the first try doesn't work, so I switched to vim.

The key to this sort of cleanup is non-greedy regular expressions. When you have a bad tag sandwiched in the middle of a line containing other tags, you want to remove everything from the <font through the next > -- but no farther, or else you'll delete real content. If you have a line like

<td><font size=3>Hello<font> world</td>
you only want to delete through the <font>, not through the </td>.

In general, you make a regular expression non-greedy by adding a ? after the wildcard -- e.g. <font.*?>. But that doesn't work in vim. In vim, you have to use \{M,N} which matches from M to N repetitions of whatever immediately precedes it. You can also use the shortcut \{-} to mean the same thing as *? (0 or more matches) in other programs.

Using that, I built up a series of regexp substitutes to clean up that unrtf mess in vim:

:%s/<\/\{0,1}font.\{-}>//g
:%s/<b><\/b>//g
:%s/<\/b><b>//g
:%s/<\/i><i>//g
:%s/<td><\/td>/<td><br><\/td>/g
:%s/<\/\{0,1}span.\{-}>//g
:%s/<\/\{0,1}center>//g

That took care of 90% of the work, leaving me with hardly any cleanup I needed to do by hand. I'll definitely keep that list around for the next time I need to do this.

Tags: , , ,
[ 23:02 Mar 29, 2010    More linux/editors | permalink to this entry | ]

Comments via Disqus:

blog comments powered by Disqus