Non-greedy regular expressions to clean up crappy autogenerated HTML
I maintain the websites for several clubs. No surprise there -- anyone with even a moderate knowledge of HTML, or just a lack of fear of it, invariably gets drafted into that job in any non-computer club.
In one club, the person in charge of scheduling sends out an elaborate document every three months in various formats -- Word, RTF, Excel, it's different each time. The only regularity is that it's always full of crap that makes it hard to turn it into a nice simple HTML table.
This quarter, the formats were Word and RTF. I used unrtf to turn the RTF version into HTML -- and found a horrifying page full of lines like this:
<body><font size=3></font><font size=3><br> </font><font size=3></font><b><font size=4></font></b><b><font size=4><table border=2> </font></b><b><font size=4><tr><td><b><font size=4><font face="Arial">Club Schedule</font></font></b><b><font size=4></font></b><b><font size=4></font></b></td> <font size=3></font><font size=3><td><font size=3><b><font face="Arial">April 13</font></b></font><font size=3></font><font size=3><br> </font><font size=3></font><font size=3><b></b></font></td>
The only actual page content in there is "Club Schedule" and "April 13"; the rest is just junk, mostly doing nothing, mostly not even legal HTML, that needs to be eliminated if I want the page to load and display reasonably.
I didn't want to clean up that mess by hand! So I needed some regular expressions to clean it up in an editor. I tried emacs first, but emacs makes it hard to try an expression then modify it a little when the first try doesn't work, so I switched to vim.
The key to this sort of cleanup is non-greedy regular expressions. When you have a bad tag sandwiched in the middle of a line containing other tags, you want to remove everything from the <font through the next > -- but no farther, or else you'll delete real content. If you have a line like
<td><font size=3>Hello<font> world</td>
you only want to delete through the > that ends each font tag, not all the way through the </td>.
In general, you make a regular expression non-greedy by adding a ? after the quantifier -- e.g. <font.*?>. But that doesn't work in vim. In vim, \{M,N} matches from M to N repetitions of whatever immediately precedes it, as many as possible; adding a - just inside the brace, as in \{-M,N}, makes it match as few as possible instead. The shortcut \{-} means the same thing as *? (0 or more matches, as few as possible) in other programs.
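To see the difference outside vim, here is a minimal Python sketch using the standard re module, which supports the *? syntax directly. It runs a greedy and a non-greedy pattern over the example line above; the variable names are just for illustration.

```python
import re

line = "<td><font size=3>Hello<font> world</td>"

# Greedy: .* runs to the LAST > on the line, taking the real content with it.
greedy = re.sub(r"<font.*>", "", line)

# Non-greedy: .*? stops at the FIRST > after each <font.
lazy = re.sub(r"<font.*?>", "", line)

print(greedy)  # <td>
print(lazy)    # <td>Hello world</td>
```

The greedy version eats everything from the first <font through the > of </td>, which is exactly the over-deletion to avoid.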
Using that, I built up a series of regexp substitutes to clean up that unrtf mess in vim:
:%s/<\/\{0,1}font.\{-}>//g
:%s/<b><\/b>//g
:%s/<\/b><b>//g
:%s/<\/i><i>//g
:%s/<td><\/td>/<td><br><\/td>/g
:%s/<\/\{0,1}span.\{-}>//g
:%s/<\/\{0,1}center>//g
That took care of 90% of the work, leaving me with hardly any cleanup I needed to do by hand. I'll definitely keep that list around for the next time I need to do this.
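One way to keep the list around would be as a small script instead of a set of editor commands. The following is only a sketch of that idea in Python, not what I actually ran: the patterns are my translation of the vim substitutions into Python's re syntax (\{0,1} becomes ? and \{-} becomes *?), and the names SUBSTITUTIONS and clean_html are made up for the example.

```python
import re

# (pattern, replacement) pairs mirroring the vim substitutions, in the same order.
SUBSTITUTIONS = [
    (r"</?font.*?>", ""),             # opening and closing font tags
    (r"<b></b>", ""),                 # empty bold pairs
    (r"</b><b>", ""),                 # bold turned off and straight back on
    (r"</i><i>", ""),                 # same for italics
    (r"<td></td>", "<td><br></td>"),  # give empty cells some content
    (r"</?span.*?>", ""),             # opening and closing span tags
    (r"</?center>", ""),              # center tags
]


def clean_html(text):
    """Apply each substitution in order to every match in text."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text


sample = ('<td><b><font size=4><font face="Arial">Club Schedule</font></font></b>'
          '<b><font size=4></font></b></td>')
print(clean_html(sample))  # <td><b>Club Schedule</b></td>
```

As in vim, the order matters: the empty <b></b> pairs only appear once the font tags inside them have been stripped.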
[ 23:02 Mar 29, 2010 ]