Shallow Thoughts : tags : html

Akkana's Musings on Open Source Computing, Science, and Nature.

Mon, 29 Jul 2013

Mutt: Show HTML for sites that send broken HTML mail

Increasingly I'm seeing broken sites that send automated HTML mail with headers claiming it's plain text.

To understand what's happening, you have to know about something called MIME multipart/alternative.

MIME stands for Multipurpose Internet Mail Extensions: it's the way mail encodes different types of attachments, so you can attach images, music, PDF documents or whatever with your email.

If you send a normal plain text mail message, you don't need MIME. But as soon as you send anything else -- like an HTML message where you've made a word bold, changed color or inserted images -- you need it. MIME adds a Content-Type to the message saying "This is HTML mail, so you need to display it as HTML when you receive it" or "Here's a PDF attachment, so you need to display it in a PDF viewer". The headers for these two cases would look like this:

Content-Type: text/html
Content-Type: application/pdf

A lot of mail programs, for reasons that have never been particularly clear, like to send two copies of every mail message: one in plain text, one in HTML. They're two copies of the same message -- it's just that one version has fancier formatting than the other. The MIME header that announces this is

Content-Type: multipart/alternative
because the two versions, text and HTML, are alternative versions of the same message. The recipient need only read one, not both. Inside the multipart/alternative section there will be further MIME headers, one saying Content-Type: text/plain, where it puts the text of your message, and one Content-Type: text/html, where it puts HTML source code.

This mostly works fine for real mail programs (though it's a rather silly waste of bandwidth, sending double copies of everything for no particularly good reason, and personally I always configure the mailers I use to send only one copy at a time). But increasingly I'm seeing automated mail robots that send multipart/alternative mail, but do it wrong: they send HTML for both parts, or they send a valid HTML part and a blank text part.

Why don't the site owners notice the problem?

You wouldn't ever notice a problem if you use the default configuration on most mailers, to show the HTML part if at all possible. But most mail programs give you an option to show the text part if there is one. That way, you don't have to worry about those people who like to send messages in pink blinking text on a plaid background -- all you see is the text.

If your mailer is configured to show plain text, for most messages you'll see just text -- no colors, no blinking, no annoyances. But for mail sent by these misconfigured mail robots, what you'll see is HTML source code.

I've seen this in several places -- lots of spammers do it (who cares? I was going to delete the message anyway), and one of the local astronomy clubs does it so I've long since stopped trying to read their announcements. But the latest place I've seen this is one that ought to know better: Coursera. They apparently reconfigured their notification system recently, and I started getting course notifications that look like this:

/* Client-specific Styles */
#outlook a{padding:0;} /* Force Outlook to provide a "view in browser" button.
*/
body{width:100% !important;} .ReadMsgBody{width:100%;}
.ExternalClass{width:100%;} /* Force Hotmail to display emails at full width */
body{-webkit-text-size-adjust:none;} /* Prevent Webkit platforms from changing
default text sizes. */
/* Reset Styles */
body{margin:0; padding:0;}
img{border:0; height:auto; line-height:100%; outline:none;
text-decoration:none;}
table td{border-collapse:collapse;}
#backgroundTable{height:100% !important; margin:0; padding:0; width:100%
!important;}
p {margin-top: 14px; margin-bottom: 14px;}
/* /\/\/\/\/\/\/\/\/\/\ STANDARD STYLING: PREHEADER /\/\/\/\/\/\/\/\/\/\ */
.preheaderContent div a:link, .preheaderContent div a:visited, /* Yahoo! Mail
Override */ .preheaderContent div a .yshortcuts /* Yahoo! Mail Override */{
  color: #3b6e8f;
... and on and on like that. You get the idea. It's unreadable, even by a geek who knows HTML pretty well. It would be fine in the HTML part of the message -- but this is what they're sending in the text/plain part.

I filed a bug, but Coursera doesn't have a lot of staff to respond to bug reports and it might be quite some time before they fix this. Meanwhile, I don't want to miss notifications for the algorithms course I'm currently taking. So I needed a workaround.

How to work around the problem in mutt

I found one for mutt at alternative_order and folder-hook. When in my "classes" folder, I use a folder hook to tell mutt to prefer text/html format over text/plain, even though my default is text/plain. Then you also need to add a default folder hook to set the default back for every other folder -- mutt folder hooks are frustrating in that way.

The two folder hooks look like this:

folder-hook . 'set unalternative_order *; alternative_order text/plain text'

# Prefer the HTML part but only for Coursera,
# since it sends HTML in the text part.
folder-hook =in/coursera 'unalternative_order *; alternative_order text/html'

alternative_order specifies which types you'd most like to read. unalternative_order is a lot less clear; the documentation says it "removes a mime type from the alternative_order list", but doesn't say anything more than that. What's the syntax? What's the difference between using unalternative_order or just re-setting alternative_order? Why do I have to specify it with * in both places? No one seems to know.

So it's a little unsatisfying, and perhaps not the cleanest way. But it does work around the bug for sites where you really need a way to read the mail.

Update: I also found this discussion of alternative_order which gives a nice set of key bindings to toggle interactively between the various formats. It was missing some backslashes, so I had to fiddle with it slightly to get it to work. Put this in .muttrc:

macro pager ,@aoh= "\
<enter-command> unalternative_order *; \
alternative_order text/enriched text/html text/plain text;\
macro pager A ,@aot= 'toggle alternative order'<enter>\
<exit><display-message>"

macro pager ,@aot= "\
<enter-command> unalternative_order *; \
alternative_order text/enriched text/plain text/html text;\
macro pager A ,@aoh= 'toggle alternative order'<enter>\
<exit><display-message>"

macro pager A ,@aot= "toggle alternative order"

Then just type A (capital A) to toggle between formats. If it doesn't change the first time you type A, type another one and it should redisplay. I've found it quite handy.

Tags: , ,
[ 14:13 Jul 29, 2013    More tech/email | permalink to this entry | comments ]

Wed, 05 Jun 2013

Stop Emacs from invoking a browser

After upgrading my OS (in this case, to Debian sid), I noticed that my browser window kept being replaced with an HTML file I was editing in emacs. I'd hit Back or close the tab, and the next time I checked, there it was again, my HTML source.

I'm sure it's a nice feature that emacs can show me my HTML in a browser. But it's not cool to be replacing my current page without asking. How do I turn it off? A little searching revealed that this was html-autoview-mode, which apparently at some point started defaulting to ON instead of OFF. Running M-x html-autoview-mode toggles it back off for the current session -- but that's no help if I want it off every time I start emacs.

I couldn't find any documentation for this, and the obvious (html-autoview-mode nil) in .emacs didn't work -- first, it gives a syntax error because the function isn't defined until after you've loaded html-mode, but even if you put it in your html-mode hook, it still doesn't work.

I had to read the source of sgml-mode.el. (M-x describe-function html-autoview-mode also would have told me, if I had already loaded html-mode, but I didn't realize that until later.) Turns out html-autoview-mode turns off if its argument is negative, not nil. So I added it to my html derived mode:

(define-derived-mode html-wrap-mode html-mode "HTML wrap mode"
  (auto-fill-mode)
  ;; Don't call an external browser every time you save an html file:
  (html-autoview-mode -1)
)

Tags: , ,
[ 21:48 Jun 05, 2013    More linux/editors | permalink to this entry | comments ]

Tue, 03 Apr 2012

Displaying equations on the web

How do you show equations on a web page? Every now and then, I write an article that involves math, and I wrestle with that problem.

The obvious (but wrong) approach: MathML

It was nearly fifteen years ago that MathML was recommended as a standard for embedding equations inside an HTML page. I remember being excited about it back then. There were a few problems -- like the availability of fonts including symbols for integrals, summations and so forth -- but they seemed minor. That was 1998.

Now, in 2012, I found myself wanting to write an article involving an integral, so I looked into the state of MathML. I found that even now, all these years later, it wasn't widely supported.

In Firefox I could show some simple equations, like x = 0 x x and x = b ± b 2 4 a c 2 a

But when I tried them in Chromium, I learned that webkit-based browsers don't support MathML. At all. The exception is Safari: apparently Apple has added some MathML support into their browser but hasn't contributed that code back to webkit (yet?)

Besides that, MathML is ridiculously hard to use. Here's the code for that little integral:

<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow>
<semantics>
  <mrow>
    <msubsup>
      <mo>&int;</mo>
      <mn>x = 0</mn>
      <mi>&#x221E;</mi>
    </msubsup>
    <mfrac>
      <mrow>
        <mo>&dd;</mo>
        <mi>x</mi>
      </mrow>
      <mi>x</mi>
    </mfrac>
  </mrow>
</semantics>
</mrow>
</math>

Ugh! You can't even specify infinity without using an HTML numeric entity. And the code for the quadratic equation is even worse (use View Source if you want to see it).

Good ol' tables

Several years ago, I wrote about the Twelve Days of Christmas and how to calculate the total number of gifts represented in the song.

I needed summations, and I was rather proud of working out a way to use HTML tables to display all the sums and line up everything correctly. It wasn't exactly publication-quality graphics, but it was readable.

More recently, I worked out a way to do exponentials that way, and found a hint about how to do integrals:

now
P (t)  dt
P0 =————
1 + t
0

Looks a little better than the tiny MathML version. But the code isn't any easier to read:

<table border="0" cellpadding="0" cellspacing="0">
<tr><td><td align="center"><small><i>now</i></small></td><td></td><td></td></tr>
<tr>
 <td>
 <td rowspan="3" valign="middle"><font size="6" style="font-size:3em" class="bigsym">&#8747;</font>
 <td align="center"><i>P</i>&nbsp;(<i>t</i>)</td>

 <td rowspan="3" valign="middle">&nbsp;<i>dt</i></td></tr>
<tr><td>P<sub>0</sub> =<td align="center">&mdash;&mdash;&mdash;&mdash;</td></tr>
<tr><td><td align="center">1 + <i>t</i></td></tr>
<tr><td><td valign="top"><small><i>0</i></small></td><td></td><td></td></tr>
</table>

The solution: MathJax

And then I discovered MathJAX. It was added recently to the Udacity forums, and I think it's also what MITx is using for their courses.

MathJax is fantastic. It's an open-source library that lets you specify equations in readable ways -- you can use MathML, but you can also use LaTEX or even ASCII math like `x = (-b +- sqrt(b^2-4ac))/(2a) .`

It uses Javascript: you put your equations in the text of the page with delimiters like $$ around them (you can control the delimiters), then run a function that scans the page content and replaces any equations it sees with pretty graphics. (Viewers using NoScript or similar extensions will need to allow mathjax.org to see the equations, unless you make a local copy of the mathjax.org libraries, which you probably should anyway if you're using a lot of equations.)

For displaying those graphics, MathJax might use MathML, HTML and CSS, or whatever, depending on the user's browser ... but you don't have to worry about that. (Alas, even in Firefox, MathML rendering isn't up to par so MathJax doesn't use it by default, though you can specify it as an option if you know your equations render well.)

Here's that integral again, using LaTeX format: $$ P_0 =\int_0^\infty \frac {P(t) dt}{1 + t} $$ and $$ x = {-b \pm \sqrt{b^2-4ac} \over 2a} $$

It's beautiful! And although I don't know LaTex at all -- been wanting an excuse to learn it -- I put together that integral with five minutes of web searching. (The quadratic code came from a MathJax demo page.) Here's what the code looks like:

$$ P_0 =\int_0^\infty \frac {P(t) dt}{1 + t} $$

$$ x = {-b \pm \sqrt{b^2-4ac} \over 2a} $$

MathJax is even smart enough to notice the code there is in a <pre> tag, so I didn't have to find a way to escape it.

I'm sold! The MathJax team has really put together a nice package, and I think we'll be seeing it on a lot more websites. If you want to try it, start here: Getting Started with MathJAX.

Tags: , , , ,
[ 15:45 Apr 03, 2012    More science | permalink to this entry | comments ]

Thu, 12 Jan 2012

HTML and Javascript Presentations

When I give talks that need slides, I've been using my Slide Presentations in HTML and JavaScript for many years. I uploaded it in 2007 -- then left it there, without many updates.

But meanwhile, I've been giving lots of presentations, tweaking the code, tweaking the CSS to make it display better. And every now and then I get reminded that a few other people besides me are using this stuff.

For instance, around a year ago, I gave a talk where nearly all the slides were just images. Silly to have to make a separate HTML file to go with each image. Why not just have one file, img.html, that can show different images? So I wrote some code that lets you go to a URL like img.html?pix/whizzyphoto.jpg, and it will display it properly, and the Next and Previous slide links will still work.

Of course, I tweak this software mainly when I have a talk coming up. I've been working lately on my SCALE talk, coming up on January 22: Fun with Linux and Devices (be ready for some fun Arduino demos!) Sometimes when I overload on talk preparation, I procrastinate by hacking the software instead of the content of the actual talk. So I've added some nice changes just in the past few weeks.

For instance, the speaker notes that remind me of where I am in the talk and what's coming next. I didn't have any way to add notes on image slides. But I need them on those slides, too -- so I added that.

Then I decided it was silly not to have some sort of automatic reminder of what the next slide was. Why should I have to put it in the speaker notes by hand? So that went in too.

And now I've done the less fun part -- collecting it all together and documenting the new additions. So if you're using my HTML/JS slide kit -- or if you think you might be interested in something like that as an alternative to Powerpoint or Libre Office Presenter -- check out the presentation I have explaining the package, including the new features.

You can find it here: Slide Presentations in HTML and JavaScript

Tags: , , , , ,
[ 20:08 Jan 12, 2012    More speaking | permalink to this entry | comments ]

Sun, 08 Jan 2012

Parsing HTML in Python

I've been having (mis)adventures learning about Python's various options for parsing HTML.

Up until now, I've avoided doing any HTMl parsing in my RSS reader FeedMe. I use regular expressions to find the places where content starts and ends, and to screen out content like advertising, and to rewrite links. Using regexps on HTML is generally considered to be a no-no, but it didn't seem worth parsing the whole document just for those modest goals.

But I've long wanted to add support for downloading images, so you could view the downloaded pages with their embedded images if you so chose. That means not only identifying img tags and extracting their src attributes, but also rewriting the img tag afterward to point to the locally stored image. It was time to learn how to parse HTML.

Since I'm forever seeing people flamed on the #python IRC channel for using regexps on HTML, I figured real HTML parsing must be straightforward. A quick web search led me to Python's built-in HTMLParser class. It comes with a nice example for how to use it: define a class that inherits from HTMLParser, then define some functions it can call for things like handle_starttag and handle_endtag; then call self.feed(). Something like this:

from HTMLParser import HTMLParser

class MyFancyHTMLParser(HTMLParser):
  def fetch_url(self, url) :
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    link = response.geturl()
    html = response.read()
    response.close()
    self.feed(html)   # feed() starts the HTMLParser parsing

  def handle_starttag(self, tag, attrs):
    if tag == 'img' :
      # attrs is a list of tuples, (attribute, value)
      srcindex = self.has_attr('src', attrs)
      if srcindex < 0 :
        return   # img with no src tag? skip it
      src = attrs[srcindex][1]
      # Make relative URLs absolute
      src = self.make_absolute(src)
      attrs[srcindex] = (attrs[srcindex][0], src)

    print '<' + tag
    for attr in attrs :
      print ' ' + attr[0]
      if len(attr) > 1 and type(attr[1]) == 'str' :
        # make sure attr[1] doesn't have any embedded double-quotes
        val = attr[1].replace('"', '\"')
        print '="' + val + '"')
    print '>'

  def handle_endtag(self, tag):
    self.outfile.write('</' + tag.encode(self.encoding) + '>\n')

Easy, right? Of course there are a lot more details, but the basics are simple.

I coded it up and it didn't take long to get it downloading images and changing img tags to point to them. Woohoo! Whee!

The bad news about HTMLParser

Except ... after using it a few days, I was hitting some weird errors. In particular, this one:
HTMLParser.HTMLParseError: bad end tag: ''

It comes from sites that have illegal content. For instance, stories on Slate.com include Javascript lines like this one inside <script></script> tags:
document.write("<script type='text/javascript' src='whatever'></scr" + "ipt>");

This is technically illegal html -- but lots of sites do it, so protesting that it's technically illegal doesn't help if you're trying to read a real-world site.

Some discussions said setting self.CDATA_CONTENT_ELEMENTS = () would help, but it didn't.

HTMLParser's code is in Python, not C. So I took a look at where the errors are generated, thinking maybe I could override them. It was easy enough to redefine parse_endtag() to make it not throw an error (I had to duplicate some internal strings too). But then I hit another error, so I redefined unknown_decl() and _scan_name(). And then I hit another error. I'm sure you see where this was going. Pretty soon I had over 100 lines of duplicated code, and I was still getting errors and needed to redefine even more functions. This clearly wasn't the way to go.

Using lxml.html

I'd been trying to avoid adding dependencies to additional Python packages, but if you want to parse real-world HTML, you have to. There are two main options: Beautiful Soup and lxml.html. Beautiful Soup is popular for large projects, but the consensus seems to be that lxml.html is more error-tolerant and lighter weight.

Indeed, lxml.html is much more forgiving. You can't handle start and end tags as they pass through, like you can with HTMLParser. Instead you parse the HTML into an in-memory tree, like this:

  tree = lxml.html.fromstring(html)

How do you iterate over the tree? lxml.html is a good parser, but it has rather poor documentation, so it took some struggling to figure out what was inside the tree and how to iterate over it.

You can visit every element in the tree with

for e in tree.iter() :
  print e.tag

But that's not terribly useful if you need to know which tags are inside which other tags. Instead, define a function that iterates over the top level elements and calls itself recursively on each child.

The top of the tree itself is an element -- typically the <html></html> -- and each element has .tag and .attrib. If it contains text inside it (like a <p> tag), it also has .text. So to make something that works similarly to HTMLParser:

def crawl_tree(tree) :
  handle_starttag(tree.tag, tree.attrib)
  if tree.text :
    handle_data(tree.text)
  for node in tree :
    crawl_tree(node)
  handle_endtag(tree.tag)

But wait -- we're not quite all there. You need to handle two undocumented cases.

First, comment tags are special: their tag attribute, instead of being a string, is <built-in function Comment> so you have to handle that specially and not assume that tag is text that you can print or test against.

Second, what about cases like <p>Here is some <i>italicised</i> text.</p> ? in this case, you have the p tag, and its text is "Here is some ". Then the p has a child, the i tag, with text of "italicised". But what about the rest of the string, " text."?

That's called a tail -- and it's the tail of the adjacent i tag it follows, not the parent p tag that contains it. Confusing!

So our function becomes:

def crawl_tree(tree) :
  if type(tree.tag) is str :
    handle_starttag(tree.tag, tree.attrib)
    if tree.text :
      handle_data(tree.text)
    for node in tree :
      crawl_tree(node)
    handle_endtag(tree.tag)
  if tree.tail :
    handle_data(tree.tail)

See how it works? If it's a comment (tree.tag isn't a string), we'll skip everything -- except the tail. Even a comment might have a tail:
<p>Here is some <!-- this is a comment --> text we want to show.</p>
so even if we're skipping comment we need its tail.

I'm sure I'll find other gotchas I've missed, so I'm not releasing this version of feedme until it's had a lot more testing. But it looks like lxml.html is a reliable way to parse real-world pages. It even has a lot of convenience functions like link rewriting that you can use without iterating the tree at all. Definitely worth a look!

Tags: , ,
[ 14:04 Jan 08, 2012    More programming | permalink to this entry | comments ]

Fri, 25 Jun 2010

Use Firefox User CSS to make LinkedIn discussions scroll normally

Several groups I'm in insist on using LinkedIn for discussions, instead of a mailing list. No idea why -- it's so much harder to use -- but for some reason that's where the community went.

Which is fine except for happens just about every time I try to view a discussion: I get a notice of a thread that sounds interesting, click on the link to view it, read the first posting, hit the space bar to scroll down ... whoops! Focus was in that silly search field at the top right of the page, so it won't scroll.

It's even more fun if I've already scrolled down a bit with the mousewheel -- in that case, hitting spacebar jumps back up to the top of the page, losing any context I have as well as making me click in the page before I can actually scroll.

Setting focus to search fields is a good thing on some pages. Google does it, which makes terrific sense -- if you go to google.com, your main purpose is to type something in that search box.

It doesn't, however, make sense on a page whose purpose is to let people read through a long discussion thread.

Since I never use that search field anyway, though, I came up with a solution using Firefox's user css. It seems there's no way to make an input field un-focusable or read-only using pure CSS (of course, you could use Javascript and Greasemonkey for that); but as long as you don't need to use it, you can make it disappear entirely.

Add this line to chrome/userContent.css inside your Firefox profile (create it if it doesn't already exist):

form#global-search span#autocomplete-container input#main-search-box {
  visibility:hidden;
}

Then restart Firefox and load a discussion page. The search box should be hidden, and spacebar should scroll the page just like it does on most web pages.

Of course, this will need to be updated the next time LinkedIn changes their page layout. And it's vaguely possible that somewhere else on the web is a page with that hierarchy of element names. But that's easy enough to fix: run a View Page Source on the LinkedIn page and add another level or two to the CSS rule. The concept is the important thing.

Tags: , , , ,
[ 16:17 Jun 25, 2010    More tech/web | permalink to this entry | comments ]

Sun, 20 Jun 2010

Scaling HTML slides for different projectors with full-page zoom

Regular readers probably know that I use HTML for the slides in my talks, and I present them either with Firefox in fullscreen mode, or with my own Python preso tool based on webkit.

Most of the time it works great. But there's one situation that's always been hard to deal with: low-resolution projectors. Most modern projectors are 1024x768, and have been for quite a few years, so that's how I set up my slides. And then I get asked to give a talk at a school, or local astronomy club, or some other group that has a 10-year-old projector that can only handle 800x600. Of course, you never find out about this ahead of time, only when you plug in right before the talk. Disaster!

Wait -- before you object that HTML pages shouldn't use pixel values and should work regardless of the user's browser window size: I completely agree with you. I don't specify absolute font sizes or absolute positioning on web pages -- no one should. But presentation slides are different: they're designed for a controlled environment where everyone sees the same thing using the same software and hardware.

I can maintain a separate stylesheet -- that works for making the font size smaller but it doesn't address the problem of pictures too large to fit (and we all like to use lots of pictures in presentations, right?) I can maintain two separate copies of the slides for the two sizes, but that's a lot of extra work and they're bound to get out of sync.

Here's a solution I should have thought of years ago: full-page zoom. Most major browsers have offered that capability for years, so the only trick is figuring out how to specify it in the slides.

IE and the Webkit browsers (Safari, Konqueror, etc.) offer a wonderful CSS property called zoom. It works like this:

body {
  zoom: 78.125%;
}

78.125% is the ratio between an 800-pixel projector and a 1024-pixel one. Just add this line, and your whole page will be scaled down to the right size. Lovely!

Lovely, except it doesn't work on Firefox (bug 390936). Fortunately, Firefox has another solution: the more general and not yet standardized CSS transform, which Mozilla has implemented as the Mozilla-specific property -moz-transform. So add these lines:

body {
  position: absolute; left: 0px; top: 0px;
  -moz-transform: scale(.78125, .78125);
}

The position: absolute is needed because when Firefox scales with -moz-transform, it also centers whatever it scaled, so the slide ends up in the top center of the screen. On my laptop, at least, it's the upper left part of the screen that gets sent to the projector, so slides must start in the upper left corner.

The good news is that these directives don't conflict; you can put both zoom and -moz-transform in the same rule and things will work fine. So I've added this to the body rule in my slides.css:

  /* If you get stuck on an 800x600 projector, use these:
  zoom: 78.125%;
  position: absolute; left: 0px; top: 0px;
  -moz-transform: scale(.78125, .78125);
   */

Uncomment in case of emergency and all will be well. (Unless you use Opera, which doesn't seem to understand either version.)

Tags: , , , , ,
[ 11:14 Jun 20, 2010    More tech/web | permalink to this entry | comments ]

Mon, 29 Mar 2010

Non-greedy regular expressions to clean up crappy autogenerated HTML

I maintain the websites for several clubs. No surprise there -- anyone with even a moderate knowledge of HTML, or just a lack of fear of it, invariably gets drafted into that job in any non-computer club.

In one club, the person in charge of scheduling sends out an elaborate document every three months in various formats -- Word, RTF, Excel, it's different each time. The only regularity is that it's always full of crap that makes it hard to turn it into a nice simple HTML table.

This quarter, the formats were Word and RTF. I used unrtf to turn the RTF version into HTML -- and found a horrifying page full of lines like this:

<body><font size=3></font><font size=3><br>
</font><font size=3></font><b><font size=4></font></b><b><font size=4><table border=2>
</font></b><b><font size=4><tr><td><b><font size=4><font face="Arial">Club Schedule</font></font></b><b><font size=4></font></b><b><font size=4></font></b></td>
<font size=3></font><font size=3><td><font size=3><b><font face="Arial">April 13</font></b></font><font size=3></font><font size=3><br>
</font><font size=3></font><font size=3><b></b></font></td>
I've put the actual page content in bold; the rest is just junk, mostly doing nothing, mostly not even legal HTML, that needs to be eliminated if I want the page to load and display reasonably.

I didn't want to clean up that mess by hand! So I needed some regular expressions to clean it up in an editor. I tried emacs first, but emacs makes it hard to try an expression then modify it a little when the first try doesn't work, so I switched to vim.

The key to this sort of cleanup is non-greedy regular expressions. When you have a bad tag sandwiched in the middle of a line containing other tags, you want to remove everything from the <font through the next > -- but no farther, or else you'll delete real content. If you have a line like

<td><font size=3>Hello<font> world</td>
you only want to delete through the <font>, not through the </td>.

In general, you make a regular expression non-greedy by adding a ? after the wildcard -- e.g. <font.*?>. But that doesn't work in vim. In vim, you have to use \{M,N} which matches from M to N repetitions of whatever immediately precedes it. You can also use the shortcut \{-} to mean the same thing as *? (0 or more matches) in other programs.

Using that, I built up a series of regexp substitutes to clean up that unrtf mess in vim:

:%s/<\/\{0,1}font.\{-}>//g
:%s/<b><\/b>//g
:%s/<\/b><b>//g
:%s/<\/i><i>//g
:%s/<td><\/td>/<td><br><\/td>/g
:%s/<\/\{0,1}span.\{-}>//g
:%s/<\/\{0,1}center>//g

That took care of 90% of the work, leaving me with hardly any cleanup I needed to do by hand. I'll definitely keep that list around for the next time I need to do this.

Tags: , , ,
[ 22:02 Mar 29, 2010    More linux/editors | permalink to this entry | comments ]

Syndicated on:
LinuxChix Live
Ubuntu Women
Women in Free Software
Graphics Planet
DevChix
Ubuntu California
Planet Openbox
Devchix
Planet LCA2009

Friends' Blogs:
Morris "Mojo" Jones
Jane Houston Jones
Dan Heller
Long Live the Village Green
Ups & Downs
DailyBBG

Other Blogs of Interest:
DevChix
Scott Adams
Dave Barry
BoingBoing

Powered by PyBlosxom.