Shallow Thoughts : : editors

Akkana's Musings on Open Source, Science, and Nature.

Mon, 29 Mar 2010

Non-greedy regular expressions to clean up crappy autogenerated HTML

I maintain the websites for several clubs. No surprise there -- anyone with even a moderate knowledge of HTML, or just a lack of fear of it, invariably gets drafted into that job in any non-computer club.

In one club, the person in charge of scheduling sends out an elaborate document every three months in various formats -- Word, RTF, Excel, it's different each time. The only regularity is that it's always full of crap that makes it hard to turn it into a nice simple HTML table.

This quarter, the formats were Word and RTF. I used unrtf to turn the RTF version into HTML -- and found a horrifying page full of lines like this:

<body><font size=3></font><font size=3><br>
</font><font size=3></font><b><font size=4></font></b><b><font size=4><table border=2>
</font></b><b><font size=4><tr><td><b><font size=4><font face="Arial">Club Schedule</font></font></b><b><font size=4></font></b><b><font size=4></font></b></td>
<font size=3></font><font size=3><td><font size=3><b><font face="Arial">April 13</font></b></font><font size=3></font><font size=3><br>
</font><font size=3></font><font size=3><b></b></font></td>
I've put the actual page content in bold; the rest is just junk, mostly doing nothing, mostly not even legal HTML, that needs to be eliminated if I want the page to load and display reasonably.

I didn't want to clean up that mess by hand! So I needed some regular expressions to clean it up in an editor. I tried emacs first, but emacs makes it hard to try an expression then modify it a little when the first try doesn't work, so I switched to vim.

The key to this sort of cleanup is non-greedy regular expressions. When you have a bad tag sandwiched in the middle of a line containing other tags, you want to remove everything from the <font through the next > -- but no farther, or else you'll delete real content. If you have a line like

<td><font size=3>Hello<font> world</td>
you only want to delete through the <font>, not through the </td>.

In general, you make a regular expression non-greedy by adding a ? after the wildcard -- e.g. <font.*?>. But that doesn't work in vim. In vim, you have to use \{M,N} which matches from M to N repetitions of whatever immediately precedes it. You can also use the shortcut \{-} to mean the same thing as *? (0 or more matches) in other programs.

Using that, I built up a series of regexp substitutes to clean up that unrtf mess in vim:

:%s/<\/\{0,1}font.\{-}>//g
:%s/<b><\/b>//g
:%s/<\/b><b>//g
:%s/<\/i><i>//g
:%s/<td><\/td>/<td><br><\/td>/g
:%s/<\/\{0,1}span.\{-}>//g
:%s/<\/\{0,1}center>//g

That took care of 90% of the work, leaving me with hardly any cleanup I needed to do by hand. I'll definitely keep that list around for the next time I need to do this.

Tags: , , ,
[ 22:02 Mar 29, 2010    More linux/editors | permalink to this entry ]

Tue, 09 Feb 2010

Make scroll lock stop beeping in Emacs 23

I haven't been using the spare machine much lately. So I hadn't noticed until last week that since upgrading to the emacs 23.1.1 on Ubuntu Karmic koala, every time I press the Scroll Lock key -- the key my KVM uses to switch to the other computer -- with focus in an emacs window, emacs beeps and complains that the key is unbound.

That was a problem I thought I'd solved long ago, an easy fix in .emacs:

(global-set-key [scroll-lock] 'ignore)
But in emacs 23, it wasn't working any more. Emacs listed the key as "<Scroll_Lock>", but using that directly in global-set-key doesn't work.

The friendly and helpful (really!) crew at #emacs found me a solution, after some fiddling around.

(global-set-key (kbd "<Scroll_Lock>") 'ignore)

Tags: , , ,
[ 22:47 Feb 09, 2010    More linux/editors | permalink to this entry ]

Wed, 13 Jan 2010

Taming Emacs' text mode wrapping and indenting

To wrap long lines, or not to wrap? It's always a dilemma. Automatic wrapping is great when you're hammering away typing lots of text. But it's infuriating when you're trying to format something yourself and the editor decides it wants to line-wrap a little too early.

Although of course you can set the wrapping width, Emacs has a tendency to wrap early -- especially when you hit return. All too often, I'll be typing away at a long line, get to the end of the sentence and paragraph with the last word on the same line with the rest -- then realize that as soon as I hit return, Emacs is going to move that last word to a line by itself. Drives me nuts!

And the solution turns out to be so simple. The Return key, "\C-m". was bound to the (newline) function (you can find out by typing M-x, then describe-key, then hitting Return). Apparently (newline) re-wraps the current line before inserting a line break. But I just wanted it to insert a line break.

No problem -- just bind "C-m" to (insert "\n").

But there's a second way, too, if you don't want to rebind: there's a magic internal emacs table you can change.

(set-char-table-range auto-fill-chars 10 nil)

But wait -- there's one other thing I want to fix in text mode.

Automatic indent is another one of those features that's very convenient ... except when it's not.

If I have some text like:

First point:
  - subpoint a
  - subpoint b
then it's handy if, when I hit Return after subpoint a, emacs indents to the right level for subpoint b. But what happens when I get to the end of that list?
First point:
  - subpoint a
  - subpoint b

Second point:
  - subpoint c

When I hit Return after subpoint b, Emacs quite reasonably indents two spaces. If I immediately type another Return, Emacs sensibly deletes the two spaces it just inserted, opens a new line -- but then it indents that new line another two spaces.

After a blank line, I always want to start at the beginning, not indented at all.

Here's how to fix that. Define a function that will be called whenever you hit return in text mode. That function tests whether the caret comes immediately after a blank line, or at the beginning of the file. It indents except in those two cases; and in neither case does it re-wrap the current line.

;; In text mode, I don't want it auto-indenting for the first
;; line in the file, or lines following blank lines.
;; Everywhere else is okay.
(defun newline-and-text-indent ()
  "Insert a newline, then indent the next line sensibly for text"
  (interactive)
  (cond
   ;; Beginning of buffer, or beginning of an existing line, don't indent:
   ((or (bobp) (bolp)) (newline))

   ;; If we're on a whitespace-only line,
   ((and (eolp)
         (save-excursion (re-search-backward "^\\(\\s \\)*$"
                                             (line-beginning-position) t)))
    ;; ... delete the whitespace, then add another newline:
    (kill-line 0)
    (newline))

   ;; Else (not on whitespace-only) insert a newline,
   ;; then add the appropriate indent:
   (t (insert "\n")
      (indent-according-to-mode))
   ))

Then tell emacs to call that function when it sees the Return key in text mode:

(defun text-indent-hook ()
  (local-set-key "\C-m" 'newline-and-text-indent)
  )
(setq text-mode-hook 'text-indent-hook)

Finally, this is great for HTML mode too, if you get irritated at not being able to put an <a href="longurl"> all on one line:

(defun html-hook ()
  (local-set-key "\C-m" (lambda () (interactive) (insert "\n")))
  )
(setq sgml-mode-hook 'html-hook)

Tags: , ,
[ 10:29 Jan 13, 2010    More linux/editors | permalink to this entry ]

Wed, 29 Jul 2009

Emacs: Insert tags in HTML Mode

Wouldn't it be nice if Emacs HTML mode had a way to insert HTML tags, so you didn't have to type <b></b> all the time? Sort of like what's described in this page -- except that page describes an HTML mode that clearly isn't the one that's installed on Ubuntu, since none of those bindings actually work?

I've been meaning to figure out a way to do that for ages, and finally got around to it. Turns out Emacs SGML mode (which is really what Ubuntu installs and uses for HTML files) doesn't have functions for specific HTML tags like <b>, but it does have a general tag-inserting function.

Type C-c C-t -- emacs prompts you for the tag, so type b or whatever, and hit return -- and you get the tag, with the cursor correctly positioned for you to type your new bold text.

But that's four keystrokes. What if you want shorter bindings for particular tags, like C-b C-b to insert a bold?

For that, you need to use a lambda and a mode hook. In your .emacs it looks like this:

;; Define keys for inserting tags in HTML mode:
(defun html-hook ()
  (local-set-key "\C-c\C-b" (lambda () (interactive) (sgml-tag "b")))
  )
(setq sgml-mode-hook 'html-hook)

There's apparently also supposed to be a command bound to C-c / that closes the current tag, but my version of sgml-mode doesn't bind anything to that key, and the only likely-looking function name, sgml-maybe-end-tag, doesn't end the current tag. Such is life!

But one more don't-miss feature that I'd missed all along is C-c C-n: type it before a special character like < or & and emacs will insert the appropriate &lt; or &amp; for you. Nice!

(Thanks to bojohan on #emacs for the tips!)

Tags: , ,
[ 20:31 Jul 29, 2009    More linux/editors | permalink to this entry ]

Fri, 27 Mar 2009

Emacs bookmarks -- a huge time-saver

Oh, wow. I can't believe I've used Emacs all these years without knowing about bookmarks.

I wanted something in Emacs akin to the "Open Recent" menu that a lot of GUI apps have. Except, well, I didn't want it to need a menu (I don't normally show a menubar in Emacs) and I didn't want it limited only to recently accessed files. So ... just like Open Recent, only completely different.

What I really wanted was a way to nickname files I access regularly, so I don't have to type ~/foo/bar/blaz/route-66/dufus/velociraptor/archaeopteryx/filename every time. Even with tab completion, remembering long paths gets old. Of course emacs must have a way to do that; it has everything. The trick was guessing what it might be called in order to search for it.

The answer is emacs bookmarks and they're super easy to use.

C-x r m sets a bookmark for the current location in the current file. It prompts for a bookmark name; give it a nickname, or hit return to default it to the current filename.

C-x r b bookmark-name jumps back to a bookmark, opening the file if it isn't already. Of course, tab completion works for the bookmark name.

Bookmarks are saved in ~/.emacs.bmk so they're persistent.

It's perfect. I just wish I'd thought to look for it years ago.

(Of course, Emacs can do recent files too.)

Tags: , ,
[ 09:21 Mar 27, 2009    More linux/editors | permalink to this entry ]

Sun, 22 Mar 2009

Vim tip: fixing the light-background color schemes

I use a light background for my X terminals (xterm and rxvt): not white, but a light grey that I find easy on the eyes. Long ago, I spent the time to set up a custom vim color scheme that works with the light background.

But sometimes I need to run vim somewhere where I don't have access to my custom scheme. It always starts up with a lot of the words displayed in yellow, completely unreadable against a light background. :set background=light doesn't help -- the default colorscheme is already intended for a light background, yet it still uses yellow characters.

I tried all the colorschemes installed with ubuntu's vim (you can get a list of them with ls /usr/share/vim/vim71/colors). The only light-background vim schemes that don't use yellow all have their primary text color as red. Do a lot of people really want to edit red text? Maybe the same people who think that yellow is a dark color?

Curiously, it turns out that if you use one of these light color schemes on a Linux console (with its black background), the yellow text isn't yellow (which would show up fine against black), but orange (which would be better on a light background).

Mikael knew the answer:

:set t_Co=88

This tells vim to use 88-color mode instead of its default of 8, and the yellow text turns light blue. Not terrifically readable but much better than yellow. Or, instead, try

:set t_Co=256
and the yellow/light blue text turns an ugly, but readable, orange (probably the same orange as the console used).

So, vim users with dark-on-light terminal schemes: add set t_Co=256 in your .vimrc (no colon) and you'll be much happier.

Update: Pádraig Brady has a great page explaining more about terminal colour highlights, including a TERM=xterm-256color setting to get vim to use 256 colors automatically. There's also a lot of good advice there on enabling colors in other console apps.

The only catch: on Ubuntu you do have to install the ncurses-term package, which will get you xterm-256color as well as 256color variants for lots of other terminal types. Here's useful page on 256-Color XTerms in Ubuntu.

Tags: , , ,
[ 21:29 Mar 22, 2009    More linux/editors | permalink to this entry ]

Sun, 12 Oct 2008

More fun with regexps: Adding "[no output]" in shell logs

Someone on LinuxChix' techtalk list asked whether she could get tcsh to print "[no output]" after any command that doesn't produce output, so that when she makes logs to help her co-workers, they will seem clearer.

I don't know of a way to do that in any shell (the shell would have to capture the output of every command; emacs' shell-mode does that but I don't think any real shells do) but it seemed like it ought to be straightforward enough to do as a regular expression substitute in vi. You're looking for lines where a line beginning with a prompt is followed immediately by another line beginning with a prompt; the goal is to insert a new line consisting only of "[no output]" between the two lines.

It turned out to be pretty easy in vim. Here it is:

:%s/\(^% .*$\n\)\(% \)/\1[no results]\r\2/

Explanation:

:
starts a command
%
do the following command on every line of this file
s/
start a global substitute command
\(
start a "capture group" -- you'll see what it does soon
^
match only patterns starting at the beginning of a line
%
look for a % followed by a space (your prompt)
.*
after the prompt, match any other characters until...
$
the end of the line, after which...
\n
there should be a newline character
\)
end the capture group after the newline character
\(
start a second capture group
%
look for another prompt. In other words, this whole
expression will only match when a line starting with a prompt
is followed immediately by another line starting with a prompt.
\)
end the second capture group
/
We're finally done with the mattern to match!
Now we'll start the replacement pattern.
\1
Insert the full content of the first capture group
(this is also called a "backreference" if you want
to google for a more detailed explanation).
So insert the whole first command up to the newline
after it.
[no results]
After the newline, insert your desired string.
\r
insert a carriage return here (I thought this should be
\n for a newline, but that made vim insert a null instead)
\2
insert the second capture group (that's just the second prompt)
/
end of the substitute pattern

Of course, if you have a different prompt, substitute it for "% ". If you have a complicated prompt that includes time of day or something, you'll have to use a slightly more complicated match pattern to match it.

Tags: , , , ,
[ 13:34 Oct 12, 2008    More linux/editors | permalink to this entry ]

Thu, 20 Dec 2007

Smart Wrapping with Greedy and Non-Greedy Regular Expressions

I had a chance to spend a day at the AGU conference last week. The American Geophysical Union is a fabulous conference -- something like 14,000 different talks over the course of the week, on anything related to earth or planetary sciences -- geology, solar system astronomy, atmospheric science, geophysics, geochemistry, you name it.

I have no idea how regular attendees manage the information overload of deciding which talks to attend. I wasn't sure how I would, either, but I started by going through the schedule for the day I'd be there, picking out a (way too long) list of potentially interesting talks, and saving them as lines in a file.

Now I had a file full of lines like:

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights Into the Cratering Process From Geophysics and Geochemistry II
Fine, except that I couldn't print out something like that -- printers stop at 80 columns. I could pass it through a program like "fold" to wrap the long lines, but then it would be hard to scan through quickly to find the talk titles and room numbers. What I really wanted was to wrap it so that the above line turned into something like:
1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics
                          and Geochemistry II
But how to do that? I stared at it for a while, trying to figure out whether there was a clever vim substitute that could handle it. I asked on a couple of IRC channels, just in case there was some amazing Linux smart-wrap utility I'd never heard of. I was on the verge of concluding that the answer was no, and that I'd have to write a python script to do the wrapping I wanted, when Mikael emitted a burst of line noise:
%s/\(.\{72\}\)\(.*\)/\1^M^I^I^I\2/

Only it wasn't line noise. Seems Mikael just happened to have been reading about some of the finer points of vim regular expressions earlier that day, and he knew exactly the trick I needed -- that .\{72\}, which matches lines that are at least 72 characters long. And amazingly, that expression did something very close to what I wanted.

Or at least the first step of it. It inserts the first line break, turning my line into

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics and Geochemistry II
but I still needed to wrap the second and subsequent lines.

But that was an easier problem -- just do essentially the same thing again, but limit it to only lines starting with a tab. After some tweaking, I arrived at exactly what I wanted:

%s/^\(.\{,65\}\) \(.*\)/\1^M^I^I^I\2/

%g/^^I^I^I.\{58\}/s/^\(.\{,55\}\) \(.*\)/\1^M^I^I^I\2/
I had to run the second line two or three times to wrap the very long lines.

Devdas helpfully translated the second one into English: "You have 3 tabs, followed by 58 characters, out of which you match the first 55 and put that bit in $1, and the capture the remaining in $2, and rewrite to $1 newline tab tab tab $2."

Here's a more detailed breakdown:

Line one:
% Do this over the whole file
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,65\}_ Look for up to 65 characters with a space at the end
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M Insert saved pattern #1 (the first 65 lines ending with a space) followed by a newline
^I^I^I\2 On the second line, insert three tabs then saved pattern #2
/ End replacement pattern

Line two:
%g/ Over the whole file, only operate on lines with this pattern
^^I^I^I Lines starting with three tabs
.\{58\}/ After the tabs, only match lines that still have at least 58 characters (this guards against wrapping already wrapped lines when it's run repeatedly)
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,55\} Up to 55 characters
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M The first pattern (up to 55 chars) is one line
^I^I^I\2 Three tabs then the second pattern
/ End replacement pattern

Greedy and non-greedy brace matches

The real key is those curly-brace expressions, \{,65\} and \{58\} -- that's how you control how many characters vim will match and whether or not the match is "greedy". Here's how they work (thanks to Mikael for explaining).

The basic expression is {M,N} -- it means between M and N matches of whatever precedes it. (Vim requires that the first brace be escaped -- \{}. Escaping the second brace is optional.) So .{M,N} can match anything between M and N characters but "prefer" N, i.e. try to match as many as possible up to N. To make it "non-greedy" (match as few as possible, "preferring" M), use .{-M,N}

You can leave out M, N, or both; M defaults to 0 while N defaults to infinity. So {} is short for {0,∞} and is equivalent to *, while {-} means {-0,∞}, like a non-greedy version of *.

Given the string: one, two, three, four, five
,.\{}, matches , two, three, four,
,.\{-}, matches , two,
,.\{5,}, matches , two, three, four,
,.\{-5,}, matches , two, three,
,.\{,2}, matches nothing
,.\{,7}, matches , two,
,.\{5,7}, matches , three,

Of course, this syntax is purely for vim; regular expressions are unfortunately different in sed, perl and every other program. Here's a fun table of regexp terms in various programs.

Tags: , ,
[ 11:44 Dec 20, 2007    More linux/editors | permalink to this entry ]