Shallow Thoughts

Akkana's Musings on Open Source, Science, and Nature.

Thu, 20 Dec 2007

Smart Wrapping with Greedy and Non-Greedy Regular Expressions

I had a chance to spend a day at the AGU conference last week. The American Geophysical Union meeting is a fabulous conference -- something like 14,000 different talks over the course of the week, on anything related to earth or planetary sciences -- geology, solar system astronomy, atmospheric science, geophysics, geochemistry, you name it.

I have no idea how regular attendees manage the information overload of deciding which talks to attend. I wasn't sure how I would, either, but I started by going through the schedule for the day I'd be there, picking out a (way too long) list of potentially interesting talks, and saving them as lines in a file.

Now I had a file full of lines like:

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights Into the Cratering Process From Geophysics and Geochemistry II
Fine, except that I couldn't print out something like that -- printers stop at 80 columns. I could pass it through a program like "fold" to wrap the long lines, but then it would be hard to scan through quickly to find the talk titles and room numbers. What I really wanted was to wrap it so that the above line turned into something like:
1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics
                          and Geochemistry II
But how to do that? I stared at it for a while, trying to figure out whether there was a clever vim substitute that could handle it. I asked on a couple of IRC channels, just in case there was some amazing Linux smart-wrap utility I'd never heard of. I was on the verge of concluding that the answer was no, and that I'd have to write a python script to do the wrapping I wanted, when Mikael emitted a burst of line noise:
%s/\(.\{72\}\)\(.*\)/\1^M^I^I^I\2/

Only it wasn't line noise. Seems Mikael just happened to have been reading about some of the finer points of vim regular expressions earlier that day, and he knew exactly the trick I needed -- that .\{72\}, which matches lines that are at least 72 characters long. And amazingly, that expression did something very close to what I wanted.

Or at least the first step of it. It inserts the first line break, turning my line into

1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights
                          Into the Cratering Process From Geophysics and Geochemistry II
but I still needed to wrap the second and subsequent lines.

But that was an easier problem -- just do essentially the same thing again, but limit it to only lines starting with a tab. After some tweaking, I arrived at exactly what I wanted:

%s/^\(.\{,65\}\) \(.*\)/\1^M^I^I^I\2/

%g/^^I^I^I.\{58\}/s/^\(.\{,55\}\) \(.*\)/\1^M^I^I^I\2/
I had to run the second line two or three times to wrap the very long lines.

Devdas helpfully translated the second one into English: "You have 3 tabs, followed by 58 characters, out of which you match the first 55 and put that bit in $1, and then capture the remaining in $2, and rewrite to $1 newline tab tab tab $2."

Here's a more detailed breakdown:

Line one:
% Do this over the whole file
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,65\}_ Look for up to 65 characters with a space at the end
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M Insert saved pattern #1 (up to the first 65 characters, ending just before a space) followed by a newline
^I^I^I\2 On the second line, insert three tabs then saved pattern #2
/ End replacement pattern

Line two:
%g/ Over the whole file, only operate on lines with this pattern
^^I^I^I Lines starting with three tabs
.\{58\}/ After the tabs, only match lines that still have at least 58 characters (this guards against wrapping already wrapped lines when it's run repeatedly)
s/ Begin global substitute
^ Start at the beginning of the line
\( Remember the result of the next match
.\{,55\} Up to 55 characters
\) \( End of remembered pattern #1, skip a space, and start remembered pattern #2
.*\) Pattern #2 includes everything to the end of the line
/ End of matched pattern; begin replacement pattern
\1^M The first pattern (up to 55 chars) is one line
^I^I^I\2 Three tabs then the second pattern
/ End replacement pattern
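
For comparison, the same hanging-indent wrap can be sketched in Python with the standard textwrap module. This is only an illustration, not the script mentioned above: the function name smart_wrap, the 70-column width, and the 26-space indent are assumptions, picked so that the sample line breaks the same way as the vim result.

```python
import textwrap

def smart_wrap(line, width=70, indent=26):
    """Wrap one schedule line; continuation lines get `indent` leading spaces.

    width and indent are illustrative defaults, not values from the vim commands.
    """
    return textwrap.fill(
        line,
        width=width,
        subsequent_indent=" " * indent,  # line up continuations under the title
        break_long_words=False,          # never split a word in the middle
    )

sample = ("1020      U22A    MS 303  Terrestrial Impact Cratering: New Insights "
          "Into the Cratering Process From Geophysics and Geochemistry II")
print(smart_wrap(sample))
```

Unlike the vim version, this indents with spaces rather than three tabs and wraps every overlong line in one pass, so there is nothing to run repeatedly.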

Greedy and non-greedy brace matches

The real key is those curly-brace expressions, \{,65\} and \{58\} -- that's how you control how many characters vim will match and whether or not the match is "greedy". Here's how they work (thanks to Mikael for explaining).

The basic expression is {M,N} -- it means between M and N matches of whatever precedes it. (Vim requires that the first brace be escaped -- \{}. Escaping the second brace is optional.) So .{M,N} can match anything between M and N characters but "prefer" N, i.e. try to match as many as possible up to N. To make it "non-greedy" (match as few as possible, "preferring" M), use .{-M,N}

You can leave out M, N, or both; M defaults to 0 while N defaults to infinity. So {} is short for {0,∞} and is equivalent to *, while {-} means {-0,∞}, like a non-greedy version of *.

Given the string: one, two, three, four, five
,.\{}, matches , two, three, four,
,.\{-}, matches , two,
,.\{5,}, matches , two, three, four,
,.\{-5,}, matches , two, three,
,.\{,2}, matches nothing
,.\{,7}, matches , two,
,.\{5,7}, matches , three,
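
The same greedy and non-greedy behavior can be tried outside vim. Here is a small sketch using Python's re module, where a trailing ? on a quantifier plays the role of vim's leading minus sign. The helper name first_match and the pairing of vim patterns with Python patterns are my own illustration, not part of either tool.

```python
import re

TEXT = "one, two, three, four, five"

def first_match(pattern, text=TEXT):
    """Return the first substring that matches pattern, or None if nothing does."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# (vim pattern, rough Python equivalent)
PAIRS = [
    (r",.\{},",    r",.*,"),      # greedy, zero or more
    (r",.\{-},",   r",.*?,"),     # non-greedy, zero or more
    (r",.\{5,},",  r",.{5,},"),   # greedy, at least 5
    (r",.\{-5,},", r",.{5,}?,"),  # non-greedy, at least 5
    (r",.\{,2},",  r",.{,2},"),   # at most 2
    (r",.\{,7},",  r",.{,7},"),   # at most 7
    (r",.\{5,7},", r",.{5,7},"),  # between 5 and 7
]

for vim_pat, py_pat in PAIRS:
    print(f"{vim_pat:12} {py_pat:12} {first_match(py_pat)!r}")
```

Each Python pattern picks out the same substring as the vim pattern beside it in the list above, with None standing in for "matches nothing".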

Of course, this syntax is purely for vim; regular expressions are unfortunately different in sed, perl and every other program. Here's a fun table of regexp terms in various programs.

[ 11:44 Dec 20, 2007    More linux/editors | permalink to this entry ]

Mon, 19 Feb 2007

Emacs with Long Lines

I don't like composing text documents in word processors like Open Office. Call it a quirk if you like, but I find them intrusive: they take up a lot of CPU and memory, they take up a lot of window space for stuff I don't need while I'm writing (all those margins and rulers and toolbars and such) making it hard to compare two documents at once, and they tend to have intrusive focus behavior (like popping windows to the front when I didn't ask for it).

So when I need to write a paper (or a book), I prefer to compose in a text editor like vim or emacs, something that won't get in the way of my train of thought. When it's mostly written and ready to format, then I start up the big heavyweight word processor and import or paste the text into it.

(For those of you who think I'm insane and should just live in Open Office all day, the same problem comes up for people who do a lot of composing for web applications, such as an online blog, gmail, a web forum, or a wiki, and for people who want a choice of editor for their GUI mail app.)

Fine, but that introduces a problem. See, text editors have a fixed line width (typically 80 characters, though of course you can adjust this) and paragraphs are usually separated by blank lines (two newline characters together). Word processors expect each paragraph to be one long line for the whole paragraph, and line breaks are used as paragraph breaks (but you only want one of them, not two). How do you reconcile these two models in order to paste plaintext from an editor into a word processor?

Several years ago when I first encountered this problem, I investigated solutions in both vim and emacs (oddly enough, I'm an editor agnostic and equally happy in either one).

For vim, I never did find a solution to the problem, so that settled the editor choice for me. Perhaps some vim expert can let me know what I missed.

For emacs, I found longlines-mode, a hack which lets long lines appear to be wrapped while you're editing them even though they're really not. Apparently Wikipedia has this issue and some Wikipedia contributors use longlines-mode too. (That page also has brief notes on alternate solutions.)

I used longlines-mode for a long time, and it's more or less functional, but I was never really happy with it. It turns out to have some pretty annoying bugs which I was forever needing to work around, and it doesn't solve the blank-lines problem -- you still need to delete blank lines before or after pasting.

Yesterday I was working on an essay for a class I'm taking and decided I'd had enough of longlines-mode and wanted a better solution. I poked around and chatted with the nice folks on #emacs (hoping that someone had come up with a better solution, but no one knew of one) and based on some ideas they had, I came up with one of my own.

My new method is to edit the text file normally: line breaks where they look good, blank lines to separate paragraphs. When I'm finished writing and ready to paste, I run M-x wp-munge, which calls up a very simple function I wrote and added to my .emacs:

;; For composing in emacs then pasting into a word processor,
;; this un-fills all the paragraphs (i.e. turns each paragraph
;; into one very long line) and removes any blank lines that
;; previously separated paragraphs.
;;
(defun wp-munge () "un-fill paragraphs and remove blank lines" (interactive)
  (let ((save-fill-column fill-column))
    (set-fill-column 1000000)
    (mark-whole-buffer)
    (fill-individual-paragraphs (point-min) (point-max))
    (delete-matching-lines "^$")
    (set-fill-column save-fill-column) ))

So simple! Why didn't I think of doing it that way before?
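
For anyone who isn't in emacs, the same un-filling idea can be sketched as a few lines of Python. This is an illustration only: the function name unfill is hypothetical, and unlike the elisp version it also collapses runs of spaces and drops any indentation inside a paragraph.

```python
import re

def unfill(text):
    """Join each paragraph into one long line, one paragraph per line.

    Paragraphs are assumed to be separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # p.split() breaks on any whitespace, including the newlines inside a paragraph
    return "\n".join(" ".join(p.split()) for p in paragraphs) + "\n"

print(unfill("First paragraph,\nwrapped by hand.\n\nSecond paragraph,\nalso wrapped.\n"))
```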

[ 20:10 Feb 19, 2007    More linux/editors | permalink to this entry ]

Sun, 14 May 2006

Linkifying with Regular Expressions

I had a page of plaintext which included some URLs in it, like this:
Tour of the Hayward Fault
http://www.mcs.csuhayward.edu/~shirschf/tour-1.html

Technical Reports on Hayward Fault
http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm

I wanted to add links around each of the urls, so that I could make it part of a web page, more like this:

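Tour of the Hayward Fault
<a href="http://www.mcs.csuhayward.edu/~shirschf/tour-1.html">http://www.mcs.csuhayward.edu/~shirschf/tour-1.html</a>

Technical Reports on Hayward Fault
<a href="http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm">http://quake.usgs.gov/research/geology/docs/lienkaemper_docs06.htm</a>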

Surely there must be a program to do this, I thought. But I couldn't find one that was part of a standard Linux distribution.

But you can do a fair job of linkifying just using a regular expression in an editor like vim or emacs, or by using sed or perl from the commandline. You just need to specify the input pattern you want to change, then how you want to change it.

Here's a recipe for linkifying with regular expressions.

Within vim:

:%s_\(https\=\|ftp\)://\S\+_<a href="&">&</a>_

If you're new to regular expressions, it might be helpful to see a detailed breakdown of why this works:

:
Tell vim you're about to type a command.
%
The following command should be applied everywhere in the file.
s_
Do a global substitute, and everything up to the next underscore will represent the pattern to match.
\(
This will be a list of several alternate patterns.
http
If you see an "http", that counts as a match.
s\=
Zero or one esses after the http will match: so http and https are okay, but httpsssss isn't.
\|
Here comes another alternate pattern that you might see instead of http or https.
ftp
URLs starting with ftp are okay too.
\)
We're done with the list of alternate patterns.
://
After the http, https or ftp there should always be a colon-slash-slash.
\S
After the ://, there must be a character which is not whitespace.
\+
There can be any number of these non-whitespace characters as long as there's at least one. Keep matching until you see a space.
_
Finally, the underscore that says this is the end of the pattern to match. Next (until the final underscore) will be the expression which will replace the pattern.
<a href="&">
An ampersand, &, in a substitute expression means "insert everything that was in the original pattern". So the whole url will be inserted between the quotation marks.
&</a>
Now, outside the <a href="..."> tag, insert the matched url again, and follow it with a </a> to close the tag.
_
The final underscore which says "this is the end of the replacement pattern". We're done!

Linkifying from the commandline using sed

Sed is a bit trickier: it doesn't understand \S for non-whitespace, nor \= for "zero or one occurrence". But this expression does the trick:
sed -e 's_\(http\|https\|ftp\)://[^ \t]\+_<a href="&">&</a>_' <infile.txt >outfile.html
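
The same substitution can also be sketched in Python. This is a rough equivalent rather than a polished tool: the names URL_RE and linkify are made up for the example, it replaces every URL on a line (not just the first), and it does no HTML escaping or trimming of trailing punctuation.

```python
import re

# http, https, or ftp, then ://, then one or more non-whitespace characters
URL_RE = re.compile(r"(?:https?|ftp)://\S+")

def linkify(text):
    """Wrap each URL found in text in an <a href="..."> tag."""
    # \g<0> in the replacement stands for the whole match, like & in vim and sed
    return URL_RE.sub(r'<a href="\g<0>">\g<0></a>', text)

print(linkify("Tour of the Hayward Fault\n"
              "http://www.mcs.csuhayward.edu/~shirschf/tour-1.html\n"))
```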

Addendum: George Riley tells me about VST for Vim 7, which looks like a nice package to linkify, htmlify, and various other useful things such as creating HTML presentations. I don't have Vim 7 yet, but once I do I'll definitely check out VST.

[ 12:40 May 14, 2006    More linux/editors | permalink to this entry ]

Wed, 29 Mar 2006

Emacs: Typing dashes in html mode

What to do with a few extra hours in a boring motel with no net access? How about digging into fixing one of Emacs' more annoying misfeatures?

Whenever I edit an html file using emacs, I find I have to stay away from double dashes -- I can't add a phrase such as this one. If I forget and type a phrase with a double dash, then as soon as I get to the end of that line and emacs decides it's time to wrap to the next line, it "helpfully" treats the double dashes as a comment, and indents the next line to the level where the dashes were, adding another set of dashes. I've googled, I've asked on emacs IRC help channels, but there doesn't seem to be any way out. (I guess no one else ever uses double dashes in html files?)

It's frustrating: I like using double dashes now and then. And aside from the occasional boneheaded misfeature like this one, I like using emacs. But the dash problem has been driving me nuts for a long time now. So I finally dug into the code to cure it.

First, the file is sgml-mode.el, so don't bother searching anything with html in the name. On my system it's /usr/share/emacs/21.4/lisp/textmodes/sgml-mode.el. Edit that file and search for "--" and the first thing you'll find (well, after the file's preamble comments) is a comment in the definition of "sgml-specials" saying that if you include ?- in the list of specials, it will hork the typing of double dashes, so that's normally left out.

A clue! Perhaps some Debian or Ubuntu site file has changed sgml-specials for me, and all I need to do is change it back! So I typed

M-x describe-variable sgml-specials
to see the current setting.

Um ... it's set to "34". That's not very helpful. I haven't a clue how that translates to the list of characters I see in sgml-mode.el. Forget about that approach for now.

Searching through the file for the string "comment" got me a few more hits, and I tried commenting out various comment handling lines until the evil behavior went away. (I had to remove sgml-mode.elc first, otherwise emacs wouldn't see any changes I made to sgml-mode.el. If you haven't done much elisp hacking, the .el is the lisp source, while the .elc is a byte-compiled version which loads quicker but isn't intended to be edited by humans. For Java programmers, the .elc is sort of like a .class file.)

Commenting out these four lines did the trick:

  (set (make-local-variable 'font-lock-syntactic-keywords)
       '(("\\(<\\)! *--.*-- *\\(>\\)" (1 "!") (2 "!"))))
  ;; This will allow existing comments within declarations to be
  ;; recognized.
  (set (make-local-variable 'comment-start-skip) "\\(?:\\)?")

To regenerate the .elc file so sgml-mode will load faster, I ran emacs as root from the directory sgml-mode.el was in, and typed:

M-x byte-compile-file sgml-mode.el

All better! And now I know where to find documentation for all those useful-looking, but seemingly undocumented, keyboard shortcuts that go along with emacs' html mode. Just search in the file for html-mode-map, and you'll find all sorts of useful stuff.

For instance, typing Ctrl-C Ctrl-C followed by various letters inserts tags: u gets you an unordered list, h gets you an href tag, i an image tag, and so on, with the cursor positioned where you want to type next.

It doesn't seem to offer any basic inline formatting (like <i> or <em>), alas; but of course that's easy to add by editing the file (or maybe even in .emacs). To add an <em> tag, add this line to html-mode-map:

    (define-key map "\C-c\C-ce" 'html-em)
then add this function somewhere near where html-headline-1 and friends are defined:
(define-skeleton html-em
  "HTML emphasis tags."
  nil
  "<em>" _ "</em>")

Of course, you can define any set of tags you use often, not just <em>.

HTML mode in emacs should be much more fun and less painful now!

Update: If you don't want to modify the files as root, it also works fine to copy sgml-mode.el to wherever you keep personal elisp files. For instance, put it in a directory called ~/.emacs-lisp then add this to your .emacs:
(setq load-path (cons "~/.emacs-lisp/" load-path))

[ 21:48 Mar 29, 2006    More linux/editors | permalink to this entry ]

Wed, 22 Jun 2005

Helpful Vim Tip: Finding Syntax for Colors

An upgrade from woody to sarge introduced a new problem with editing mail messages in vim: Subject lines appeared in yellow, against my light grey background, so they weren't readable any more.

Vim color files have always been a mystery to me. I have one which I adapted from one of the standard color schemes, but I've never been clear what the legal identifiers are or how to find out. But I changed both places where it said "ctermfg=Yellow" to another color, and nothing changed, so this time I had to find out.

Fortunately a nice person on #vim suggested :he synID (he is short for "help", of course) which told me all I needed to know. Put the cursor on the errant line and type: :echo synIDattr(synID(line("."), col("."), 1), "name")

That told me that the Subject line was syntax class "mailSubject". So I tried (copying other lines in my color file) adding this line:

hi mailSubject term=underline ctermfg=Red guifg=Red
and now all is happy again in vim land. I wish I'd learned that synID trick a long time ago!

[ 09:59 Jun 22, 2005    More linux/editors | permalink to this entry ]

Sat, 19 Feb 2005

Tweaking Emacs' Text Indent: Don't Indent So Aggressively

Encouraged by my success a few days ago at finally learning how to disable vim's ctrl-spacebar behavior, the next day I went back to an emacs problem that's been bugging me for a while: in text mode, newline-and-indent always wants to indent the first line of a text file (something I almost never want), and skips blank lines when calculating indent (so starting a new paragraph doesn't reset the indent back to zero).

I had already googled to no avail, and had concluded that the only way was to write a new text-indent function which could be bound to the return key in the text mode hook.

This went fairly smoothly: I got a little help in #emacs with checking the pattern immediately before the cursor (though I turned out not to need that after all) and with finding the function called "bobp" (beginning of buffer predicate). Here's what I ended up with:

(defun newline-and-text-indent ()
  "Insert a newline, then indent the next line sensibly for text"
  (interactive)
  (if (or (bobp)
          (looking-at "^$"))
      (newline)
      (newline-and-indent)
  ))
(defun text-indent-hook ()
  (local-set-key "\C-m" 'newline-and-text-indent)
  )
(add-hook 'text-mode-hook 'text-indent-hook)

It seems to work fine. For the curious, here's my current .emacs

[ 13:03 Feb 19, 2005    More linux/editors | permalink to this entry ]

Thu, 17 Feb 2005

Turning off Ctrl-Space in Vim

One of those niggling problems that has plagued me for a long time: in the editor vim, if I'm typing along in insert mode and instead of a space I accidentally hit control-space, vim inserts a bunch of text I didn't want, then exits insert mode. Meanwhile I'm still merrily typing away, typing what are now vim commands, which invariably end up deleting the last two paragraphs I typed and then doing several more operations which end up erasing the undo buffer so I can't get those paragraphs back.

Ctrl-space inserts a null character (you can verify this by putting it in a file and running od -xc on it). I've done lots of googling in the past, but it's hard to google on strings like " " or even "space" or "null", and nobody I asked had a clue what this function was called (it turns out it re-inserts whatever the last inserted characters were) so I couldn't google on the function name.

Vim's help suggests that <Nul>, <Char-0>, or <C-^V > should do it. I tried them with map, vmap, cmap, and nmap, to no avail. I also tried <C-@> since that's a historical way of referring to the null character, googling found some references to that in vim, and that's how it displays if I type it in vim.

I finally found #vim on freenode, and asked there. Last night nobody knew, but this morning, p0g found the problem: I needed to use imap, not the map/vmap/cmap/nmap I'd been using.

So here, preserved for google posterity in case other people are plagued by this problem, is the answer:

imap <Nul> <Space>

For good measure, I also mapped the character to no-op in all the other modes as well:

map  <Nul> <Nop>
vmap <Nul> <Nop>
cmap <Nul> <Nop>
nmap <Nul> <Nop>

My current .vimrc.

[ 10:24 Feb 17, 2005    More linux/editors | permalink to this entry ]

Thu, 03 Feb 2005

Emacs Color Themes

A nifty emacs trick I learned about today: ColorThemes.

Instead of the old hacked-together color collection I've been using in emacs, I can load color-theme.el and choose from lots of different color schemes.

I added these lines to .emacs:

(require 'font-lock)
(if (fboundp 'global-font-lock-mode) (global-font-lock-mode 1))
(load "~/.emacs-lisp/color-theme.el")
(color-theme-ramangalahy)  ;; pick a favorite theme

The disadvantage is that color-theme.el is fifteen thousand lines long! So I'll probably make a local version that strips out all but the theme I actually use (then I can customize that).

The (global-font-lock-mode 1) tells emacs to use syntax highlighting on every file, not just certain types. So now I get at least some highlighting even in html files, though it still doesn't seem to be able to highlight like vim does (e.g. different colors for text inside <i> or <b> tags).

[ 17:57 Feb 03, 2005    More linux/editors | permalink to this entry ]