Shallow Thoughts : tags : mutt

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Thu, 15 Oct 2015

Viewer for email attachments in Office formats

I seem to have fallen into a nest of Mac users whose idea of email is a text part, an HTML part, plus two or three or seven attachments (no exaggeration!) in an unholy combination of .DOC, .DOCX, .PPT and other Microsoft Office formats, plus .PDF.

Converting to text in mutt

As a mutt user who generally reads all email as plaintext, normally my reaction to a mess like that would be "Thanks, but no thanks". But this is an organization that does a lot of good work despite their file format habits, and I want to help.

In mutt, HTML mail attachments are easy. This pair of entries in ~/.mailcap takes care of them:

text/html; firefox 'file://%s'; nametemplate=%s.html
text/html; lynx -dump %s; nametemplate=%s.html; copiousoutput
Then in .muttrc, I have
auto_view text/html
alternative_order text/plain text

If a message has a text/plain part, mutt shows that. If it has text/html but no text/plain, it looks for the "copiousoutput" mailcap entry, runs the HTML part through lynx (or I could use links or w3m) and displays that automatically. If, reading the message in lynx, it looks to me like the message has complex formatting that really needs a browser, I can go to mutt's attachments screen and display the attachment in firefox using the other mailcap entry.

Word attachments are not quite so easy, especially when there are a lot of them. The straightforward way is to save each one to a file, then run LibreOffice on each file, but that's slow and tedious and leaves a lot of temporary files behind. For simple documents, converting to plaintext is usually good enough to get the gist of the attachments. These .mailcap entries can do that:

application/msword; catdoc %s; copiousoutput
application/vnd.openxmlformats-officedocument.wordprocessingml.document; docx2txt %s -; copiousoutput
Alternatives to catdoc include wvText and antiword.

But none of them work so well when you're cross-referencing five different attachments, or for documents where color and formatting make a difference, like mail from someone who doesn't know how to get their mailer to include quoted text, and instead distinguishes their comments from the text they're replying to by making their new comments green (ugh!) For those, you really do need a graphical window.

I decided what I really wanted (aside from people not sending me these crazy emails in the first place!) was to view all the attachments as tabs in a new window. And the obvious way to do that is to convert them to formats Firefox can read.

Converting to HTML

I'd used wvHtml to convert .doc files to HTML, and it does a decent job and is fairly fast, but it can't handle .docx. (People who send Office formats seem to distribute their files fairly evenly between DOC and DOCX. You'd think they'd use the same format for everything they wrote, but apparently not.) It turns out LibreOffice has a command-line conversion program, unoconv, that can handle any format LibreOffice can handle. It's a lot slower than wvHtml but it does a pretty good job, and it can handle .ppt (PowerPoint) files too.

For PDF files, I tried using pdftohtml, but it doesn't always do so well, and it's hard to get it to produce a single HTML file rather than a directory of separate page files. And about three quarters of PDF files sent through email turn out to be PDF in name only: they're actually collections of images of single pages, wrapped together as a PDF file. (Mostly, when I see a PDF like that I just skip it and try to get the information elsewhere. But I wanted my program at least to be able to show what's in the document, and let the user choose whether to skip it.) In the end, I decided to open a firefox tab and let Firefox's built-in PDF reader show the file, though popping up separate mupdf windows is also an option.

I wanted to show the HTML part of the email, too. Sometimes there's formatting there (like the aforementioned people whose idea of quoting messages is to type their replies in a different color), but there can also be embedded images. Extracting the images and showing them in a browser window is a bit tricky, but it's a problem I'd already solved a couple of years ago: Viewing HTML mail messages from Mutt (or other command-line mailers).

Showing it all in a new Firefox window

So that accounted for all the formats I needed to handle. The final trick was the firefox window. Since some of these conversions, especially unoconv, are quite slow, I wanted to pop up a window right away with a "converting, please wait..." message. Initially, I used a javascript: URL, running the command:

firefox -new-window "javascript:document.writeln('<br><h1>Translating documents, please wait ...</h1>');"

I didn't want to rely on Javascript, though. A data: URL, which I hadn't used before, can do the same thing without javascript:

firefox -new-window "data:text/html,<br><br><h1>Translating documents, please wait ...</h1>"

But I wanted the first attachment to replace the contents of that same window as soon as it was ready, and then subsequent attachments open a new tab in that window. But it turned out that firefox is inconsistent about what -new-window and -new-tab do; there's no guarantee that -new-tab will show up in the same window you recently popped up with -new-window, and running just firefox URL might open in either the new window or the old, in a new tab or not, or might not open at all. And things got even more complicated after I decided that I should use -private-window to open these attachments in private browsing mode.

In the end, the only way firefox would behave in a repeatable, predictable way was to use -private-window for everything. The first call pops up the private window, and each new call opens a new tab in the private window. If you want two separate windows for two different mail messages, you're out of luck: you can't have two different private windows. I decided I could live with that; if it eventually starts to bother me, I can always give up on Firefox and write a little python-webkit wrapper to do what I need.

Using a file redirect instead

But that still left me with no way to replace the contents of the "Please wait..." window with useful content. Someone on #firefox came up with a clever idea: write the content to a page with a meta redirect.

So initially, I create a file pleasewait.html that includes the header:

<meta http-equiv="refresh" content="2;URL=pleasewait.html">
(other HTML, charset information, etc. as needed). The meta refresh means Firefox will reload the file every two seconds. When the first converted file is ready, I just change the header to redirect to URL=first_converted_file.html. Meanwhile, I can be opening the other documents in additional tabs.

Finally, I added the command to my .muttrc. When I'm viewing a message either in the index or pager screens, F10 will call the script and decode all the attachments.

macro index <F10> "<pipe-message>~/bin/viewmailattachments\n" "View all attachments in browser"
macro pager <F10> "<pipe-message>~/bin/viewmailattachments\n" "View all attachments in browser"

Whew! It was trickier than I thought it would be. But I find I'm using it quite a bit, and it takes a lot of the pain out of those attachment-full emails.

The script is available at: viewmailattachments on GitHub.

Tags: , , , , ,
[ 15:18 Oct 15, 2015    More linux | permalink to this entry | comments ]

Mon, 07 Oct 2013

Viewing HTML mail messages from Mutt (or other command-line mailers)

Command-line mailers like mutt have one disadvantage: viewing HTML mail with embedded images. Without images, HTML mail is no problem -- run it through lynx, links or w3m. But if you want to see images in place, how do you do it?

Mutt can send a message to a browser like firefox ... but only the textual part of the message. The images don't show up.

That's because mail messages include images, not as separate files, but as attachments within the same file, encoded it a format known as MIME (Multipurpose Internet Mail Extensions). An image link in the HTML, instead of looking like <img src="picture.jpg">., will instead look something like <img src="cid:0635428E-AE25-4FA0-93AC-6B8379300161">. (Apple's or <img src="">. (Yahoo's webmail).

CID stands for Content ID, and refers to the ID of the image as it is encoded in MIME inside the image. GUI mail programs, of course, know how to decode this and show the image. Mutt doesn't.

A web search finds a handful of shell scripts that use the munpack program (part of the mpack package on Debian systems) to split off the files; then they use various combinations of sed and awk to try to view those files. Except that none of the scripts I found actually work for messages sent from modern mailers -- they don't decode the CID links properly.

I wasted several hours fiddling with various shell scripts, trying to adjust sed and awk commands to figure out the problem, when I had the usual epiphany that always eventually arises from shell script fiddling: "Wouldn't this be a lot easier in Python?"

Python's email package

Python has a package called email that knows how to list and unpack MIME attachments. Starting from the example near the bottom of that page, it was easy to split off the various attachments and save them in a temp directory. The key is

import email

fp = open(msgfile)
msg = email.message_from_file(fp)

for part in msg.walk():

That left the problem of how to match CIDs with filenames, and rewrite the links in the HTML message accordingly.

The documentation on the email package is a bit unclear, unfortunately. For instance, they don't give any hints what object you'll get when iterating over a message with walk, and if you try it, they're just type 'instance'. So what operations can you expect are legal on them? If you run help(part) in the Python console on one of the parts you get from walk, it's generally class Message, so you can use the Message API, with functions like get_content_type(), get_filename(). and get_payload().

More useful, it has dictionary keys() for the attributes it knows about each attachment. part.keys() gets you a list like

 'Content-Disposition' ]

So by making a list relating part.get_filename() (with a made-up filename if it doesn't have one already) to part['Content-ID'], I'd have enough information to rewrite those links.

Case-insensitive dictionary matching

But wait! Not so simple. That list is from a Yahoo mail message, but if you try keys() on a part sent by Apple mail, instead if will be 'Content-Id'. Note the lower-case d, Id, instead of the ID that Yahoo used.

Unfortunately, Python doesn't have a way of looking up items in a dictionary with the key being case-sensitive. So I used a loop:

    for k in part.keys():
        if k.lower() == 'content-id':
            print "Content ID is", part[k]

Most mailers seem to put angle brackets around the content id, so that would print things like "Content ID is <>". Those angle brackets have to be removed, since the CID links in the HTML file don't have them.

for k in part.keys():
    if k.lower() == 'content-id':
        if part[k].startswith('<') and part[k].endswith('>'):
            part[k] = part[k][1:-1]

But that didn't work -- the angle brackets were still there, even though if I printed part[k][1:-1] it printed without angle brackets. What was up?

Unmutable parts inside email.Message

It turned out that the parts inside an email Message (and maybe the Message itself) are unmutable -- you can't change them. Python doesn't throw an exception; it just doesn't change anything. So I had to make a local copy:

for k in part.keys():
    if k.lower() == 'content-id':
        content_id = part[k]
        if content_id.startswith('<') and content_id.endswith('>'):
            content_id = content_id[1:-1]
and then save content_id, not part[k], in my list of filenames and CIDs.

Then the rest is easy. Assuming I've built up a list called subfiles containing dictionaries with 'filename' and 'Content-Id', I can do the substitution in the HTML source:

    htmlsrc = html_part.get_payload(decode=True)
    for sf in subfiles:
        htmlsrc = re.sub('cid: ?' + sf['Content-Id'],
                         'file://' + sf['filename'],
                         htmlsrc, flags=re.IGNORECASE)

Then all I have to do is hook it up to a key in my .muttrc:

# macro  index  <F10>  "<copy-message>/tmp/mutttmpbox\n<enter><shell-escape>~/bin/\n" "View HTML in browser"
# macro  pager  <F10>  "<copy-message>/tmp/mutttmpbox\n<enter><shell-escape>~/bin/\n" "View HTML in browser"

Works nicely! Here's the complete script: viewhtmlmail.

Tags: , , , , ,
[ 11:49 Oct 07, 2013    More tech/email | permalink to this entry | comments ]

Mon, 29 Jul 2013

Mutt: Show HTML for sites that send broken HTML mail

Increasingly I'm seeing broken sites that send automated HTML mail with headers claiming it's plain text.

To understand what's happening, you have to know about something called MIME multipart/alternative.

MIME stands for Multipurpose Internet Mail Extensions: it's the way mail encodes different types of attachments, so you can attach images, music, PDF documents or whatever with your email.

If you send a normal plain text mail message, you don't need MIME. But as soon as you send anything else -- like an HTML message where you've made a word bold, changed color or inserted images -- you need it. MIME adds a Content-Type to the message saying "This is HTML mail, so you need to display it as HTML when you receive it" or "Here's a PDF attachment, so you need to display it in a PDF viewer". The headers for these two cases would look like this:

Content-Type: text/html
Content-Type: application/pdf

A lot of mail programs, for reasons that have never been particularly clear, like to send two copies of every mail message: one in plain text, one in HTML. They're two copies of the same message -- it's just that one version has fancier formatting than the other. The MIME header that announces this is

Content-Type: multipart/alternative
because the two versions, text and HTML, are alternative versions of the same message. The recipient need only read one, not both. Inside the multipart/alternative section there will be further MIME headers, one saying Content-Type: text/plain, where it puts the text of your message, and one Content-Type: text/html, where it puts HTML source code.

This mostly works fine for real mail programs (though it's a rather silly waste of bandwidth, sending double copies of everything for no particularly good reason, and personally I always configure the mailers I use to send only one copy at a time). But increasingly I'm seeing automated mail robots that send multipart/alternative mail, but do it wrong: they send HTML for both parts, or they send a valid HTML part and a blank text part.

Why don't the site owners notice the problem?

You wouldn't ever notice a problem if you use the default configuration on most mailers, to show the HTML part if at all possible. But most mail programs give you an option to show the text part if there is one. That way, you don't have to worry about those people who like to send messages in pink blinking text on a plaid background -- all you see is the text.

If your mailer is configured to show plain text, for most messages you'll see just text -- no colors, no blinking, no annoyances. But for mail sent by these misconfigured mail robots, what you'll see is HTML source code.

I've seen this in several places -- lots of spammers do it (who cares? I was going to delete the message anyway), and one of the local astronomy clubs does it so I've long since stopped trying to read their announcements. But the latest place I've seen this is one that ought to know better: Coursera. They apparently reconfigured their notification system recently, and I started getting course notifications that look like this:

/* Client-specific Styles */
#outlook a{padding:0;} /* Force Outlook to provide a "view in browser" button.
body{width:100% !important;} .ReadMsgBody{width:100%;}
.ExternalClass{width:100%;} /* Force Hotmail to display emails at full width */
body{-webkit-text-size-adjust:none;} /* Prevent Webkit platforms from changing
default text sizes. */
/* Reset Styles */
body{margin:0; padding:0;}
img{border:0; height:auto; line-height:100%; outline:none;
table td{border-collapse:collapse;}
#backgroundTable{height:100% !important; margin:0; padding:0; width:100%
p {margin-top: 14px; margin-bottom: 14px;}
/* /\/\/\/\/\/\/\/\/\/\ STANDARD STYLING: PREHEADER /\/\/\/\/\/\/\/\/\/\ */
.preheaderContent div a:link, .preheaderContent div a:visited, /* Yahoo! Mail
Override */ .preheaderContent div a .yshortcuts /* Yahoo! Mail Override */{
  color: #3b6e8f;
... and on and on like that. You get the idea. It's unreadable, even by a geek who knows HTML pretty well. It would be fine in the HTML part of the message -- but this is what they're sending in the text/plain part.

I filed a bug, but Coursera doesn't have a lot of staff to respond to bug reports and it might be quite some time before they fix this. Meanwhile, I don't want to miss notifications for the algorithms course I'm currently taking. So I needed a workaround.

How to work around the problem in mutt

I found one for mutt at alternative_order and folder-hook. When in my "classes" folder, I use a folder hook to tell mutt to prefer text/html format over text/plain, even though my default is text/plain. Then you also need to add a default folder hook to set the default back for every other folder -- mutt folder hooks are frustrating in that way.

The two folder hooks look like this:

folder-hook . 'set unalternative_order *; alternative_order text/plain text'

# Prefer the HTML part but only for Coursera,
# since it sends HTML in the text part.
folder-hook =in/coursera 'unalternative_order *; alternative_order text/html'

alternative_order specifies which types you'd most like to read. unalternative_order is a lot less clear; the documentation says it "removes a mime type from the alternative_order list", but doesn't say anything more than that. What's the syntax? What's the difference between using unalternative_order or just re-setting alternative_order? Why do I have to specify it with * in both places? No one seems to know.

So it's a little unsatisfying, and perhaps not the cleanest way. But it does work around the bug for sites where you really need a way to read the mail.

Update: I also found this discussion of alternative_order which gives a nice set of key bindings to toggle interactively between the various formats. It was missing some backslashes, so I had to fiddle with it slightly to get it to work. Put this in .muttrc:

macro pager ,@aoh= "\
<enter-command> unalternative_order *; \
alternative_order text/enriched text/html text/plain text;\
macro pager A ,@aot= 'toggle alternative order'<enter>\

macro pager ,@aot= "\
<enter-command> unalternative_order *; \
alternative_order text/enriched text/plain text/html text;\
macro pager A ,@aoh= 'toggle alternative order'<enter>\

macro pager A ,@aot= "toggle alternative order"

Then just type A (capital A) to toggle between formats. If it doesn't change the first time you type A, type another one and it should redisplay. I've found it quite handy.

Tags: , ,
[ 15:13 Jul 29, 2013    More tech/email | permalink to this entry | comments ]

Sat, 23 Aug 2008

Forwarding a message with attachments in Mutt

I've wanted to know forever how to forward a message with all or some of its attachments from mutt.

You can set the variable mime_forward so that when you forward a message, it includes the entire message, headers, attachments and all, as a single attachment. You can't edit this or change it in any way. If you want to trim the original message, or omit one of the attachments, you're out of luck.

I've found two ways to do it.

First: type v to get to the attachments screen. Type t repeatedly to tag all the attachments, including the initial small text/plain attachment (that's the original message body). When they're all tagged, type ;f (forward all tagged attachments). After you fill in the To: prompt, you'll be able to edit the message body, and when you leave the editor, you'll have the attachment list there to edit as you see fit.

If that doesn't work (I haven't tried it on HTML messages), there's a slightly more elaborate procedure: use
Esc-e   resend-message   use the current message as a template for a new one.
This calls up an editor on the current message, including headers. Change the From to your name, the To to your intended recipient, and edit the message body to your heart's content. When you're done, you're sent to the Compose screen, where you can adjust the attachment list and send the message.

Forwarding is pretty clearly not what Esc-E was intended for ... but it does the job and might be a handy trick to know.

Tags: , , ,
[ 23:14 Aug 23, 2008    More linux | permalink to this entry | comments ]