Viewer for email attachments in Office formats (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Thu, 15 Oct 2015

Viewer for email attachments in Office formats

Update, December 2022:
viewmailattachments has been integrated with another mutt helper, viewhtmlmail.py, which can show HTML messages complete with embedded images. It's described in the article View Mail Attachments from Mutt and the script is at viewmailattachments.py. It no longer uses the "please wait" screen described in this article, but the rest of the discussion still applies.

I seem to have fallen into a nest of Mac users whose idea of email is a text part, an HTML part, plus two or three or seven attachments (no exaggeration!) in an unholy combination of .DOC, .DOCX, .PPT and other Microsoft Office formats, plus .PDF.

Converting to text in mutt

As a mutt user who generally reads all email as plaintext, normally my reaction to a mess like that would be "Thanks, but no thanks". But this is an organization that does a lot of good work despite their file format habits, and I want to help.

In mutt, HTML mail attachments are easy. This pair of entries in ~/.mailcap takes care of them:

text/html; firefox 'file://%s'; nametemplate=%s.html
text/html; lynx -dump %s; nametemplate=%s.html; copiousoutput

Then in .muttrc, I have:

auto_view text/html
alternative_order text/plain text

If a message has a text/plain part, mutt shows that. If it has text/html but no text/plain, it looks for the "copiousoutput" mailcap entry, runs the HTML part through lynx (or I could use links or w3m) and displays that automatically. If, reading the message in lynx, it looks to me like the message has complex formatting that really needs a browser, I can go to mutt's attachments screen and display the attachment in firefox using the other mailcap entry.

Word attachments are not quite so easy, especially when there are a lot of them. The straightforward way is to save each one to a file, then run LibreOffice on each file, but that's slow and tedious and leaves a lot of temporary files behind. For simple documents, converting to plaintext is usually good enough to get the gist of the attachments. These .mailcap entries can do that:

application/msword; catdoc %s; copiousoutput
application/vnd.openxmlformats-officedocument.wordprocessingml.document; docx2txt %s -; copiousoutput

Alternatives to catdoc include wvText and antiword.
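If you'd rather do the same conversion from a script than from mailcap, the mapping might look something like this. It's only a sketch: the table and the function name text_command are made up for illustration, and the commands are the same ones as in the mailcap entries above.

```python
import shlex

# MIME type -> command template, copied from the mailcap entries above.
# "%s" stands for the attachment's filename, just as it does in mailcap.
TEXT_CONVERTERS = {
    "application/msword": "catdoc %s",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        "docx2txt %s -",
}


def text_command(mime_type, filename):
    """Return an argv list that converts filename to text, or None
    if there is no converter for this MIME type."""
    template = TEXT_CONVERTERS.get(mime_type)
    if template is None:
        return None
    # Substitute the filename as a single argument, so that names
    # containing spaces need no shell quoting.
    return [filename if word == "%s" else word
            for word in shlex.split(template)]
```

The resulting list can be handed to subprocess.run() without going through a shell.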

But none of them works well when you're cross-referencing five different attachments, or for documents where color and formatting make a difference, like mail from someone who doesn't know how to get their mailer to include quoted text, and instead distinguishes their comments from the text they're replying to by making their new comments green (ugh!). For those, you really do need a graphical window.

I decided what I really wanted (aside from people not sending me these crazy emails in the first place!) was to view all the attachments as tabs in a new window. And the obvious way to do that is to convert them to formats Firefox can read.

Converting to HTML

I'd used wvHtml to convert .doc files to HTML, and it does a decent job and is fairly fast, but it can't handle .docx. (People who send Office formats seem to distribute their files fairly evenly between DOC and DOCX. You'd think they'd use the same format for everything they wrote, but apparently not.) It turns out there's a command-line conversion program, unoconv, that drives LibreOffice and can handle any format LibreOffice can handle. It's a lot slower than wvHtml but it does a pretty good job, and it can handle .ppt (PowerPoint) files too.
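As a rough sketch of that choice in Python (not necessarily how viewmailattachments.py does it): use wvHtml for .doc and fall back to unoconv for the formats wvHtml can't read. The function name html_convert_command is hypothetical, and the command-line options are the ones I understand wvHtml (--targetdir) and unoconv (-f, -o) to accept, so check the man pages for your versions.

```python
import os


def html_convert_command(path, outdir):
    """Return an argv list that converts path to HTML inside outdir,
    or None if the extension isn't one handled here."""
    base, ext = os.path.splitext(os.path.basename(path))
    ext = ext.lower()
    if ext == ".doc":
        # wvHtml is fast, but only understands the old .doc format.
        return ["wvHtml", "--targetdir=" + outdir, path, base + ".html"]
    if ext in (".docx", ".ppt"):
        # unoconv is slower, but handles whatever LibreOffice can open.
        return ["unoconv", "-f", "html", "-o", outdir, path]
    return None
```

Returning a list rather than a string means the command can be passed to subprocess.call() as-is, even for attachment names with spaces in them.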

For PDF files, I tried using pdftohtml, but it doesn't always do so well, and it's hard to get it to produce a single HTML file rather than a directory of separate page files. And about three quarters of PDF files sent through email turn out to be PDF in name only: they're actually collections of images of single pages, wrapped together as a PDF file. (Mostly, when I see a PDF like that I just skip it and try to get the information elsewhere. But I wanted my program at least to be able to show what's in the document, and let the user choose whether to skip it.) In the end, I decided to open a firefox tab and let Firefox's built-in PDF reader show the file, though popping up separate mupdf windows is also an option.

I wanted to show the HTML part of the email, too. Sometimes there's formatting there (like the aforementioned people whose idea of quoting messages is to type their replies in a different color), but there can also be embedded images. Extracting the images and showing them in a browser window is a bit tricky, but it's a problem I'd already solved a couple of years ago: Viewing HTML mail messages from Mutt (or other command-line mailers).

Showing it all in a new Firefox window

So that accounted for all the formats I needed to handle. The final trick was the firefox window. Since some of these conversions, especially unoconv, are quite slow, I wanted to pop up a window right away with a "converting, please wait..." message. Initially, I used a javascript: URL, running the command:

firefox -new-window "javascript:document.writeln('<br><h1>Translating documents, please wait ...</h1>');"

I didn't want to rely on JavaScript, though. A data: URL, which I hadn't used before, can do the same thing without JavaScript:

firefox -new-window "data:text/html,<br><br><h1>Translating documents, please wait ...</h1>"
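When building a data: URL like that from a script, it's safer to percent-encode the HTML so that spaces and angle brackets survive the trip through the command line. A minimal sketch, where html_data_url is just an illustrative name:

```python
from urllib.parse import quote


def html_data_url(html):
    """Wrap an HTML snippet in a data: URL, percent-encoding anything
    that isn't safe to put in a URL."""
    return "data:text/html," + quote(html)
```

For example, html_data_url("<h1>Translating documents, please wait ...</h1>") gives a string that can be passed to firefox as a single argument.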

I wanted the first attachment to replace the contents of that same window as soon as it was ready, and subsequent attachments to open new tabs in that window. But it turned out that firefox is inconsistent about what -new-window and -new-tab do; there's no guarantee that -new-tab will show up in the same window you recently popped up with -new-window, and running just firefox URL might open in either the new window or the old, in a new tab or not, or might not open at all. And things got even more complicated after I decided that I should use -private-window to open these attachments in private browsing mode.

In the end, the only way firefox would behave in a repeatable, predictable way was to use -private-window for everything. The first call pops up the private window, and each new call opens a new tab in the private window. If you want two separate windows for two different mail messages, you're out of luck: you can't have two different private windows. I decided I could live with that; if it eventually starts to bother me, I can always give up on Firefox and write a little python-webkit wrapper to do what I need.
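In script form, "use -private-window for everything" comes down to running one firefox command per converted file. Here's a sketch of that, assuming firefox is on your PATH; the names tab_commands and open_tabs are made up for this example.

```python
import subprocess
from pathlib import Path


def tab_commands(paths):
    """Build one 'firefox -private-window URL' command per file,
    turning each local path into a file:// URL."""
    return [["firefox", "-private-window", Path(p).resolve().as_uri()]
            for p in paths]


def open_tabs(paths):
    """Run the commands: the first call pops up the private window,
    and each later call adds a tab to it."""
    for cmd in tab_commands(paths):
        subprocess.call(cmd)
```

Path.as_uri() takes care of percent-encoding spaces and other awkward characters in attachment names.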

Using a file redirect instead

But that still left me with no way to replace the contents of the "Please wait..." window with useful content. Someone on #firefox came up with a clever idea: write the content to a page with a meta redirect.

So initially, I create a file pleasewait.html that includes the header:

<meta http-equiv="refresh" content="2;URL=pleasewait.html">

The file also contains whatever other HTML, charset information, etc. is needed. The meta refresh means Firefox will reload the file every two seconds. When the first converted file is ready, I just change the header to redirect to URL=first_converted_file.html. Meanwhile, I can be opening the other documents in additional tabs.
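Here's a minimal sketch of the refresh trick in Python. The page template and the function name write_wait_page are illustrative rather than taken from the actual script, and writing to a temporary file before renaming is my own addition, so that the browser never reloads a half-written page.

```python
import os

# A bare-bones "please wait" page. The meta refresh tells the browser
# to load {target} after {delay} seconds.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="refresh" content="{delay};URL={target}">
<title>Please wait</title>
</head>
<body>
<h1>Translating documents, please wait ...</h1>
</body>
</html>
"""


def write_wait_page(path, target, delay=2):
    """(Re)write the wait page so that it refreshes to target."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fp:
        fp.write(PAGE.format(delay=delay, target=target))
    # Rename over the old page in one step.
    os.replace(tmp, path)
```

Call write_wait_page("pleasewait.html", "pleasewait.html") at the start so the page keeps reloading itself, then write_wait_page("pleasewait.html", "first_converted_file.html") once the first conversion finishes.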

Finally, I added the command to my .muttrc. When I'm viewing a message either in the index or pager screens, F10 will call the script and decode all the attachments.

macro index <F10> "<pipe-message>~/bin/viewmailattachments\n" "View all attachments in browser"
macro pager <F10> "<pipe-message>~/bin/viewmailattachments\n" "View all attachments in browser"
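Since <pipe-message> hands the whole message to the script on standard input, the receiving end can start out something like the sketch below, which uses Python's standard email module to save every part that has a filename. The function name save_attachments is hypothetical, and the real script does more than this.

```python
import email
import os
from email import policy


def save_attachments(stream, outdir):
    """Parse a mail message from a binary stream and save each part
    that has a filename into outdir.
    Returns a list of (content_type, saved_path) tuples."""
    msg = email.message_from_binary_file(stream, policy=policy.default)
    saved = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            # Multipart containers have no payload of their own.
            continue
        # Don't trust directory components supplied by the sender.
        path = os.path.join(outdir, os.path.basename(filename))
        with open(path, "wb") as fp:
            fp.write(payload)
        saved.append((part.get_content_type(), path))
    return saved
```

Called from the mutt macro, the stream would be sys.stdin.buffer and outdir a fresh temporary directory, for instance one made with tempfile.mkdtemp().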

Whew! It was trickier than I thought it would be. But I find I'm using it quite a bit, and it takes a lot of the pain out of those attachment-full emails.

The script is available at: viewmailattachments.py on GitHub.

[ 15:18 Oct 15, 2015    More linux | permalink to this entry | ]
