Shallow Thoughts : tags : html
Akkana's Musings on Open Source Computing, Science, and Nature.
Tue, 03 Apr 2012
How do you show equations on a web page? Every now and then, I write
an article that involves math, and I wrestle with that problem.
The obvious (but wrong) approach: MathML
It was nearly fifteen years ago that MathML was recommended as a
standard for embedding equations inside an HTML page. I remember being
excited about it back then. There were a few problems -- like the
availability of fonts including symbols for integrals, summations
and so forth -- but they seemed minor. That was 1998.
Now, in 2012, I found myself wanting to write an article involving an
integral, so I looked into the state of MathML. I found that even now,
all these years later, it wasn't widely supported.
In Firefox I could show some simple equations, like
and
But when I tried them in Chromium, I learned that webkit-based
browsers don't support MathML. At all. The exception is Safari:
apparently Apple has added some MathML support into their browser
but hasn't contributed that code back to webkit (yet?)
Besides that, MathML is ridiculously hard to use. Here's the code for
that little integral:
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow>
<semantics>
<mrow>
<msubsup>
<mo>∫</mo>
<mn>x = 0</mn>
<mi>∞</mi>
</msubsup>
<mfrac>
<mrow>
<mo>ⅆ</mo>
<mi>x</mi>
</mrow>
<mi>x</mi>
</mfrac>
</mrow>
</semantics>
</mrow>
</math>
Ugh! You can't even specify infinity without using an HTML numeric
entity. And the code for the quadratic equation is even worse (use View
Source if you want to see it).
Good ol' tables
Several years ago, I wrote about the
Twelve
Days of Christmas and how to calculate the total number of gifts
represented in the song.
I needed summations, and I was rather proud of working out a way to
use HTML tables to display all the sums and line up everything correctly.
It wasn't exactly publication-quality graphics, but it was readable.
More recently, I worked out a way to do exponentials that way,
and found a hint about
how to do integrals:
| now | | |
|
| ∫
| P (t) |
dt |
| P0 = | ———— |
| 1 + t |
| 0 | | |
Looks a little better than the tiny MathML version. But the code isn't
any easier to read:
<table border="0" cellpadding="0" cellspacing="0">
<tr><td><td align="center"><small><i>now</i></small></td><td></td><td></td></tr>
<tr>
<td>
<td rowspan="3" valign="middle"><font size="6" style="font-size:3em" class="bigsym">∫</font>
<td align="center"><i>P</i> (<i>t</i>)</td>
<td rowspan="3" valign="middle"> <i>dt</i></td></tr>
<tr><td>P<sub>0</sub> =<td align="center">————</td></tr>
<tr><td><td align="center">1 + <i>t</i></td></tr>
<tr><td><td valign="top"><small><i>0</i></small></td><td></td><td></td></tr>
</table>
The solution: MathJax
And then I discovered MathJAX.
It was added recently to the Udacity
forums, and I think it's also what MITx
is using for their courses.
MathJax is fantastic. It's an open-source library that lets you
specify equations in readable ways -- you can use MathML, but you
can also use LaTEX or even ASCII math like `x = (-b +- sqrt(b^2-4ac))/(2a) .`
It uses Javascript: you put your equations in the text of the page
with delimiters like $$ around them (you can control the delimiters),
then run a function that scans the page content and replaces any
equations it sees with pretty graphics. (Viewers using NoScript
or similar extensions will need to allow mathjax.org to see the
equations, unless you make a local copy of the mathjax.org libraries,
which you probably should anyway if you're using a lot of equations.)
For displaying those graphics,
MathJax might use MathML, HTML and CSS, or whatever, depending on the
user's browser ... but you don't have to worry about that.
(Alas,
even
in Firefox, MathML rendering isn't up to par so MathJax doesn't
use it by default, though you can
specify it as
an option if you know your equations render well.)
Here's that integral again, using LaTeX format:
$$ P_0 =\int_0^\infty \frac {P(t) dt}{1 + t} $$
and
$$ x = {-b \pm \sqrt{b^2-4ac} \over 2a} $$
It's beautiful! And although I don't know LaTex at all -- been wanting an
excuse to learn it -- I put together that integral with five minutes
of web searching. (The quadratic code came from a MathJax demo page.)
Here's what the code looks like:
$$ P_0 =\int_0^\infty \frac {P(t) dt}{1 + t} $$
$$ x = {-b \pm \sqrt{b^2-4ac} \over 2a} $$
MathJax is even smart enough to notice the code there is in a
<pre> tag, so I didn't have to find a way to escape it.
I'm sold! The MathJax team has really put together a nice package, and
I think we'll be seeing it on a lot more websites.
If you want to try it, start here:
Getting Started
with MathJAX.
Tags: math, science, html, web, mathml
[
15:45 Apr 03, 2012
More science |
permalink to this entry |
comments
]
Thu, 12 Jan 2012
When I give talks that need slides, I've been using my
Slide
Presentations in HTML and JavaScript for many years.
I uploaded it in 2007 -- then left it there, without many updates.
But meanwhile, I've been giving lots of presentations, tweaking the code,
tweaking the CSS to make it display better. And every now and then I get
reminded that a few other people besides me are using this stuff.
For instance, around a year ago, I gave a talk where nearly all the
slides were just images. Silly to have to make a separate HTML file
to go with each image. Why not just have one file, img.html, that
can show different images? So I wrote some code that lets you go to
a URL like img.html?pix/whizzyphoto.jpg, and it will display
it properly, and the Next and Previous slide links will still work.
Of course, I tweak this software mainly when I have a talk coming up.
I've been working lately on my SCALE talk, coming up on January 22:
Fun
with Linux and Devices (be ready for some fun Arduino demos!)
Sometimes when I overload on talk preparation, I procrastinate
by hacking the software instead of the content of the actual talk.
So I've added some nice changes just in the past few weeks.
For instance, the speaker notes that remind me of where I am in
the talk and what's coming next. I didn't have any way to add notes on
image slides. But I need them on those slides, too -- so I added that.
Then I decided it was silly not to have some sort of automatic
reminder of what the next slide was. Why should I have to
put it in the speaker notes by hand? So that went in too.
And now I've done the less fun part -- collecting it all together and
documenting the new additions. So if you're using my HTML/JS slide
kit -- or if you think you might be interested in something like that
as an alternative to Powerpoint or Libre Office Presenter -- check
out the presentation I have explaining the package, including the
new features.
You can find it here:
Slide
Presentations in HTML and JavaScript
Tags: speaking, javascript, html, web, programming, tech
[
20:08 Jan 12, 2012
More speaking |
permalink to this entry |
comments
]
Sun, 08 Jan 2012
I've been having (mis)adventures learning about Python's various
options for parsing HTML.
Up until now, I've avoided doing any HTMl parsing
in my RSS reader FeedMe.
I use regular expressions to find the places where content starts and
ends, and to screen out content like advertising, and to rewrite links.
Using regexps on HTML is generally considered to be a no-no, but it
didn't seem worth parsing the whole document just for those modest goals.
But I've long wanted to add support for downloading images, so you
could view the downloaded pages with their embedded images if you so chose.
That means not only identifying img tags and extracting their src
attributes, but also rewriting the img tag afterward to point to the
locally stored image. It was time to learn how to parse HTML.
Since I'm forever seeing people flamed on the #python IRC channel for
using regexps on HTML, I figured real HTML parsing must be straightforward.
A quick web search led me to
Python's built-in
HTMLParser class. It comes with a nice example for how to use it:
define a class that inherits from HTMLParser, then define
some functions it can call for things like handle_starttag and
handle_endtag; then call self.feed(). Something like this:
from HTMLParser import HTMLParser
class MyFancyHTMLParser(HTMLParser):
def fetch_url(self, url) :
request = urllib2.Request(url)
response = urllib2.urlopen(request)
link = response.geturl()
html = response.read()
response.close()
self.feed(html) # feed() starts the HTMLParser parsing
def handle_starttag(self, tag, attrs):
if tag == 'img' :
# attrs is a list of tuples, (attribute, value)
srcindex = self.has_attr('src', attrs)
if srcindex < 0 :
return # img with no src tag? skip it
src = attrs[srcindex][1]
# Make relative URLs absolute
src = self.make_absolute(src)
attrs[srcindex] = (attrs[srcindex][0], src)
print '<' + tag
for attr in attrs :
print ' ' + attr[0]
if len(attr) > 1 and type(attr[1]) == 'str' :
# make sure attr[1] doesn't have any embedded double-quotes
val = attr[1].replace('"', '\"')
print '="' + val + '"')
print '>'
def handle_endtag(self, tag):
self.outfile.write('</' + tag.encode(self.encoding) + '>\n')
Easy, right? Of course there are a lot more details, but the
basics are simple.
I coded it up and it didn't take long to get it downloading images
and changing img tags to point to them. Woohoo!
Whee!
The bad news about HTMLParser
Except ... after using it a few days, I was hitting some weird errors.
In particular, this one:
HTMLParser.HTMLParseError: bad end tag: ''
It comes from sites that have illegal content. For instance, stories
on Slate.com include Javascript lines like this one inside
<script></script> tags:
document.write("<script type='text/javascript' src='whatever'></scr" + "ipt>");
This is
technically illegal html -- but lots of sites do it, so protesting
that it's technically illegal doesn't help if you're trying to read a
real-world site.
Some discussions said setting
self.CDATA_CONTENT_ELEMENTS = () would help, but it didn't.
HTMLParser's code is in Python, not C. So I took a look at where the
errors are generated, thinking maybe I could override them.
It was easy enough to redefine parse_endtag() to make it not throw
an error (I had to duplicate some internal strings too). But then I
hit another error, so I redefined unknown_decl() and
_scan_name().
And then I hit another error. I'm sure you see where this was going.
Pretty soon I had over 100 lines of duplicated code, and I was still
getting errors and needed to redefine even more functions.
This clearly wasn't the way to go.
Using lxml.html
I'd been trying to avoid adding dependencies to additional Python packages,
but if you want to parse real-world HTML, you have to.
There are two main options: Beautiful Soup and lxml.html.
Beautiful Soup is popular for large projects, but the consensus seems
to be that lxml.html is more error-tolerant and lighter weight.
Indeed, lxml.html is much more forgiving. You can't handle start and
end tags as they pass through, like you can with HTMLParser. Instead
you parse the HTML into an in-memory tree, like this:
tree = lxml.html.fromstring(html)
How do you iterate over the tree? lxml.html is a good parser, but it
has rather poor documentation, so it took some struggling to figure out
what was inside the tree and how to iterate over it.
You can visit every element in the tree with
for e in tree.iter() :
print e.tag
But that's not terribly useful if you need to know which
tags are inside which other tags. Instead, define a function that iterates
over the top level elements and calls itself recursively on each child.
The top of the tree itself is an element -- typically the
<html></html> -- and each element has .tag and .attrib.
If it contains text inside it (like a <p> tag), it also has
.text. So to make something that works similarly to HTMLParser:
def crawl_tree(tree) :
handle_starttag(tree.tag, tree.attrib)
if tree.text :
handle_data(tree.text)
for node in tree :
crawl_tree(node)
handle_endtag(tree.tag)
But wait -- we're not quite all there. You need to handle two
undocumented cases.
First, comment tags are special: their tag attribute,
instead of being a string, is <built-in function Comment>
so you have to handle that specially and not assume that tag
is text that you can print or test against.
Second, what about cases like
<p>Here is some <i>italicised</i> text.</p>
? in this case, you have the p tag, and its text is
"Here is some ".
Then the p has a child, the i tag, with text of "italicised".
But what about the rest of the string, " text."?
That's called a tail -- and it's the tail of the adjacent i tag it follows,
not the parent p tag that contains it. Confusing!
So our function becomes:
def crawl_tree(tree) :
if type(tree.tag) is str :
handle_starttag(tree.tag, tree.attrib)
if tree.text :
handle_data(tree.text)
for node in tree :
crawl_tree(node)
handle_endtag(tree.tag)
if tree.tail :
handle_data(tree.tail)
See how it works? If it's a comment (tree.tag isn't a string),
we'll skip everything -- except the tail. Even a comment
might have a tail:
<p>Here is some <!-- this is a comment --> text we want to show.</p>
so even if we're skipping comment we need its tail.
I'm sure I'll find other gotchas I've missed, so I'm not releasing
this version of feedme until it's had a lot more testing. But it
looks like lxml.html is a reliable way to parse real-world pages.
It even has a lot of convenience functions like link rewriting
that you can use without iterating the tree at all. Definitely worth
a look!
Tags: python, programming, html
[
14:04 Jan 08, 2012
More programming |
permalink to this entry |
comments
]
Fri, 25 Jun 2010
Several groups I'm in insist on using LinkedIn for discussions,
instead of a mailing list. No idea why -- it's so much harder to use
-- but for some reason that's where the community went.
Which is fine except for happens just about every time I try to view
a discussion:
I get a notice of a thread that sounds interesting, click on the link
to view it, read the first posting, hit the space bar to scroll down
... whoops! Focus was in that silly search field at the top right of the page,
so it won't scroll.
It's even more fun if I've already scrolled down a bit with the
mousewheel -- in that case, hitting spacebar jumps back up to the
top of the page, losing any context I have as well as making me
click in the page before I can actually scroll.
Setting focus to search fields is a good thing on some pages.
Google does it, which makes terrific sense -- if you go to
google.com, your main purpose is to type something in that search box.
It doesn't, however, make sense on a page whose purpose is to
let people read through a long discussion thread.
Since I never use that search field anyway, though, I came up with
a solution using Firefox's user css.
It seems there's no way to make an input field un-focusable or
read-only using pure CSS (of course, you could use Javascript and
Greasemonkey for that); but as long as you don't need to use it,
you can make it disappear entirely.
Add this line to chrome/userContent.css inside your Firefox profile
(create it if it doesn't already exist):
form#global-search span#autocomplete-container input#main-search-box {
visibility:hidden;
}
Then restart Firefox and load a discussion page.
The search box should be hidden,
and spacebar should scroll the page just like it does on most web pages.
Of course, this will need to be updated the next time
LinkedIn changes their page layout. And it's vaguely possible that
somewhere else on the web is a page with that hierarchy of element names.
But that's easy enough to fix: run a View Page Source
on the LinkedIn page and add another level or two to the CSS rule.
The concept is the important thing.
Tags: html, web, css, tips, annoyances
[
16:17 Jun 25, 2010
More tech/web |
permalink to this entry |
comments
]
Sun, 20 Jun 2010
Regular readers probably know that I use
HTML
for the slides in my talks, and I present them either with Firefox
in fullscreen mode, or with my own Python
preso
tool based on webkit.
Most of the time it works great. But there's one situation that's
always been hard to deal with: low-resolution projectors.
Most modern projectors are 1024x768, and have been for quite a few years,
so that's how I set up my slides. And then I get asked to give a talk
at a school, or local astronomy club, or some other group that
has a 10-year-old projector that can only handle 800x600. Of course,
you never find out about this ahead of time, only when you plug in
right before the talk. Disaster!
Wait -- before you object that HTML pages shouldn't use pixel values and
should work regardless of the user's browser window size: I completely
agree with you. I don't specify absolute font sizes or absolute
positioning on web pages -- no one should.
But presentation slides are different: they're designed for
a controlled environment where everyone sees the same thing using the
same software and hardware.
I can maintain a separate stylesheet -- that works for making the
font size smaller but it doesn't address the problem of pictures too
large to fit (and we all like to use lots of pictures in presentations,
right?) I can maintain two separate copies of the slides for the two sizes,
but that's a lot of extra work and they're bound to get out of sync.
Here's a solution I should have thought of years ago: full-page zoom.
Most major browsers have offered that capability for years, so the
only trick is figuring out how to specify it in the slides.
IE and the Webkit browsers (Safari, Konqueror, etc.) offer a wonderful
CSS property called zoom. It works like this:
body {
zoom: 78.125%;
}
78.125% is the ratio between an 800-pixel projector and a 1024-pixel one.
Just add this line, and your whole page will be scaled down to the
right size. Lovely!
Lovely, except it doesn't work on Firefox
(bug 390936).
Fortunately, Firefox has another solution: the more general and not yet
standardized CSS transform, which Mozilla has implemented as the
Mozilla-specific property
-moz-transform.
So add these lines:
body {
position: absolute; left: 0px; top: 0px;
-moz-transform: scale(.78125, .78125);
}
The position: absolute is needed because when Firefox scales
with -moz-transform, it also centers whatever it scaled, so the
slide ends up in the top center of the screen.
On my laptop, at least, it's the upper left part of the screen that
gets sent to the projector, so slides must start in the upper left corner.
The good news is that these directives don't conflict; you can put
both zoom and -moz-transform in the same rule and things
will work fine. So I've added this to the body rule in my slides.css:
/* If you get stuck on an 800x600 projector, use these:
zoom: 78.125%;
position: absolute; left: 0px; top: 0px;
-moz-transform: scale(.78125, .78125);
*/
Uncomment in case of emergency and all will be well.
(Unless you use Opera, which doesn't seem to understand either version.)
Tags: speaking, html, css, browsers, firefox, mozilla
[
11:14 Jun 20, 2010
More tech/web |
permalink to this entry |
comments
]
Mon, 29 Mar 2010
I maintain the websites for several clubs. No surprise there -- anyone
with even a moderate knowledge of HTML, or just a lack of fear of
it, invariably gets drafted into that job in any non-computer club.
In one club, the person in charge of scheduling sends out an elaborate
document every three months in various formats -- Word, RTF, Excel, it's
different each time. The only regularity is that it's always full of
crap that makes it hard to turn it into a nice simple HTML table.
This quarter, the formats were Word and RTF. I used unrtf to turn
the RTF version into HTML -- and found a horrifying page full of
lines like this:
<body><font size=3></font><font size=3><br>
</font><font size=3></font><b><font size=4></font></b><b><font size=4><table border=2>
</font></b><b><font size=4><tr><td><b><font size=4><font face="Arial">Club Schedule</font></font></b><b><font size=4></font></b><b><font size=4></font></b></td>
<font size=3></font><font size=3><td><font size=3><b><font face="Arial">April 13</font></b></font><font size=3></font><font size=3><br>
</font><font size=3></font><font size=3><b></b></font></td>
I've put the actual page content in bold; the rest is just junk,
mostly doing nothing, mostly not even legal HTML,
that needs to be eliminated if I want
the page to load and display reasonably.
I didn't want to clean up that mess by hand! So I needed some regular
expressions to clean it up in an editor.
I tried emacs first, but emacs makes it hard to try an expression then
modify it a little when the first try doesn't work, so I switched to vim.
The key to this sort of cleanup is non-greedy regular expressions.
When you have a bad tag sandwiched in the middle of a line containing
other tags, you want to remove everything from the <font
through the next > -- but no farther, or else you'll delete
real content. If you have a line like
<td><font size=3>Hello<font> world</td>
you only want to delete through the <font>, not through the </td>.
In general, you make a regular expression non-greedy by adding a ?
after the wildcard -- e.g. <font.*?>. But that doesn't work
in vim. In vim, you have to use \{M,N} which matches
from M to N repetitions of whatever immediately precedes it.
You can also use the shortcut \{-} to mean the same thing
as *? (0 or more matches) in other programs.
Using that, I built up a series of regexp substitutes to clean up
that unrtf mess in vim:
:%s/<\/\{0,1}font.\{-}>//g
:%s/<b><\/b>//g
:%s/<\/b><b>//g
:%s/<\/i><i>//g
:%s/<td><\/td>/<td><br><\/td>/g
:%s/<\/\{0,1}span.\{-}>//g
:%s/<\/\{0,1}center>//g
That took care of 90% of the work, leaving me with hardly any cleanup
I needed to do by hand. I'll definitely keep that list around for
the next time I need to do this.
Tags: regexp, html, editors, vim
[
22:02 Mar 29, 2010
More linux/editors |
permalink to this entry |
comments
]