HTML Parsing with JavaScript

| 22 Comments

The project I've been working on in Mozilla's research group is dom.js: an implementation of the browser DOM in JavaScript. I've gotten much of the core DOM 4 document tree functionality implemented, and have started on DOM features defined by the HTML spec.

One of those features is the innerHTML property. And to implement it, I need an HTML parser. So I dove in and implemented one. Here are some of the things I learned:

  • Long ago, browser vendors made the choice that browsers would not display error messages when given malformed HTML, but instead attempt to display the malformed input in some reasonable way. For web compatibility, all browsers today must do that. XHTML was a attempt at imposing a strict syntax on the web, and it failed.
  • So for web compatibility, browsers must all parse HTML liberally and accept malformed input. And for web interoperability, browsers must all parse malformed HTML in the same way.
  • The HTML parsing specification is long and complicated. This is because in addition to specifying exactly how to parse valid HTML inputs, it also specifies exactly how to parse all possible invalid inputs. A standardized parsing algorithm is, in fact, one of the most important (but least appreciated) features of HTML 5 and represents a huge amount of work by Ian Hickson.
  • The HTML spec defines the parser in terms of a tokenizer stage and a tree builder stage. And both stages are defined as state machines (which you may remember from undergrad CS classes). The prose description of a state machine is verbose, but the translation to JavaScript code is fairly mechanical (in fact, I wrote a simple script to help me translate most of the tokenizer state machine from HTML prose to JavaScript code).
  • The advantage of a state machine specification is that it can account for every possible input. What happens, for example, when the tokenizer sees <a href=>, for example, or reaches the end-of-file while in the middle of a comment? Since the tokenizer is specified as a state machine unexpected inputs like these are just state transitions and have a well-defined behavior.
  • The flip side is that by specifying the HTML parser in terms of state machines, the HTML spec never explicitly says how the parser is supposed to behave when it is given <a href=> or other malformed input. The behavior is implicit in the transitions of the state machine, but to figure out what the parser does for any given input, you have to read through the spec, simulating the state machines in your head. There is not a formal grammar like you'd see when reading a programming language specification, just an algorithmic description of how the parser behaves.
  • The HTML specification includes specific instructions on how to handle the </sarcasm> end tag. I think its a joke, though, since the instructions say to treat it like any other end tag. :-)
  • The HTML specification integrates the SVG and MathML grammars, and as part of that, HTML adopts all the named character entities (things like &lt; and &quot;) of MathML. There are now 2125 named character references you can use, including &CounterClockwiseContourIntegral; and &InvisibleComma;!
  • An HTML parser implementation cannot be separated from the HTML and DOM implementation of which it is a part. I had always assumed that an HTML parser could be written generically to build up a DOM parse tree using public DOM APIs like document.createElement() and Element.appendChild(). It turns out that there are a number of places where the parser must manipulate DOM nodes in ways that are not possible with the public API. These include setting the compatMode property of the Document object and setting the form property of HTML form elements like input fields and buttons.
  • There is a really nice HTML parsing test suite (the tokenizer and tree-construction directories are the most important ones) and I gather that the major browser vendors all use it. Strangely, the documentation for using the test data is at a completely different site.

It took me a couple of weeks, but I learned a lot about HTML parsing and ended up with a working parser! The source code weighs in at over a quarter megabyte and 6300 sloc. (Though to be fair, I should note that the object literal that holds those 2125 named character references takes up over 1000 of those lines of code.) It passes the tokenizer and tree construction tests mentioned above, and it can parse the HTML specification (over 6 megabytes of HTML) in under 8 seconds.

If I can brag a bit, this 8 second speed easily beats the two other JavaScript-based HTML parsers that I know of. The Node-based parser at https://github.com/aredridel/html5 displays some kind of quadratic behavior (doubling the size of the HTML to be parsed quadruples the parsing time) and as far as I can tell (I gave up after 5 minutes or so) it is just not capable of parsing the HTML spec, at least not when used with the Node-based DOM library at https://github.com/tmpvar/jsdom.

The other JavaScript HTML parser I know about is the Validator.nu parser by my Mozilla colleague Henri Sivonen. This is a Java-based parser, but it includes a custom Java-to-C++ translator, and when translated to C++, it is used as Firefox's HTML parser. The same codebase can be translated to JavaScript using Google's GWT parser, but the result is not terribly efficient. Parsing the HTML specification takes 18 to 20 seconds with the GWT version of Henri's parser. (I thought seriously about modifying Henri's Java-to-C++ translator to output JavaScript instead of C++, but in the end, I decided that it would take a similar amount of time and would be more fun and educational to just write my parser from scratch.)

If you've read this far, here is where I confess that my parser isn't even done yet. It does not yet execute scripts, which means that it doesn't handle document.write() calls. That's next on my list, but its kind of scary. You'd think that executing a script in a JavaScript-based parser would just amount to an eval() call. But there's a lot more to it than that. Another colleague of mine described these sections of the spec as "brain-meltingly" complex. We'll see. I'm keeping my fingers crossed that none of the strange cases where document.write() inserts new scripts require me to refactor the parsing code in major ways!

22 Comments

Congratulations, I think I'll be having a dig through the source sometime soon.

Awesome, awesome write-up!

I had no idea about &InvisibleComma; ... and &CounterClockwiseContourIntegral; is pretty rad.

Looking forward to seeing where this goes.

I'd say &pi; is more rad than &CounterClockwiseContourIntegral;, myself -- and 2&pi; even moreso.

(I'll be here all week.)

"I had always assumed that an HTML parser could be written generically to build up a DOM parse tree using public DOM APIs like document.createElement() and Element.appendChild(). It turns out that there are a number of places where the parser must manipulate DOM nodes in ways that are not possible with the public API."
=> Indeed. I have always wondered why in this benchmark (http://www.quirksmode.org/dom/innerhtml.html ), the innerHTML were faster than native methods while it certainly require more work (parsing the string) than manually create all relevant elements. The explanation i came up with was that the innerHTML method was doing differently than just call document.createElement/appendChild() when needed.

Another explanation is that one .innerHTML call is less JS C++DOM round-trips.

You compare performance with other JS implementation, but how does your lib compare to a C++ implementation?

Anyway, that's some great work! Thanks!

The amount of code in this parser is blowing my mind :) Occasionally, I want to fool around with simple parser coding for some DSL (domain-specific language) translation; but the amount of code some of this takes is daunting.

Congrats, parsing the HTML5 spec (both literally and figuratively) is no small feat.

Bad-ass!

Speaking of which, this sounds like a good entry for badassjs.com...

Curious: have you thought of trying to take Henri's generated C++ code, compiling it with clang, and compiling /that/ with Emscripten?

Dave

Jeff,

I've got only one response to puns like yours: ↠

(Apparently my blog allows HTML character references in comments.) Jeff, my response to you is &Rarr;

David,

I'm not sure exactly how to measure the parsing performance of the C++ parser in Firefox... I tried using the new performance.timing properties (which are very cool, by the way). When I load the HTML spec into an iframe, it tells me that it takes 4.5 seconds to parse, but that is doing some rendering, too, so its not a fair measure. (I removed all scripts and external links from the spec so there wouldn't be script execution or network activity slowing things down.). If I do the same test using an iframe with style="display:none", it tells me that the parsing only takes 0.5 seconds, which seems blazing fast!

Ben,

DSL parsers are generally simple and kind of fun. If you've got a strict syntax and can just report an error on invalid input, then your code will be much simpler than this.

Dave,

I never occured to me to try emscripten on the C++ translation of Henri's parser. That would be bad-ass! I don't know enough about the design and API of the C++ version to know where to begin with that (questions like how do I get it to accept a string of UTF-16 characters as input) but it would be a really interesting experiment...

Wow, nice work!

I'm hoping to fix up the speed on mine. Thanks for highlighting that so clearly!

Good luck on those brain-melting edge cases. They're a doozy. (Also, double-check your table reparenting -- I've not tried yours yet, but those tests are a real headache)

David,

That's impressive work and thanks for sharing. One comment about scripts and document.write(): I think you might just luck out and not have to implement those, because browsers skip script tags for innerHTML. If innerHTML is all you plan to support, that is.

"An HTML parser implementation cannot be separated from the HTML and DOM implementation of which it is a part. I had always assumed that an HTML parser could be written generically to build up a DOM parse tree using public DOM APIs like document.createElement() and Element.appendChild(). It turns out that there are a number of places where the parser must manipulate DOM nodes in ways that are not possible with the public API. These include setting the compatMode property of the Document object and setting the form property of HTML form elements like input fields and buttons. "

Is there a rationale for why this is the case? I also assumed that the DOM API was intended to provide a way to programatically construct what can be declaratively expressed in HTML. In seems like a bug if this is not the case and perhaps one that we should pursue getting corrected in the specification.

If there is a sound rationale for limiting access to setting these property it would still seem appropriate to have a "privileged" API that can be used by parsers. This would then decouple the DOM and HTML parser implementations and that seems like a very good goal. Perhaps we need to propose that API

Allen

Toby,

innerHTML was the immediate justification for writing the parser, but I'll be using it for full documents, too, so I do need to deal with scripts...

As best as I can tell, the rationale is that the DOM and the parsing algorithm are part of the same spec and it was never a goal to make them independent.

At first, I thought that this issue only existed for doctypes and the document.compatMode attribute, and asked for a public API for setting the compatMode. You can follow the email thread here: http://lists.w3.org/Archives/Public/www-dom/2011OctDec/0014.html

The discussion didn't completely convince me that the parser and DOM shouldn't be decoupled. But they did convince me that there isn't any enthusiasm for such a project among spec writers and DOM implementors. So I dropped it :-).

One possible application of a JavaScript DOM builder like yours would be as a replacement for JSP: Run the DOM builder server-side in Node or Rhino and use it to inject the dynamic data into the document, then serialize it out to the client. The result? A fully-rendered web page that appears immediately without any FOUJUI (Flash Of Uninitialized JavaScript UI). This makes a site appear fast partly because it *is* faster, and partly because the page is already complete the moment it appears on the screen. Are you folks thinking of doing something like this?

Well, that discussion certainly didn't convince me. The respondents don't seem to be thinking outside of their native browser implementer boxes.

Just to start, it is poor engineering practice to define two complex subsystem that are (unnecessarily) required to be tightly coupled.

I can also think of a number of use cases:

A parser for an alternative HTML wire encoding that still generates the native DOM.

Sandbox HTML parser that imposes restrictions on the loaded HTML

A HTML parser that recognizes experimental or non-standard tags.

A "live editing" tool for HTML that fully supports all quirks modes.

Various sorts of dynamic HTML rewritters.

These are all things that are similar to things that are routinely done by tool builders for other languages.

etc.

So here's where you go and create a simple web interface and do some wicked static analysis on html ;)

Not sure how XHTML failed, every web designer I know, myself included, prefer strict syntax.

Just because HTML5 won out against XHTML 2 doesn't mean XHTML failed, I think most people writing HTML5 today will be using the strict syntax.

Ignoring that, I'm thoroughly impressed with the job you've done on the parse, extremely impressed :)

Extraordinary work David.

Writing any kind of parser is brain-crushingly burdensome.
You cannot foresee which problems you will face until you dive into code.
Writing an HTML parser, using "JavaScript" -- requires a lot of time, non-distracted Ninja attention, and nerves of steel.
From the github repo, it seems that you've achived a lot.

Really impressive; keep it up!

Books

ECMAScript 5 & HTML5!

"A must-have reference"
Brendan Eich,
creator of JavaScript

JavaScript graphics makes web programming fun again!

Read Less, Learn More

Comprehensive coverage of Ruby 1.8 and 1.9

"The New Most Important Ruby Book"
Peter Cooper,
rubyinside.com

The classic Java quick-reference

About

Advertising

Pages

Hosted By

Powered by Movable Type 4.21-en