Hi there!

As you might have guessed from the title, today's topic is HTML parsing.


I once had an idea (call it X), but implementing it required a computed DOM, with all its styles and other goodies. Googling turned up nothing useful. There are all kinds of bindings for WebKit, but they don't work on every platform and are heavily cut down to boot. In some projects, WebKit is wrapped in a frontend that you drive via JavaScript. I tried a few of these, with poor results: the resource consumption alone was too high.

Wants and Wishes

What I wanted didn't seem like much:

And so I joined the unequal battle!

I studied the existing HTML and CSS parsers. They all fell into three rough categories:

Given No. 3, the case seems closed, doesn't it? Nope, and here's why: all existing parsers are built on the 'Parse and Die' principle. You give the program a complete HTML document and it returns a result, but after that nothing except reading is possible. This limits what such parsers can be used for. Remarkably, some push DOM manipulation up a level: the document is parsed by a C parser, and then, via bindings, you try to work with the DOM in, say, Python, which is a bit absurd.
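To make the contrast with 'Parse and Die' concrete, here is a minimal sketch of a tree that stays editable after it has been built. Everything in it (`node_t`, `node_create`, `node_append`, `node_detach`, `node_child_count`) is a hypothetical name made up for illustration, not the API of my parser or of any existing one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical DOM node: links in every direction, so the tree can be
 * edited in place after parsing instead of being read-only. */
typedef struct node {
    char tag[16];
    struct node *parent, *first_child, *last_child, *prev, *next;
} node_t;

/* Allocate a detached node; returns NULL if allocation fails. */
static node_t *node_create(const char *tag)
{
    node_t *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    strncpy(n->tag, tag, sizeof n->tag - 1); /* calloc left the final NUL */
    return n;
}

/* Append a detached node as the last child of `parent`. */
static void node_append(node_t *parent, node_t *child)
{
    assert(child->parent == NULL);
    child->parent = parent;
    child->prev = parent->last_child;
    if (parent->last_child != NULL)
        parent->last_child->next = child;
    else
        parent->first_child = child;
    parent->last_child = child;
}

/* Unlink a node from its parent; the node and its subtree stay usable. */
static void node_detach(node_t *n)
{
    node_t *p = n->parent;
    if (p == NULL)
        return;
    if (n->prev != NULL) n->prev->next = n->next;
    else                 p->first_child = n->next;
    if (n->next != NULL) n->next->prev = n->prev;
    else                 p->last_child = n->prev;
    n->parent = n->prev = n->next = NULL;
}

/* Count the direct children of a node. */
static size_t node_child_count(const node_t *n)
{
    size_t count = 0;
    for (const node_t *c = n->first_child; c != NULL; c = c->next)
        count++;
    return count;
}
```

With a structure like this, a `div` created by the parser can later be detached from `body` and appended somewhere else without leaving C, which is exactly what a read-only result cannot offer.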

Furthermore, none of them let you wedge into the stream (the HTML input, that is) during parsing. This is critical for hooking up a JavaScript engine. It's a long story, so I'd better show it:

HTML document fragment:
<script>document.write("<div cl");</script>ass="future"></div>
Result in any browser with JS enabled:
<div class="future"></div>

So a fully fledged DIV element comes out. By the way, tokenizing the SCRIPT tag takes a hell of an effort. I had to draw a graph.

Script data tokenization. HTML 5
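The `document.write` case above can be sketched as an input stream that accepts new text at the current read position while parsing is still in progress. This is only an illustration under my own assumptions: `stream_t`, `stream_init`, `stream_skip_past`, `stream_write` and `stream_free` are hypothetical names, and the script's output is passed in by hand where a real JavaScript engine would supply it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical input stream: the tokenizer reads from `pos`, and a script
 * may splice extra text in at `pos` while parsing is still running. */
typedef struct {
    char  *data; /* NUL-terminated, heap-allocated input */
    size_t len;  /* length without the terminating NUL */
    size_t pos;  /* next unread byte */
} stream_t;

/* Copy the initial HTML into the stream; returns 0 on success. */
static int stream_init(stream_t *s, const char *html)
{
    s->len = strlen(html);
    s->pos = 0;
    s->data = malloc(s->len + 1);
    if (s->data == NULL)
        return -1;
    memcpy(s->data, html, s->len + 1);
    return 0;
}

/* Move the read position just past the next occurrence of `marker`. */
static int stream_skip_past(stream_t *s, const char *marker)
{
    const char *hit = strstr(s->data + s->pos, marker);
    if (hit == NULL)
        return -1;
    s->pos = (size_t)(hit - s->data) + strlen(marker);
    return 0;
}

/* Insert `text` at the read position, the way document.write() does. */
static int stream_write(stream_t *s, const char *text)
{
    size_t n = strlen(text);
    char *grown;

    assert(s->pos <= s->len);
    grown = realloc(s->data, s->len + n + 1);
    if (grown == NULL)
        return -1;
    s->data = grown;
    /* Shift the unread tail, including its NUL, to make room. */
    memmove(s->data + s->pos + n, s->data + s->pos, s->len - s->pos + 1);
    memcpy(s->data + s->pos, text, n);
    s->len += n;
    return 0;
}

static void stream_free(stream_t *s)
{
    free(s->data);
    s->data = NULL;
    s->len = s->pos = 0;
}
```

Load the fragment from the example, call `stream_skip_past(&s, "</script>")` and then `stream_write(&s, "<div cl")`: the unread input becomes `<div class="future"></div>`, and the tokenizer simply continues from there. A parser that requires the whole document up front has nowhere to put that inserted text.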

After everything I had seen, I decided to write it from scratch in C. Requirements for the code appeared at once:

Why go the hard way, in C? The solution had to be embeddable, so that wrapping it for another programming language would be relatively easy.

Here's what I managed to draft, more or less haphazardly:

The renderer deserves a longer description, as the short phrase 'Renderer of inline elements' conceals a lot: handling fonts according to the specs, calculating text size, computing 'vertical-align', building an auxiliary tree for drawing text, and a whole lot more.

After 2–3 years of unhurried development, I started rewriting the draft into a production version. The first component, quite logically, was the HTML parser.

Now it has the following capabilities:

Next in line are the CSS parser and the renderer. I'm writing them all by myself, and I'm still full of energy.

Any help is very welcome!

Thanks for your attention! I hope you enjoy it!

The parser itself

See next article: Benchmark. Analyzing and Testing Current HTML Parsers