MyHTML: an HTML Parser in Pure C with POSIX Threads Support, by Alexander Borisov

Hi there!

As you might have guessed from the title, today's topic is HTML parsing.

Preface

I once had an idea whose implementation required a fully computed DOM, with all its styles and other goodies. Googling turned up nothing useful. There are all kinds of bindings for WebKit, but they do not work on every platform and are badly cut down to boot. Some projects wrap WebKit in a frontend that you drive from JavaScript. I tried a few of these, with poor results; the resource consumption alone was too high.

Wants and Wishes

What I wanted didn't seem like much:

An HTML renderer without heavy dependencies. Only a renderer, without the network layer. In other words, complete HTML computation up to the point of drawing in a window
The ability to hook the whole thing up to a JavaScript engine
Ability to easily make bindings for other programming languages

And so I joined the unequal battle!

I studied the existing HTML and CSS parsers. They all fell into three rough categories:

Those that parse ad hoc, with their own approach to HTML tokenization
Those that follow the specification loosely
Those that follow the specification strictly

Given the third category, the matter might seem closed, right? No, and here is why: every existing parser is built on the 'Parse and Die' principle. You hand the program a complete HTML document and it returns a result, but after that nothing except reading is possible. This limits what the parsers can be used for. Remarkably, some push DOM handling one level up. The principle is this: the document is parsed by a C parser, and then the DOM is manipulated through bindings in, say, Python, which is a bit absurd.

Moreover, none of them lets you insert data into the input stream (the HTML stream, that is) while parsing is in progress. This is critical for attaching a JavaScript engine. It is a long story, so I'd better show it:

HTML document fragment:

<script>document.write("<div cl");</script>ass="future"></div>

Result in any browser with JS enabled:

<div class="future"></div>

So a complete DIV element comes out. By the way, tokenizing the SCRIPT tag takes a hell of an effort. I had to draw a graph.
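
To make the mechanics concrete, here is a minimal sketch in C of the one operation this example depends on: inserting text into the input stream at the exact point where the tokenizer will read next. All names here (input_stream, stream_init, stream_write, run_script) are hypothetical and are not the MyHTML API. The text passed to run_script stands in for what a real JavaScript engine would hand to document.write, since no script is actually executed in this sketch.

```c
#include <string.h>

/* Hypothetical types and names for illustration; not the MyHTML API. */
struct input_stream {
    char   data[256];  /* pending input, NUL terminated */
    size_t pos;        /* next character the tokenizer will read */
};

/* Load a document into the stream. Returns 0 on success, -1 if too long. */
static int stream_init(struct input_stream *s, const char *html)
{
    size_t len = strlen(html);
    if (len + 1 > sizeof(s->data))
        return -1;
    memcpy(s->data, html, len + 1);
    s->pos = 0;
    return 0;
}

/* Insert text at the current read position, the way document.write()
 * injects characters ahead of the tokenizer. */
static int stream_write(struct input_stream *s, const char *text)
{
    size_t len = strlen(s->data);
    size_t add = strlen(text);
    if (len + add + 1 > sizeof(s->data))
        return -1;
    /* Shift the unread tail (including the NUL) to make room. */
    memmove(s->data + s->pos + add, s->data + s->pos, len - s->pos + 1);
    memcpy(s->data + s->pos, text, add);
    return 0;
}

/* Consume input up to and including the next "</script>", then apply the
 * text the script wrote. 'written' is an assumption standing in for the
 * output of a real JavaScript engine. */
static int run_script(struct input_stream *s, const char *written)
{
    const char *end = strstr(s->data + s->pos, "</script>");
    if (end == NULL)
        return -1;
    s->pos = (size_t)(end - s->data) + strlen("</script>");
    return stream_write(s, written);
}
```

After stream_init with the fragment above and run_script(&s, "<div cl"), the unread input at s.data + s.pos is <div class="future"></div>, which is what the tokenizer would go on to read. A parser that requires the whole document up front has no place to perform this step.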

After everything I had seen, I decided to write it from scratch in C. The requirements for the code appeared at once:

C99 support
Capability to separate the HTML parser from the renderer, to be used standalone
No external dependencies

Why go the hard way and use C? The solution had to be embeddable, so that bindings for other programming languages would be relatively easy to write.

Here is what I managed to draft by trial and error:

HTML parser
CSS parser
Selectors
Renderer for 'inline', 'inline-block', 'block', 'table'...

The renderer deserves a long description of its own, because the short phrase 'Renderer of inline elements' conceals a lot: handling fonts per the specs, calculating text size, computing 'vertical-align', building an auxiliary tree for drawing text, and a whole lot more.

After 2–3 years of unhurried development, I started rewriting the draft into a production version. Quite logically, the HTML parser came first.

Now it has the following capabilities:

Parsing HTML asynchronously, processing tokens, building the tree
Full HTML5 support per the specification at html.spec.whatwg.org/multipage
Two APIs, high-level and low-level. The former is a documented public API with everything it should have, but it does not expose the internal structures. The latter means using the sources directly
Ability to manipulate elements: addition, removal, modification
Ability to manipulate element attributes: addition, removal, modification
Support for 34 input encodings. Output and all internal processing are in UTF-8 only
Ability to detect the text encoding. Currently available: the Unicode encodings UTF-8, UTF-16LE, UTF-16BE (plus detection by BOM), and the Russian ones: windows-1251, koi8-r, iso-8859-5, x-mac-cyrillic, ibm866
Ability to run in single mode, without threads
Parsing HTML fragments
Parsing in chunks. Pieces of HTML broken at arbitrary places are parsed without prebuffering
No external dependencies
C99 support
Passes all tree construction tests from html5lib-tests
Advanced memory management. Memory is cached and allocated both in chunks and per object. For example, removing 10 elements and then adding 10 others will not consume extra memory for the new ones
+ a whole lot of small but useful features that would take too long to describe.
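
The chunk parsing item is worth a small illustration. Parsing input that is broken at arbitrary places without prebuffering comes down to keeping the tokenizer state between calls, so that a tag split across two chunks is simply continued. The sketch below shows that general technique under heavy simplifications; it is an assumption about the approach, not MyHTML's code. All names (tokenizer, tokenizer_init, tokenizer_feed) are hypothetical, it only records the names of start tags, and it ignores comments, quoted '>' characters and everything else a real HTML5 tokenizer must handle.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical names for illustration; not the MyHTML API. */
enum state { ST_DATA, ST_TAG_OPEN, ST_TAG_NAME, ST_TAG_REST };

struct tokenizer {
    enum state state;     /* survives between chunks */
    char       name[32];  /* tag name read so far, possibly partial */
    size_t     name_len;
    char       log[256];  /* start-tag names seen, space separated */
};

static void tokenizer_init(struct tokenizer *t)
{
    memset(t, 0, sizeof(*t));
    t->state = ST_DATA;
}

/* Append the finished tag name to the log (dropped if the log is full). */
static void emit_name(struct tokenizer *t)
{
    t->name[t->name_len] = '\0';
    if (strlen(t->log) + t->name_len + 2 <= sizeof(t->log)) {
        if (t->log[0] != '\0')
            strcat(t->log, " ");
        strcat(t->log, t->name);
    }
    t->name_len = 0;
}

/* Feed one chunk. Nothing is buffered except the partial tag name. */
static void tokenizer_feed(struct tokenizer *t, const char *chunk, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        char c = chunk[i];
        switch (t->state) {
        case ST_DATA:
            if (c == '<')
                t->state = ST_TAG_OPEN;
            break;
        case ST_TAG_OPEN:
            if (isalpha((unsigned char)c)) {
                t->name[0] = c;
                t->name_len = 1;
                t->state = ST_TAG_NAME;
            } else {
                /* End tag or anything else: skip to the closing '>'. */
                t->state = (c == '>') ? ST_DATA : ST_TAG_REST;
            }
            break;
        case ST_TAG_NAME:
            if (c == '>') {
                emit_name(t);
                t->state = ST_DATA;
            } else if (c == ' ' || c == '/') {
                emit_name(t);
                t->state = ST_TAG_REST;
            } else if (t->name_len + 1 < sizeof(t->name)) {
                t->name[t->name_len++] = c;
            }
            break;
        case ST_TAG_REST:
            if (c == '>')
                t->state = ST_DATA;
            break;
        }
    }
}
```

Feeding the five chunks `<di`, `v cl`, `ass="future"><`, `p>hi</p></d` and `iv>` one after another leaves "div p" in the log, the same result as feeding the whole document in a single call.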

Next in line are the CSS parser and the renderer. I'm writing all of it by myself and am still full of energy.

Any help is very welcome!

Thanks for your attention! I hope you enjoy it!

The parser itself