The web in 1000 lines of C

16 points by smlckz


jcs

The spec spells out how to properly close specific tags that were never closed, or when certain ones are opened before others are closed (and some tags behave differently than others).
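
To make that concrete, here is a rough sketch of the "this tag closes that unclosed tag" idea. The function name `implicitly_closes` and the helper `in_list` are made up for illustration, and the table covers only a small, well-known subset of the rules (an open `p` closed by a following block-level tag, `li` closed by `li`, `dt`/`dd` closed by each other). The real spec handles many more cases than this.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `name` is in a NULL-terminated list. */
static bool in_list(const char *name, const char *const *list)
{
    for (; *list != NULL; list++)
        if (strcmp(name, *list) == 0)
            return true;
    return false;
}

/*
 * Hypothetical function: returns true if seeing the start tag `incoming`
 * should implicitly close a still-open `open` element.
 * Illustrative subset only; tag names are assumed to be lowercase.
 */
bool implicitly_closes(const char *open, const char *incoming)
{
    /* A few of the block-level tags that end an unclosed <p>. */
    static const char *const closes_p[] = {
        "p", "div", "ul", "ol", "pre", "table", "blockquote",
        "h1", "h2", "h3", "h4", "h5", "h6", NULL
    };

    if (strcmp(open, "p") == 0)
        return in_list(incoming, closes_p);
    if (strcmp(open, "li") == 0)
        return strcmp(incoming, "li") == 0;
    if (strcmp(open, "dt") == 0 || strcmp(open, "dd") == 0)
        return strcmp(incoming, "dt") == 0 || strcmp(incoming, "dd") == 0;
    return false;
}
```

So `<p>one<p>two` closes the first paragraph when the second `<p>` arrives, while `<p>one<span>two` leaves it open.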

It ends up needing a ridiculously lengthy state machine that you are supposed to run every character through. The upside is it can parse some really shitty HTML, but the downside is that the state machine is supposed to emit tokens that build up a tree that can be modified by later tokens, so you're supposed to wait until you have the whole tree before turning tokens into elements.

But as this article suggests, it's usually OK just to print them as you get them. In my System 6 browser I did the same, because sometimes the entire tree wouldn't even fit into memory, or parsing took so long that the user was left looking at a blank screen. It's hard to do things at 8 MHz with ~2 MB of memory.
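
A minimal sketch of the print-as-you-go approach, assuming a drastically simplified tokenizer and not the spec's state machine: every character goes through a three-state machine, text is handed back for display immediately, and no tree is built. The names `streamer`, `streamer_init`, and `streamer_feed` are invented here. It only knows lowercase `<br>` and `</p>` as line breaks, and it ignores entities, comments, scripts, and `>` inside attribute values.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical three-state tokenizer; the real one has dozens of states. */
enum state { ST_TEXT, ST_TAG_NAME, ST_TAG_REST };

struct streamer {
    enum state state;
    char name[8];   /* tag name being read; longer names are truncated */
    size_t len;
};

void streamer_init(struct streamer *s)
{
    s->state = ST_TEXT;
    s->len = 0;
    s->name[0] = '\0';
}

/*
 * Feed one input character. Returns a character to display right away,
 * or -1 if there is nothing to display for this input.
 */
int streamer_feed(struct streamer *s, char c)
{
    if (s->state == ST_TEXT) {
        if (c == '<') {
            /* Start of a tag: reset the name buffer. */
            s->state = ST_TAG_NAME;
            s->len = 0;
            s->name[0] = '\0';
            return -1;
        }
        return (unsigned char)c;    /* plain text goes straight out */
    }

    if (c == '>') {
        /* End of tag: back to text, maybe emitting a line break. */
        s->state = ST_TEXT;
        if (strcmp(s->name, "br") == 0 || strcmp(s->name, "/p") == 0)
            return '\n';
        return -1;
    }

    if (s->state == ST_TAG_NAME) {
        if (c == ' ' || c == '\t' || c == '\n' || (c == '/' && s->len > 0)) {
            /* Name is done; skip attributes until '>'. */
            s->state = ST_TAG_REST;
        } else if (s->len + 1 < sizeof(s->name)) {
            s->name[s->len++] = c;
            s->name[s->len] = '\0';
        }
    }
    return -1;
}
```

The caller loops over whatever bytes have arrived so far and prints any return value that is not -1, so `Hi <b>there</b><br>you` shows up as "Hi there", a line break, then "you". Memory use stays fixed at the size of the struct no matter how big the page is, which is the point on a machine like that.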