Porting an HTML5 Parser to Swift
9 points by juri
They say they've swapped out character iteration for byte iteration, but didn't mention whether this still handles Unicode properly. I assume since the tests pass it does, but it would be nice to have some actual discussion of where these improvements are applicable.
Since all special characters in HTML syntax are ASCII, you can treat any non-ASCII characters as a black box.
For example, <p>المحتوى</p> can be parsed by byte iteration because < and > are ASCII characters, so the parser can confidently recognize them even when it handles the UTF-8 string as raw bytes.
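UTF-8 is designed so that every byte of a multi-byte scalar has its high bit set (0x80 or above), which means a byte that looks like an ASCII character can never appear inside the encoding of a non-ASCII scalar.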
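A quick way to convince yourself of that property, as a minimal sketch using only the standard library (the `sample` and `allHighBitSet` names are just for illustration):

```swift
// The Arabic word from the example above: 7 scalars, each 2 bytes in UTF-8.
let sample = "المحتوى"

// Every code unit of a non-ASCII scalar is 0x80 or above,
// so none of them can be mistaken for "<", ">", "&", etc.
let allHighBitSet = sample.utf8.allSatisfy { $0 >= 0x80 }

print(sample.utf8.count)  // 14
print(allHighBitSet)      // true
```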
So you can treat anything that isn't part of HTML syntax as a black box. You can accumulate the bytes between delimiters, like المحتوى, and they'll remain valid UTF-8 sequences without the lexer having to identify the boundaries of each Unicode scalar within them.
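To make that concrete, here is a toy sketch of the idea. This is not the code from the article: `Token` and `tokenize` are hypothetical names, and the only delimiters it knows are < and >, so it ignores attributes, entities, comments, and everything else a real HTML5 tokenizer has to handle.

```swift
// Hypothetical token type for this sketch only.
enum Token: Equatable {
    case tag(String)   // bytes between "<" and ">"
    case text(String)  // bytes between tags
}

// Walks the UTF-8 bytes and splits only on the ASCII bytes "<" and ">".
// Everything else is accumulated without being inspected.
func tokenize(_ html: String) -> [Token] {
    let lt = UInt8(ascii: "<")
    let gt = UInt8(ascii: ">")
    var tokens: [Token] = []
    var buffer: [UInt8] = []
    var inTag = false

    for byte in html.utf8 {
        if byte == lt && !inTag {
            // Flush any text collected before the tag opened.
            if !buffer.isEmpty {
                tokens.append(.text(String(decoding: buffer, as: UTF8.self)))
                buffer.removeAll(keepingCapacity: true)
            }
            inTag = true
        } else if byte == gt && inTag {
            tokens.append(.tag(String(decoding: buffer, as: UTF8.self)))
            buffer.removeAll(keepingCapacity: true)
            inTag = false
        } else {
            // Non-ASCII bytes land here and are never examined individually.
            buffer.append(byte)
        }
    }
    // Leftover bytes (including an unclosed tag) are emitted as text.
    if !buffer.isEmpty {
        tokens.append(.text(String(decoding: buffer, as: UTF8.self)))
    }
    return tokens
}

let tokens = tokenize("<p>المحتوى</p>")
print(tokens)
```

The Arabic text comes back intact as a single text token even though the loop never decoded a scalar; decoding happens once, when the accumulated bytes are turned back into a `String`.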