RFC 9839 and Bad Unicode

43 points by carlana


lor_louis

It looked fun enough, so I wrote an implementation in Rust (https://github.com/lorlouis/rfc9839-rs). I think there is serious potential to make it really fast, possibly by using SIMD.
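
For a sense of what such a check involves, here is a minimal scalar sketch (not taken from the linked repo) of the "Unicode Assignables" subset as I read RFC 9839. The function names are made up for illustration, and the ranges should be checked against the RFC text before relying on them.

```rust
/// Hypothetical helper: true if `c` is in the "Unicode Assignables" subset
/// as I understand RFC 9839 (verify the ranges against the RFC itself).
/// A Rust `char` can never be a surrogate, so that case needs no check here.
fn is_unicode_assignable(c: char) -> bool {
    let cp = c as u32;
    match cp {
        // Tab, LF and CR are the only C0 controls that are kept.
        0x09 | 0x0A | 0x0D => true,
        // Remaining C0 controls, DEL, and the C1 controls.
        0x00..=0x1F | 0x7F..=0x9F => false,
        // Noncharacters U+FDD0 through U+FDEF.
        0xFDD0..=0xFDEF => false,
        // The last two code points of every plane (U+xFFFE, U+xFFFF)
        // are noncharacters; everything else is accepted.
        _ => (cp & 0xFFFE) != 0xFFFE,
    }
}

/// Hypothetical helper: byte offset of the first character outside the subset.
fn first_problem(s: &str) -> Option<usize> {
    s.char_indices()
        .find(|&(_, c)| !is_unicode_assignable(c))
        .map(|(i, _)| i)
}
```

Calling `first_problem` on a string with an embedded NUL returns the byte offset of that NUL, and `None` for ordinary text. A SIMD version would presumably do the same range tests on many bytes at once, but that is beyond this sketch.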

olliej

The underlying problem with JSON specifically is that it was designed as a subset of what JS object literals (the ostensible source) actually permit, because rather than actually parsing it, Crockford just wanted to use eval(). That is not safe, so he wanted to validate it with a regex first.

That’s what caused the restrictions on undefined, NaN, Infinity, comments, and on and on.

If he’d just opted to actually parse it properly (as every JS engine does, to the extent that, as far as I know, at least JSC preemptively parses all strings with the JSON parser first [in a lax mode]), almost all of these restrictions that people complain about today would not have been necessary.

But because he decided to validate with a regex rather than parse it, the grammar had to be constrained to what can be “validated” as safe for eval.

Things like unpaired surrogates similarly fall out of that (you can’t easily verify correct Unicode with a regex).
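
To make the surrogate point concrete, here is a minimal sketch in Rust of the check a real parser can do in one pass. It assumes the input is a slice of UTF-16 code units (which is how a JS string is modelled), and the function name is made up for illustration.

```rust
/// Hypothetical helper: index of the first unpaired surrogate in a sequence
/// of UTF-16 code units, or `None` if every surrogate is correctly paired.
fn first_unpaired_surrogate(units: &[u16]) -> Option<usize> {
    let mut i = 0;
    while i < units.len() {
        match units[i] {
            // High surrogate: must be immediately followed by a low surrogate.
            0xD800..=0xDBFF => match units.get(i + 1).copied() {
                Some(0xDC00..=0xDFFF) => i += 2,
                _ => return Some(i),
            },
            // Low surrogate with no preceding high surrogate.
            0xDC00..=0xDFFF => return Some(i),
            // Any other code unit stands on its own.
            _ => i += 1,
        }
    }
    None
}
```

The standard library's `String::from_utf16` rejects the same inputs, so in practice you would likely lean on that rather than hand-roll the loop.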

To that extent I do blame Crockford for this, but I also acknowledge that at the time he probably did not expect the long-term consequences.

llimllib

I thought YAML was supposed to be a superset of JSON - how can it restrict characters that JSON allows?