RFC 9839 and Bad Unicode
43 points by carlana
It looked fun enough, so I wrote an implementation in Rust. https://github.com/lorlouis/rfc9839-rs I think there is some serious potential to make it really fast, possibly by using SIMD.
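For anyone who hasn't read the RFC, here is a rough sketch of what such an implementation checks. It is not taken from the linked repository. The function names are made up for illustration, and the code point ranges are my reading of the three subsets RFC 9839 defines (Unicode Scalars, XML Characters, Unicode Assignables), so verify them against the RFC before relying on them.

```python
# Hypothetical helpers; ranges are an assumption based on my reading of RFC 9839.

def is_unicode_scalar(cp: int) -> bool:
    """Any code point except the surrogates U+D800..U+DFFF."""
    return 0x0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def is_xml_char(cp: int) -> bool:
    """Scalars minus most C0 controls and the noncharacters U+FFFE/U+FFFF."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_unicode_assignable(cp: int) -> bool:
    """Excludes surrogates, legacy controls (except tab, LF, CR) and noncharacters."""
    if cp in (0x9, 0xA, 0xD):
        return True
    if 0x20 <= cp <= 0x7E or 0xA0 <= cp <= 0xD7FF:
        return True
    if 0xE000 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFFD:
        return True
    # Planes 1..16, minus the last two code points of each plane.
    return 0x10000 <= cp <= 0x10FFFF and (cp & 0xFFFF) < 0xFFFE


def all_assignable(text: str) -> bool:
    """True if every code point in text is in the Unicode Assignables subset."""
    return all(is_unicode_assignable(ord(ch)) for ch in text)
```

Since each check is just a handful of range comparisons per code point, it is easy to see why a vectorised version looks attractive.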
The underlying problem with JSON specifically is that it was designed as a subset of what was actually permitted by JS object literals (the ostensible source), because rather than actually parsing it, Crockford just wanted to use eval(). That is not safe, so he wanted to validate it with a regex first.
That’s what caused the restrictions on undefined, NaN, Infinity, comments, and so on.
If he’d just opted to actually parse it properly (as every JS engine does; as far as I know, at least JSC preemptively parses all strings with the JSON parser first [in a lax mode]), almost all of these restrictions that people complain about today would not have been necessary.
But because he decided to validate with regex rather than parse it, the grammar has to be constrained to what can be “validated” as safe for eval.
Things like unpaired surrogates similarly fall out of that (you can’t easily verify correct Unicode with a regex).
To that extent I do blame Crockford for this, but I also acknowledge that at the time he probably did not expect the long term consequences.
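To make the unpaired-surrogate point concrete, here is a minimal sketch using Python's standard json module, which (as far as I know) accepts a lone `\ud800` escape and hands back a string containing a surrogate code point. The helper name `has_lone_surrogate` is made up for illustration.

```python
import json


def has_lone_surrogate(s: str) -> bool:
    """True if s contains any code point in the surrogate range U+D800..U+DFFF.

    json.loads combines a valid high/low escape pair into one code point,
    so any surrogate still present after decoding was unpaired in the input.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


lone = json.loads('"\\ud800"')           # grammatically valid JSON, unpaired surrogate
pair = json.loads('"\\ud83d\\ude00"')    # a proper surrogate pair

print(has_lone_surrogate(lone))
print(has_lone_surrogate(pair))
```

The JSON grammar is satisfied in both cases; the difference only shows up once you look at the decoded code points, which is the kind of check a grammar-level validation does not do.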
I thought YAML was supposed to be a superset of JSON - how can it restrict characters that JSON allows?
Most YAML parsers implement YAML 1.1 which doesn’t claim to be a superset of JSON. YAML 1.2 documents need a directive to opt in to non-1.1 behaviour which means that a JSON document can’t be YAML 1.2.
YAML 1.2 documents need a directive to opt in to non-1.1 behaviour
That is not what the specification says:
A version 1.2 YAML processor must accept documents with an explicit “%YAML 1.2” directive, as well as documents lacking a “YAML” directive. Such documents are assumed to conform to the 1.2 version specification.
YAML 1.2 is intended to be a JSON superset and, as far as I know, it achieves that. (Except for not allowing duplicate keys in objects, which, however, is a restriction that was removed a few years back.)
As for the original question:
how can it restrict characters that JSON allows?
As far as I can see it does not:
To ensure JSON compatibility, YAML processors must allow all non-C0 characters inside quoted scalars. To ensure readability, non-printable characters should be escaped on output, even inside such scalars.
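A small sketch of why that works out, assuming I am reading both grammars correctly: RFC 8259 lets a JSON string contain, unescaped, anything from U+0020 up except `"` and `\`, while YAML 1.2's JSON-compatible character production allows tab plus everything from U+0020 up. The function names are made up for illustration.

```python
def json_unescaped_ok(cp: int) -> bool:
    """Assumed RFC 8259 rule: unescaped = %x20-21 / %x23-5B / %x5D-10FFFF."""
    return 0x20 <= cp <= 0x10FFFF and cp not in (0x22, 0x5C)


def yaml_quoted_ok(cp: int) -> bool:
    """Assumed YAML 1.2 rule for quoted scalars: tab, or any non-C0 character."""
    return cp == 0x09 or 0x20 <= cp <= 0x10FFFF


def json_subset_of_yaml() -> bool:
    """Check every code point: whatever JSON allows raw, YAML allows too."""
    return all(yaml_quoted_ok(cp) for cp in range(0x110000) if json_unescaped_ok(cp))
```

Under those assumptions every character JSON permits raw inside a string is also permitted inside a YAML 1.2 quoted scalar, so YAML is not restricting anything JSON allows there.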