HTML spec change: escaping < and > in attributes
37 points by jmillikin
mXSS is a terrible flaw in the syntax. The interaction between HTML, XML, and error recovery is mind-bogglingly complex, in more evil ways than it seems. Correctly implemented parsing of HTML can result in a DOM tree that can’t be expressed in valid HTML syntax, and to serialize it you’d have to reason backwards into what invalid states you’d need to introduce earlier to land at this. It’s almost like reversing a hash.
It’s a shame the problem wasn’t caught when HTML5 added SVG support.
It gets worse. With DOM APIs (createElement, append) you can create HTML trees that are invalid and impossible to achieve with a parser. Like a <script> element with child nodes. (Exception: in the SVG namespace, a script element can have children, and JS needs to be entity-encoded or in CDATA.)
Which is why, basically, setting innerHTML is dangerous even when round-tripping the output of an innerHTML getter.
An upcoming HTML Sanitizer API will avoid this by offering Element.setHTML(). The functionality is best described as “like innerHTML=”, but without XSS.
It can do this by using the builtin browser parser (no ambiguity) and limiting output to known safe elements and attributes.
For clarification, your “Like a <script> element” has been turned into “Like a <!-- raw HTML omitted --> element” by lobsters
That’s not our config, I don’t think, probably from CommonMarker. There’s an open issue to update to current, which may have better behavior, and in any case would be a prereq to considering changing that behavior. Maybe you, dear reader, would like to improve this?
It’s a shame the problem wasn’t caught when HTML5 added SVG support.
I took a look at the original bug linked: https://github.com/whatwg/html/issues/6235
The obvious problem seems to be that it’s possible to escape from in between <svg></svg> into HTML.
<svg><p></p><style><a title="</style><img src onerror=alert(1)>"></style></svg>
However, after reparsing a different DOM tree is created:
…
Leading to cross-site scripting. The reason for that is in fact that p breaks out of foreign content.
So I’m curious about a few things
<svg> parsing
I’m not an XML fan, but yeah, enforcing nesting is not a bad idea … especially when mixing HTML and SVG!
Also, XML enforces quoting of <>, though yeah I get why that style did not “win”
That issue is from 2020. The syntax was designed in 2007-2008. https://annevankesteren.nl/2007/10/svg-html
IIRC the big issue back then was interoperability, so whatever browsers agreed on was good. This had to include compatibility or fallback for Internet Explorer. So I bet <p> breaks out of SVG, because that’s what IE did.
I don’t know if it’s fixable now, since tweaking it would lose interoperability, and could make the problem even worse when old and new parsers disagree.
TIL! Updated my library https://github.com/yawaramin/dream-html/commit/daf0a6919ead32073361462ca29f81277fb4cb24
What? I thought < and > always had to be escaped in attributes. When did that change, and why is it only now being fixed?
Disclaimer: I haven’t done web work as a job since the late 90s.
They never needed to be escaped in attributes. In non-pathological cases you only need to escape the quote char for safety, and escaping the quote char and & is all you need for correctness.
The problem is that there are edge cases in HTML syntax that can change the meaning of the markup, so that the content of attributes ends up being parsed as top-level content. Then you get garbage, but with < escaped you get garbage that is slightly harder to exploit.
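To make the difference concrete, here is a minimal Python sketch of attribute-value escaping with and without the extra step. The function name `escape_attr` and its `escape_angle_brackets` flag are made up for illustration; the replacements follow my reading of the spec's string-escaping steps for attribute mode (which also turn U+00A0 into `&nbsp;`), so treat it as an approximation rather than a reference implementation.

```python
def escape_attr(value: str, escape_angle_brackets: bool = True) -> str:
    """Escape a string for use inside a double-quoted HTML attribute.

    Hypothetical helper: escape_angle_brackets=False approximates the
    older serialization behavior, True the behavior discussed here.
    """
    # "&" must be replaced first so the entities added below are not
    # themselves escaped a second time.
    out = value.replace("&", "&amp;")
    out = out.replace("\u00a0", "&nbsp;")
    out = out.replace('"', "&quot;")
    if escape_angle_brackets:
        out = out.replace("<", "&lt;").replace(">", "&gt;")
    return out


# The attribute payload from the example quoted earlier in the thread.
payload = "</style><img src onerror=alert(1)>"
old_style = escape_attr(payload, escape_angle_brackets=False)
new_style = escape_attr(payload)
```

With the flag off, the payload comes back unchanged, so a literal `</style>` survives in the serialized markup. With it on, no `<` or `>` is left for a confused reparse to pick up.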
They never needed to be escaped in attributes.
To be pedantic, they do need to be escaped in unquoted attributes, which are a thing.
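As a rough illustration of the unquoted case: as far as I know, the HTML syntax forbids ASCII whitespace, `"`, `'`, `=`, `<`, `>` and a backtick in an unquoted attribute value, and the value can't be empty. A serializer that wants to emit unquoted attributes has to check for those and fall back to quoting. The names `can_be_unquoted` and `render_attr` below are hypothetical, and this is a sketch, not what any particular library does.

```python
# Characters that may not appear in an unquoted attribute value:
# ASCII whitespace, double quote, single quote, "=", "<", ">", backtick.
_FORBIDDEN_UNQUOTED = set("\t\n\f\r \"'=<>`")


def can_be_unquoted(value: str) -> bool:
    """True if value could be written without surrounding quotes."""
    return value != "" and not any(ch in _FORBIDDEN_UNQUOTED for ch in value)


def render_attr(name: str, value: str) -> str:
    """Render name=value, quoting only when the value requires it."""
    # "&" is escaped in both forms so it cannot start a character reference.
    escaped = value.replace("&", "&amp;")
    if can_be_unquoted(value):
        return f"{name}={escaped}"
    return f'{name}="' + escaped.replace('"', "&quot;") + '"'
```

So `render_attr("id", "main")` can stay unquoted, while a value containing `>` is forced into the quoted form.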
mXSS is a terrible flaw in the syntax. The interaction between HTML, XML, and error recovery is mind-bogglingly complex, in more evil ways than it seems. Correctly implemented parsing of HTML can result in a DOM tree that can’t be expressed in valid HTML syntax, and to serialize it you’d have to reason backwards into what invalid states you’d need to introduce earlier to land at this. It’s almost like reversing a hash.
I don’t recall ever having to escape them when I’ve written HTML. But I don’t recall having them in attributes until relatively recently (ca. 2010 or later).
And by my read, this isn’t saying they need to be escaped in HTML source. Only that the DOM APIs that let you access HTML source will now escape them, even though the DOM APIs that let you access the specific attribute values will continue to not do so.
To be clear, the change discussed here is only about serializing attributes back into strings. No parser change.
Yes, but if you didn’t know that <div data-content="<u>hello</u>"></div> was valid in the first place, the entire article is quite a surprise :)
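If you want to see this for yourself without a browser, here is a small Python sketch using the standard library's `html.parser`. Caveat: that module is a lenient parser, not an implementation of the HTML spec's algorithm, so it only illustrates the idea. `AttrCollector`, `parse_attrs` and `serialize_attr_value` are names invented for this example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects (tag, attrs) for every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def parse_attrs(markup: str):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


def serialize_attr_value(value: str) -> str:
    # Approximates the new serialization: "<" and ">" are escaped too.
    return (value.replace("&", "&amp;").replace('"', "&quot;")
                 .replace("<", "&lt;").replace(">", "&gt;"))


source = '<div data-content="<u>hello</u>"></div>'
parsed = parse_attrs(source)

# Re-serialize the one attribute with the extra escaping, then parse again.
name, value = parsed[0][1][0]
reserialized = f'<div {name}="{serialize_attr_value(value)}"></div>'
reparsed = parse_attrs(reserialized)
```

Only one start tag (`div`) is reported for `source`; the `<u>` is just attribute data. The re-serialized string looks different, but parsing it gives back the same attribute value, which matches the point that only serialization changes, not parsing.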
Having written an HTML tokenizer in the past (according to the spec), it’s obvious in hindsight, but at the time, I didn’t even think about it. Specifically, if you’re in a quoted attribute state, the only real “error” state is a premature EOF. See sections 13.2.5.36 and 13.2.5.37. The tokenizer basically scans for an escape sequence or the closing quote - nothing else (replacing NULL with U+FFFD is a common action in many states, so I’m not counting it). So obviously, you can shove tags in there!
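A stripped-down Python sketch of that scanning loop, for anyone curious. It is a simplification under stated assumptions: `scan_double_quoted` and `PrematureEOF` are made-up names, and character references are decoded in one pass with `html.unescape` at the end instead of inside the loop, which does not match the spec's attribute-specific rules exactly.

```python
import html
from typing import Tuple


class PrematureEOF(Exception):
    """Input ended before the closing quote was found."""


def scan_double_quoted(source: str, pos: int) -> Tuple[str, int]:
    """Scan a double-quoted attribute value.

    pos is the index just after the opening quote. Returns the decoded
    value and the index just after the closing quote.
    """
    chars = []
    while pos < len(source):
        ch = source[pos]
        if ch == '"':
            # Closing quote: decode character references (approximation).
            return html.unescape("".join(chars)), pos + 1
        # NULL becomes U+FFFD; every other character, including "<" and
        # ">", is appended to the value as-is.
        chars.append("\ufffd" if ch == "\x00" else ch)
        pos += 1
    raise PrematureEOF("end of input inside attribute value")


src = '<div data-content="<u>hello</u>"></div>'
start = src.index('"') + 1
value, end = scan_double_quoted(src, start)
```

Nothing in the loop looks at `<` or `>`, which is why the tags end up inside the value untouched.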