HTML spec change: escaping < and > in attributes

37 points by jmillikin

kornel

mXSS is a terrible flaw in the syntax. The interaction between HTML, XML, and error recovery is mind-bogglingly complex, in more evil ways than it seems. Correctly implemented parsing of HTML can result in a DOM tree that can’t be expressed in valid HTML syntax, and to serialize it you’d have to reason backwards into what invalid states you’d need to introduce earlier to land at this. It’s almost like reversing a hash.

It’s a shame the problem hasn’t been caught when HTML5 added SVG support.

freddyb

It gets worse. You can create HTML trees with DOM APIs (createElement, append) that are invalid and impossible to achieve with a parser. Like a <script> element with child nodes. (Exception: in SVG namespaces, a script element can have children, and JS needs to be entity-encoded or in CDATA.)

Which is why, basically, setting innerHTML is dangerous when round-tripping from the output of an innerHTML getter.

An upcoming HTML Sanitizer API will avoid this by offering Element.setHTML(). The functionality is best described as “like innerHTML=”, but without XSS.

It can do this by using the built-in browser parser (no ambiguity) and limiting output to known safe elements and attributes.
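
As a rough illustration of the round-tripping hazard, the Python sketch below serializes an attribute value with and without escaping < and >, then re-parses the result. Assumptions: Python's standard html.parser stands in for a browser parser (it is not a spec-compliant HTML5 parser and knows nothing about SVG or foreign content; it only treats style content as raw text), and escape_attr, StartTagCollector and start_tags_after_reparse are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


def escape_attr(value, escape_angle_brackets):
    """Escape an attribute value for serialization inside double quotes."""
    out = (
        value.replace("&", "&amp;")
        .replace("\u00a0", "&nbsp;")
        .replace('"', "&quot;")
    )
    if escape_angle_brackets:
        # The extra step this thread is about: also escape < and >.
        out = out.replace("<", "&lt;").replace(">", "&gt;")
    return out


class StartTagCollector(HTMLParser):
    """Records the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags_after_reparse(title, escape_angle_brackets):
    # Pretend a serializer wrote an <a title="..."> inside a <style> element,
    # and that the result is parsed again where <style> is a raw text element.
    markup = '<style><a title="{}"></a></style>'.format(
        escape_attr(title, escape_angle_brackets)
    )
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


payload = "</style><img src onerror=alert(1)>"
print(start_tags_after_reparse(payload, escape_angle_brackets=False))
print(start_tags_after_reparse(payload, escape_angle_brackets=True))
```

Without the extra escaping, the text inside the title attribute closes the style element on re-parse and an img start tag shows up. With it, style is the only start tag reported.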
- FreeFull
  
  For clarification, your “Like a <script> element” has been turned into “Like a <!-- raw HTML omitted --> element” by lobsters
  - pushcx
    
    That’s not our config, I don’t think, probably from CommonMarker. There’s an open issue to update to current, which may have better behavior, and in any case would be a prereq to considering changing that behavior. Maybe you, dear reader, would like to improve this?
- andyc
  It’s a shame the problem hasn’t been caught when HTML5 added SVG support.
  
  I took a look at the original bug linked: https://github.com/whatwg/html/issues/6235
  
  The obvious problem seems to be that it’s possible to escape from in between <svg></svg> into HTML.
  
  <svg><p></p><style><a title="</style><img src onerror=alert(1)>"></style></svg>
  
  However, after reparsing a different DOM tree is created:
  
  …
  
  Leading to cross-site scripting. The reason for that is in fact that p breaks out foreign content.
  
  So I’m curious about a few things
  
  Is it possible to change this now? They “broke” outerHTML, so maybe they can break <svg> parsing
  
  Why was this rule introduced in the first place?
  
  I’m not an XML fan, but yeah enforcing nesting is not a bad idea … especially when mixing HTML and SVG !!!
  
  Also, XML enforces quoting of <>, though yeah I get why that style did not “win”
  - kornel
    
    That issue is from 2020. The syntax was designed in 2007-2008. https://annevankesteren.nl/2007/10/svg-html
    
    IIRC the big issue back then was interoperability, so whatever browsers agreed on was good. This had to include compatibility or fallback for Internet Explorer. So I bet <p> breaks out of SVG, because that’s what IE did.
    
    I don’t know if it’s fixable now, since tweaking it would lose interoperability, and could make the problem even worse when old and new parsers disagree.
- yawaramin
  
  TIL! Updated my library https://github.com/yawaramin/dream-html/commit/daf0a6919ead32073361462ca29f81277fb4cb24
- spc476
  
  What? I thought < and > always had to be escaped in attributes. When did that change, and why is it only now being fixed?
  
  Disclaimer: I haven’t done web work as a job since the late 90s.
  - kornel
    
    They never needed to be escaped in attributes. In non-pathological cases you only need to escape the quote char for safety, and escaping the quote char and & is all you need for correctness.
    
    The problem is that there are edge cases in HTML syntax that can change the meaning of the markup, and end up parsing content of attributes as top-level content. Then you get garbage, but with < escaped you get slightly harder to exploit garbage.
    
    bentley
    
    They never needed to be escaped in attributes.
    
    To be pedantic, they do need to be escaped in unquoted attributes, which are a thing.
  - hoistbypetard
    
    mXSS is a terrible flaw in the syntax. The interaction between HTML, XML, and error recovery is mind-bogglingly complex, in more evil ways than it seems. Correctly implemented parsing of HTML can result in a DOM tree that can’t be expressed in valid HTML syntax, and to serialize it you’d have to reason backwards into what invalid states you’d need to introduce earlier to land at this. It’s almost like reversing a hash.
    
    I don’t think I ever recall having to escape them when I’ve written HTML. But I don’t recall having them in attributes until relatively recently. (ca 2010 or later)
    
    And by my read, this isn’t saying they need to be escaped in HTML source. Only that the DOM APIs that let you access HTML source will now escape them, even though the DOM APIs that let you access the specific attribute values will continue to not do so.
  - freddyb
    
    To be clear, the change discussed here is only about serializing attributes back into strings. No parser change.
    
    hobbified
    
    Yes, but if you didn’t know that <div data-content="<u>hello</u>"></div> was valid in the first place, the entire article is quite a surprise :)
    
    colejohnson66
    
    Having written an HTML tokenizer in the past (according to the spec), it’s obvious in hindsight, but at the time, I didn’t even think about it. Specifically, if you’re in a quoted attribute state, the only real “error” state is a premature EOF. See sections 13.2.5.36 and 13.2.5.37. The tokenizer basically scans for an escape sequence or the closing quote - nothing else (replacing NULL with U+FFFD is a common action in many states, so I’m not counting it). So obviously, you can shove tags in there!