dmsnell 8 days ago

The other comment explains this, but I think it can also be viewed differently.

It’s helpful to recognize that the inner script tags are not actual script tags. Yes, once entering a script element, the browser switches parsers and wants to skip everything until a closing script tag appears. The STYLE element, TITLE, TEXTAREA, and a few others do this. Once they chop up the HTML like this they send the contents to the separate inner parser (in this case, the JS engine). SCRIPT is unique due to the legacy behavior^1.

HTML5 specifies these “inner” tags as transitions into escape modes. The entire goal is to allow JavaScript to contain the string “</script>” without it leaking to the outer parser. The early pattern of hiding inside an HTML comment is what determined the escaping mechanism rather than making some special syntax (which today does exist as noted in the post).

The opening script tag inside the comment is actually what triggers the escaping mode, and so it’s less an HTML tag and more some kind of pseudo JS syntax. The inner closing tag is therefore the escaped string value and simultaneously closes the escaped mode.

Consider the use of double quotes inside a string. We have to close the outer quote, but if the inner quote is escaped like `\"` then we don’t have to close it; it’s merely data and not syntax.

There is only one level of nesting, and eight opening tags would still be “closed” by the single closing tag.
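
To make the single level of nesting concrete, the three states involved (plain script data, escaped, double-escaped) can be modeled as a small state machine. What follows is a simplified Python sketch, not the tokenizer from the HTML standard: the function name `script_content_end` and the state labels are invented for illustration, and edge cases such as EOF handling are ignored.

```python
import re

# After the tag name the tokenizer expects whitespace, "/" or ">".
_DELIM = r"(?=[\t\n\f\r />])"
_OPEN = rf"(?P<open><script{_DELIM})"
_CLOSE = rf"(?P<close></script{_DELIM})"

# What each (hypothetical) state looks for next.
_PATTERNS = {
    "data": re.compile(rf"(?P<comment><!--)|{_CLOSE}", re.I),
    "escaped": re.compile(rf"(?P<uncomment>-->)|{_OPEN}|{_CLOSE}", re.I),
    "double": re.compile(rf"(?P<uncomment>-->)|{_CLOSE}", re.I),
}

# State transitions that do NOT end the script element.
_NEXT = {
    ("data", "comment"): "escaped",
    ("escaped", "uncomment"): "data",
    ("escaped", "open"): "double",
    ("double", "uncomment"): "data",
    ("double", "close"): "escaped",  # the inner closing tag only un-escapes
}


def script_content_end(text: str) -> int:
    """Return the index where the script content ends, or -1 if it never does.

    `text` is everything after the real opening <script> tag.
    """
    state, pos = "data", 0
    while True:
        m = _PATTERNS[state].search(text, pos)
        if m is None:
            return -1
        kind = m.lastgroup
        if kind == "close" and state != "double":
            return m.start()
        # Resume just after "<!" so that "<!-->" can close immediately.
        pos = m.start() + 2 if kind == "comment" else m.end()
        state = _NEXT[(state, kind)]
```

In this model, `"<!--" + "<script>" * 8 + "</script>x</script>tail"` ends at the second `</script>`: the first opener switches to the double-escaped state, the other seven are ignored, and the first closing tag merely drops back to the escaped state.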

^1: (edit) This is one reason HTML and XML (XHTML) are incompatible. The content of SCRIPT and STYLE elements is essentially just bytes. In XML it must be well-formed markup. XML parsers cannot parse HTML.

tannhaeuser 8 days ago | parent | next [-]

Whoever the idiot was who came up with piling inline CSS and JS into the already heavy SGML syntax of HTML should've considered his career choices. It would've been perfectly adequate to require script and CSS to be put into external "resources" linked via src/href, especially since the spec proposals operated under the assumption there would be multiple script and styling languages going forward (like, hey, if we have one markup and styling language, why not have two or multiple?). When in fact the rules were quite simple: in SGML, text rendered to the reader goes into content, everything else, including formatting properties, goes into attributes. The reason for introducing this inlining misfeature was probably the desire to avoid network roundtrips, which would've later been made bogusly obsolete by Google's withdrawn HTTP/2 push spec, but also the bizarre idea anyone except webdev bloggers would be editing HTML+CSS by hand. To think there was a committee overseeing such blunders as "W3C recommendations" - actually, they screwed up again with CSS when they allowed unencoded inline data URLs such as used for SVG backgrounds and the like. The alarm bells should've been ringing at the latest the moment they seriously considered storing markup within CSS like with the abovementioned misfeature but also with the "content:" CSS property. You know, as in "recommendation", which is what W3C final-stage specs were called.

socalgal2 8 days ago | parent | next [-]

All of those are features, not bugs and I'm glad they are there. Uploading and dealing with 1 file is much nicer than dealing with several.

jiggawatts 8 days ago | parent [-]

> much nicer than dealing with several.

"My momentary convenience trumps the man-millennia of effort required to protect billions of people from script injection attacks."

porridgeraisin 8 days ago | parent [-]

Not just his convenience. Man-millennia of convenience, if you will ;) I too love the fact that many things can be single index.html's, no need of a zip file then. It's double-click to view. One of the best things about the web platform.

Edit: and "effort", please. The spec has a simple and clear note:

> The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape an ASCII case-insensitive match for "<!--" as "\x3C!--", "<script" as "\x3Cscript", and "</script" as "\x3C/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions. Doing so avoids the pitfalls that the restrictions in this section are prone to triggering.

Backwards compatibility is easily and completely worth this small amount of effort. It's a one-liner in most languages.
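
As a rough illustration of that one-liner, here is a minimal Python sketch of the escaping the spec note describes. The function name `escape_for_inline_script` is made up, and the sketch assumes its input is text that will sit inside a JavaScript string literal, regular expression, or comment; `\x3C` is a valid escape there but not in strict JSON, where `\u003C` would be the analogous form.

```python
import re

# "<" when it starts "<!--", "<script" or "</script" (ASCII case-insensitive).
_RISKY_LT = re.compile(r"<(?=!--|/?script)", re.IGNORECASE)


def escape_for_inline_script(js_literal_text: str) -> str:
    """Replace the risky "<" with the JavaScript escape \\x3C."""
    return _RISKY_LT.sub(lambda _m: "\\x3C", js_literal_text)
```

Other uses of `<`, such as `a < b`, pass through unchanged, so the function leaves ordinary comparisons alone.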

tannhaeuser 8 days ago | parent [-]

The easiest and safest way to avoid the rather strange restrictions described is to not make use of inline script in a way that makes those restrictions necessary, though. And a "recommendation" should reflect that (from back when HTML recommendations were actually published rather than random Google shills writing whatever on github). The suggested workaround is also not without criticism (eg [1]).

[1]: https://uploadcare.com/blog/vulnerability-in-html-design/

robocat 7 days ago | parent | prev [-]

> It would've been perfectly adequate to require script and CSS to be put into external "resources" linked via src/href

Bullshit - Navigator and IE didn't have HTTP/2. I'm guessing you didn't use dialup where your external CSS or JavaScript regularly failed to load. You didn't add extra dependencies because IE would only have two concurrent connections to load files.

It's easy to criticize past mistakes from your armchair: but I suggest you try and be a little more fair towards the people that made decisions especially when overall HTML has been a resounding success.

tannhaeuser 7 days ago | parent [-]

I suggest you try and check what the people you're accusing of armchair attitudes in fact were and are doing to solve problems.

Have you done even a single thing in the markup community?

robocat 7 days ago | parent [-]

Sorry - I shouldn't be so flippant.

Engineers hate bad compromises, and the core of engineering is making good compromises. Creating anything makes you your own critic.

edoceo 5 days ago | parent [-]

Time makes a fool of everyone.

dullcrisp 8 days ago | parent | prev [-]

Huh, it’s still confusing to me why they would have this double-escaping behavior only inside an HTML comment. Why not have it always behave one way or the other? At what point did the parsing behavior inside and outside HTML comments split and why?

dmsnell 7 days ago | parent [-]

At some point I think I read a more complete justification, but I can’t find it now. There is evidence that it came about as a byproduct of the interaction of the HTML parser and JS parsers in early browsers.

In this link we can see the expectation that the HTML comment surrounds a call to document.write() which inserts a new SCRIPT element. The tags are balanced.

https://stackoverflow.com/questions/236073/why-split-the-scr...

In this HTML 4.01 spec, it’s noted to use HTML comments to hide the script contents from render, which is where we start to get the notion of using these to hide markup from display.

https://www.w3.org/TR/html401/interact/scripts.html

Some drafts of the HTML standard attempted to escape differently and didn’t have the double escape state.

https://www.w3.org/TR/2016/WD-html52-20161206/semantics-scri...

My guess is that at some point the parsers looked for balanced tags, as evidenced in the note in the last link above, but then practical issues with improperly-generated scripts led to the idea that a single SCRIPT closing tag ends the escaping. Maybe people were attempting to concatenate script contents wrong and getting stacks of opening tags that were never closed. I don’t know, but I suppose it’s recorded somewhere.

Many things in today’s HTML arose because of widespread issues with how people generated the content. The same is true of XML and XHTML by the way. Early XML mailing lists were full of people parsing XML with naive Perl regular expressions and suggesting that anyone who wants to “fix” broken markup should do it with string-based find-and-replace.

The main difference is that the HTML spec went in the direction of saying, _if we can agree how to handle these errors then in the face of some errors we can display some content_ and we can all do it in the same way. XML is worse in some regards: certain kinds of errors are still ambiguous and up to the parser to determine how to handle, whether they are non-recoverable or recoverable. For the non-recoverable ones, the presence of a single error destroys the entire document, like being refused a withdrawal at the bank because you didn’t cross a 7.
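
The contrast can be seen with Python's standard library, as a rough sketch only. `xml.etree.ElementTree` rejects the whole document on a well-formedness error, while `html.parser` keeps going. Note that `html.parser` is a lenient event parser, not an implementation of the HTML5 tree-building rules; the helper names `xml_accepts`, `html_text`, and `TextCollector`, and the `BROKEN` sample, are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text content and ignores tag balance entirely."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def xml_accepts(markup: str) -> bool:
    """True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def html_text(markup: str) -> str:
    """Text recovered by the lenient HTML parser, errors and all."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return "".join(parser.chunks)


# Hypothetical sample: unclosed <p> and <b> elements.
BROKEN = "<div><p>first<p>second <b>bold</div>"
```

With this sample, `xml_accepts(BROKEN)` is false because a single mismatched tag is fatal, while `html_text(BROKEN)` still recovers the text content.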

At least with HTML5, it’s agreed upon what to do when errors are present and all parsers can produce the same output document; XML parsers routinely handle malformed content and do so in different ways (though most at least provide or default to a strict mode). It’s better than the early web, but not that much better.