chrismorgan 3 days ago

• Regular indexing (also charAt and charCodeAt) is by UTF-16 code unit and produces UTF-16 code units.

• codePointAt is indexed by UTF-16 code unit, but produces Unicode code points (normally scalar values, but surrogates where ill-formed).

• String iteration (for...of, spread, Array.from) doesn’t need indexing, and thus is Unicody, not UTF-16y: it yields whole code points rather than code units.

• Approximately everything that JavaScript interacts with is actually UTF-8 now: URIs, for one, have long been UTF-8 (hence encodeURI/decodeURI/encodeURIComponent/decodeURIComponent being UTF-8y).

• Where appropriate, new work favours UTF-8 semantics.
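To make the bullets above concrete, here is a small sketch. The sample string and the variable names are my own picks for illustration, nothing more; it just runs each kind of API over one string containing a character outside the BMP.

```javascript
// Hypothetical sample: "a" followed by U+1F600, which needs two UTF-16 code units
// (the surrogate pair D83D DE00).
const sample = "a\u{1F600}";

// Index-based APIs: indexed by code unit, produce code units.
const byIndex = {
  length: sample.length,          // counts code units, not characters
  second: sample[1],              // first half of the surrogate pair
  charCode: sample.charCodeAt(1), // same code unit, as a number
};

// codePointAt: still indexed by code unit, but produces a code point.
const atPairStart = sample.codePointAt(1);  // index lands on the start of the pair
const atPairMiddle = sample.codePointAt(2); // index lands in the middle of the pair

// Iteration: no index involved, walks the string by code point.
const codePoints = [...sample];

// URI functions: percent-encode the UTF-8 bytes of the code point.
const encoded = encodeURIComponent("\u{1F600}");

console.log(byIndex.length);                // 3
console.log(byIndex.charCode.toString(16)); // d83d
console.log(atPairStart.toString(16));      // 1f600
console.log(atPairMiddle.toString(16));     // de00
console.log(codePoints.length);             // 2
console.log(encoded);                       // %F0%9F%98%80
```

Note that codePointAt(2) hands back the trailing surrogate on its own: the index is still a code unit index, so you can land in the middle of a pair.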

—⁂—

Overall, I’d say it’s most reasonable to frame it this way:

① JavaScript models strings as potentially-ill-formed UTF-16. (I prefer the word “models” to the word “represents” here, because the latter suggests a specific storage, which is not actually necessary.)

② Old parts of JavaScript depend on indexing, and use potentially-ill-formed UTF-16 code unit semantics.

③ New parts of JavaScript avoid indexing, and use Unicode semantics.
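A rough sketch of what “potentially-ill-formed” means in practice, assuming a lone high surrogate as the test input. The variable names and the helper tryEncodeURIComponent are hypothetical, and TextEncoder is my own choice of a UTF-8 boundary (it is a web platform API, also available as a global in Node.js), not something mentioned above.

```javascript
// Hypothetical ill-formed string: a high surrogate with no low surrogate after it.
const loneSurrogate = "\uD83D";

// Code unit semantics: the string simply holds one 16-bit unit.
const codeUnit = loneSurrogate.charCodeAt(0);

// codePointAt returns the surrogate itself: a code point, but not a scalar value.
const codePoint = loneSurrogate.codePointAt(0);

// Iteration also yields the lone surrogate as a single element.
const iterated = [...loneSurrogate];

// At a UTF-8 boundary the ill-formedness has to be dealt with.
// TextEncoder substitutes U+FFFD (UTF-8 bytes EF BF BD) for the lone surrogate.
const utf8Bytes = Array.from(new TextEncoder().encode(loneSurrogate));

// encodeURIComponent refuses instead and throws a URIError.
// Hypothetical helper that reports the error name rather than throwing.
function tryEncodeURIComponent(text) {
  try {
    return encodeURIComponent(text);
  } catch (err) {
    return err.name;
  }
}
const uriResult = tryEncodeURIComponent(loneSurrogate);

console.log(codeUnit.toString(16));                   // d83d
console.log(codePoint.toString(16));                  // d83d
console.log(iterated.length);                         // 1
console.log(utf8Bytes.map((b) => b.toString(16)));    // [ 'ef', 'bf', 'bd' ]
console.log(uriResult);                               // URIError
```

So the string itself happily carries the lone surrogate around; it only becomes a problem once something has to turn it into well-formed Unicode.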