dzaima 5 days ago

It's not cleanly one or the other, really. It's UCS-2-y by `str.length` or `str[i]`, but UTF-16-y by `str.codePointAt(i)` or by iteration (`[...str]` or `for (x of str)`).

Generally, though, JS strings are just sequences of 16-bit values, intrinsically neither UCS-2 nor UTF-16. But, practically speaking, UTF-16 is the description that matters for everything other than `str.length`/`str[i]`.
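To make the split concrete, here's a small sketch using `"💩"` (U+1F4A9, which lies outside the BMP and so occupies a surrogate pair in UTF-16):

```javascript
// "💩" is U+1F4A9 — outside the BMP, so it takes two UTF-16 code units.
const s = "💩";

// Code-unit ("UCS-2-y") view:
s.length;          // 2 — counts 16-bit code units, not characters
s[0];              // "\ud83d" — a lone high surrogate
s.charCodeAt(0);   // 0xD83D

// Code-point ("UTF-16-y") view:
s.codePointAt(0);  // 0x1F4A9 — decodes the surrogate pair
[...s].length;     // 1 — iteration walks code points
```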

chrismorgan 3 days ago

• Regular indexing (also charAt and charCodeAt) is by UTF-16 code unit and produces UTF-16 code units.

• codePointAt is indexed by UTF-16 code unit, but produces Unicode code points (normally scalar values, but surrogates where ill-formed).

• String iteration doesn’t need indexing, and thus is Unicody, not UTF-16y.

• Approximately everything that JavaScript interacts with is actually UTF-8 now: URIs have long been UTF-8 (hence encodeURI/decodeURI/encodeURIComponent being UTF-8y).

• Where appropriate, new work favours UTF-8 semantics.
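A quick sketch of the behaviours in the points above — code-unit indexing into `codePointAt`, lone surrogates passed through where the string is ill-formed, code-point iteration, and the UTF-8 nature of `encodeURIComponent`:

```javascript
// codePointAt is indexed by code unit but returns code points;
// index 1 lands on the high surrogate of "💩" and decodes the pair.
"a💩".codePointAt(1);     // 0x1F4A9

// On ill-formed UTF-16 (a lone surrogate), the surrogate itself comes back:
"\ud83d".codePointAt(0);  // 0xD83D

// Iteration yields whole code points — no indexing involved:
[..."a💩"];               // ["a", "💩"]

// encodeURIComponent percent-encodes the UTF-8 bytes of the string:
encodeURIComponent("é");  // "%C3%A9" — U+00E9 as two UTF-8 bytes, not "%E9"
```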

—⁂—

Overall, I’d say it’s most reasonable to frame it this way:

① JavaScript models strings as potentially-ill-formed UTF-16. (I prefer the word “models” to the word “represents” here, because the latter suggests a specific storage, which is not actually necessary.)

② Old parts of JavaScript depend on indexing, and use potentially-ill-formed UTF-16 code unit semantics.

③ New parts of JavaScript avoid indexing, and use Unicode semantics.
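As one illustration of ② versus ③ (picking `split` and the regex `u` flag as representative old and new APIs):

```javascript
const s = "💩"; // U+1F4A9, a surrogate pair in UTF-16

// ② Old, index-based APIs use code-unit semantics and can split a pair:
s.split("")[0];  // "\ud83d" — a lone surrogate
/^.$/.test(s);   // false — "." matches one code unit, s has two

// ③ Newer, iteration-based APIs use Unicode semantics:
/^.$/u.test(s);                 // true — the "u" flag matches by code point
String.fromCodePoint(0x1F4A9);  // "💩" — builds the pair from the code point
```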