dzaima | 5 days ago
It's not cleanly one or the other, really. It's UCS-2-y by `str.length` or `str[i]`, but UTF-16-y by `str.codePointAt(i)` or by iteration (`[...str]` or `for (x of str)`). Generally, though, JS strings are just a sequence of 16-bit values, intrinsically neither UCS-2 nor UTF-16. But, practically speaking, UTF-16 is the description that matters for everything other than `str.length`/`str[i]`.
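For example, a quick sketch of that split (`s` is just an illustrative value; the comments show what a console would print):

```js
// One astral character (U+1F600) is stored as a surrogate pair.
const s = "a😀";

// UCS-2-ish views: count and index raw 16-bit code units.
console.log(s.length);         // 3 ("a" plus two surrogate code units)
console.log(s[1]);             // "\ud83d" (a lone high surrogate)
console.log(s.charCodeAt(1));  // 55357 (0xD83D)

// UTF-16/Unicode-aware views: surrogate pairs are decoded.
console.log(s.codePointAt(1)); // 128512 (0x1F600)
console.log([...s]);           // ["a", "😀"] (two elements, not three)
for (const ch of s) console.log(ch); // "a", then "😀"
```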
chrismorgan | 3 days ago
• Regular indexing (also charAt and charCodeAt) is by UTF-16 code unit and produces UTF-16 code units.

• codePointAt is indexed by UTF-16 code unit, but produces Unicode code points (normally scalar values, but surrogates where ill-formed).

• String iteration doesn't need indexing, and thus is Unicody, not UTF-16y.

• Approximately everything that JavaScript interacts with is actually UTF-8 now: URIs have long been UTF-8 (hence encodeURI/decodeURI/encodeURIComponent being UTF-8y).

• Where appropriate, new work favours UTF-8 semantics.

—⁂—

Overall, I'd say it's most reasonable to frame it this way:

① JavaScript models strings as potentially-ill-formed UTF-16. (I prefer the word "models" to the word "represents" here, because the latter suggests a specific storage, which is not actually necessary.)

② Old parts of JavaScript depend on indexing, and use potentially-ill-formed UTF-16 code unit semantics.

③ New parts of JavaScript avoid indexing, and use Unicode semantics.
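A short sketch of those points, including the ill-formed case and the UTF-8 leaning of encodeURIComponent (the `bad` binding is just an illustrative lone surrogate; comments show the expected output):

```js
// Well-formed text: "€" (U+20AC) encodes as three UTF-8 bytes.
console.log(encodeURIComponent("€")); // "%E2%82%AC"

// Ill-formed UTF-16: a lone high surrogate.
const bad = "\uD83D";

// codePointAt hands back the surrogate itself where the string is ill-formed.
console.log(bad.codePointAt(0).toString(16)); // "d83d"

// The UTF-8-based URI functions reject lone surrogates outright.
try {
  encodeURIComponent(bad);
} catch (e) {
  console.log(e instanceof URIError); // true ("URI malformed")
}

// Iteration is Unicode-y: a surrogate pair is one element, despite length 2.
console.log("😀".length);      // 2
console.log([..."😀"].length); // 1
```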