Remix clone Hacker News


	▲	chrismorgan 3 days ago
		• Regular indexing (also charAt and charCodeAt) is by UTF-16 code unit and produces UTF-16 code units.
		• codePointAt is indexed by UTF-16 code unit, but produces Unicode code points (normally scalar values, but surrogates where ill-formed).
		• String iteration doesn’t need indexing, and thus is Unicody, not UTF-16y.
		• Approximately everything that JavaScript interacts with is actually UTF-8 now: URIs have long been UTF-8 (hence encodeURI/decodeURI/encodeURIComponent being UTF-8y).
		• Where appropriate, new work favours UTF-8 semantics.

		⁂

		Overall, I’d say it’s most reasonable to frame it this way:
		① JavaScript models strings as potentially-ill-formed UTF-16. (I prefer the word “models” to the word “represents” here, because the latter suggests a specific storage, which is not actually necessary.)
		② Old parts of JavaScript depend on indexing, and use potentially-ill-formed UTF-16 code unit semantics.
		③ New parts of JavaScript avoid indexing, and use Unicode semantics.
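		A rough sketch of the distinctions above, for anyone who wants to poke at them in a console. The sample string and the variable names are my own picks for illustration; the only APIs used are the standard ones named in the bullets.

```javascript
// "a😀b": U+1F600 lies outside the BMP, so it occupies two UTF-16 code units.
const s = "a\u{1F600}b";

// Indexing-based APIs count and return UTF-16 code units.
const length = s.length;           // 4 code units, not 3 characters
const unit1 = s.charCodeAt(1);     // 0xD83D, the high surrogate
const unit2 = s.charCodeAt(2);     // 0xDE00, the low surrogate

// codePointAt is indexed by code unit but returns a code point.
const cpAt1 = s.codePointAt(1);    // 0x1F600, the whole pair is decoded
const cpAt2 = s.codePointAt(2);    // 0xDE00, index lands mid-pair, so a bare surrogate

// String iteration needs no index and yields one code point per step.
const codePoints = [...s].map((ch) => ch.codePointAt(0)); // [0x61, 0x1F600, 0x62]

// The URI functions are UTF-8y: one percent-escape per UTF-8 byte.
const encoded = encodeURIComponent("\u{1F600}"); // "%F0%9F%98%80"

// Strings may be ill-formed UTF-16: a lone surrogate is a legal string value,
// but it has no UTF-8 form, so encodeURIComponent throws a URIError.
const lone = "\uD83D";
const loneCodePoint = lone.codePointAt(0); // 0xD83D
let loneEncodeError = null;
try {
  encodeURIComponent(lone);
} catch (err) {
  loneEncodeError = err;
}

console.log({ length, unit1, unit2, cpAt1, cpAt2, codePoints, encoded, loneCodePoint });
console.log(loneEncodeError instanceof URIError); // true
```

		The last part is the "potentially-ill-formed" bit in practice: the string type will happily hold a lone surrogate, and it only becomes a problem at a boundary that needs well-formed Unicode.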