osmsucks 5 days ago

JavaScript is: https://mathiasbynens.be/notes/javascript-encoding

demurgos 5 days ago | parent | next [-]

> The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.

The native JS semantics are UCS-2. Saying that it's UTF-16 is misleading and confuses charset, encoding and browser APIs.

Ladybird is probably implementing support properly but it's annoying that they keep spreading the confusion in their article.

dzaima 5 days ago | parent [-]

It's not cleanly one or the other, really. It's UCS-2-y by `str.length` or `str[i]`, but UTF-16-y by `str.codePointAt(i)` or by iteration (`[...str]` or `for (x of str)`).

Generally though JS's strings are just a list of 16-bit values, being intrinsically neither UCS-2 nor UTF-16. But, practically speaking, UTF-16 is the description that matters for everything other than writing `str.length`/`str[i]`.
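The split described above is easy to see with an astral character. A quick sketch, runnable in Node or a browser console:

```javascript
// "💩" (U+1F4A9) is one code point but two UTF-16 code units
// (the surrogate pair 0xD83D 0xDCA9).
const s = "💩";

// UCS-2-y views: count and index raw 16-bit units.
console.log(s.length);        // 2
console.log(s.charCodeAt(0)); // 55357 (0xD83D, a lone high surrogate)

// UTF-16-y views: decode surrogate pairs into code points.
console.log(s.codePointAt(0)); // 128169 (0x1F4A9)
console.log([...s].length);    // 1 (iteration walks code points)
```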

chrismorgan 3 days ago | parent [-]

• Regular indexing (also charAt and charCodeAt) is by UTF-16 code unit and produces UTF-16 code units.

• codePointAt is indexed by UTF-16 code unit, but produces Unicode code points (normally scalar values, but surrogates where ill-formed).

• String iteration doesn’t need indexing, and thus is Unicody, not UTF-16y.

• Approximately everything that JavaScript interacts with is actually UTF-8 now: URIs have long been UTF-8 (hence encodeURI/decodeURI/encodeURIComponent being UTF-8y).

• Where appropriate, new work favours UTF-8 semantics.
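The points above can be checked directly in any JS engine; a small sketch:

```javascript
// A lone surrogate (ill-formed UTF-16): codePointAt hands it back as-is.
const lone = "\uD83D";
console.log(lone.codePointAt(0).toString(16)); // "d83d"

// encodeURIComponent percent-encodes the UTF-8 bytes of the string:
// "é" (U+00E9) is 0xC3 0xA9 in UTF-8.
console.log(encodeURIComponent("é")); // "%C3%A9"

// Iteration yields whole code points, not code units.
console.log([..."a💩b"]); // [ "a", "💩", "b" ]
```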

—⁂—

Overall, I’d say it’s most reasonable to frame it this way:

① JavaScript models strings as potentially-ill-formed UTF-16. (I prefer the word “models” to the word “represents” here, because the latter suggests a specific storage, which is not actually necessary.)

② Old parts of JavaScript depend on indexing, and use potentially-ill-formed UTF-16 code unit semantics.

③ New parts of JavaScript avoid indexing, and use Unicode semantics.

grishka 5 days ago | parent | prev [-]

And most mainstream GUI toolkits are, as well. It can be said that UTF-16 is the de facto standard in-memory representation of Unicode strings, even though some runtimes (Rust) prefer UTF-8.

0points 4 days ago | parent [-]

> And most mainstream GUI toolkits are, as well.

No. Windows uses UTF-16 internally. Most GUI toolkits do not.

> It can be said that UTF-16 is the de-facto standard in-memory representation of unicode strings, even though some runtimes (Rust) prefer UTF-8.

No, that wouldn't be true at all.

Your technical perspective seems to be limited by your Windows experience, and even that is dated.

Microsoft has recommended UTF-8 over UTF-16 since 2019 [1].

1: https://learn.microsoft.com/en-us/windows/apps/design/global...

perching_aix 4 days ago | parent | next [-]

> Most GUI toolkits do not.

Why are you guys talking like there were dozens of GUI toolkits in mainstream use? It's basically web stuff, Qt, and then everything else. Web would be UTF-16 as discussed above, Qt is UTF-16, and even if we entertain the admittedly large, just behind-the-scenes Java/.NET market, that's also all UTF-16. wxWidgets, being a fence-sitter, can do both UTF-8 and UTF-16, depending on the platform.

Which players am I missing? GTK and ImGUI? I don't think they're too big a slice of this pie, certainly not big enough to invalidate the claim.

4 days ago | parent | next [-]
[deleted]
1718627440 3 days ago | parent | prev [-]

Anything that uses the C stdlib at some point.

grishka 4 days ago | parent | prev [-]

Apple also uses some kind of UTF-16 internally, afaik