paulsutter 7 hours ago

UTF-8 solved this completely. It works with Unicode code points of any length and on average takes up almost as little storage as ASCII.
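The variable-width property being praised here can be sketched in a few lines of Python (stdlib only): ASCII stays one byte per character, while higher code points take two to four bytes, so mostly-ASCII text stays close to ASCII size.

```python
# UTF-8 is variable-width: 1 byte for ASCII, up to 4 bytes for the
# highest code points.
samples = {
    "A": 1,    # U+0041, plain ASCII
    "é": 2,    # U+00E9, Latin-1 range
    "€": 3,    # U+20AC, Basic Multilingual Plane
    "😀": 4,   # U+1F600, outside the BMP
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```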

UTF-16 is brain dead and an embarrassment.

wvenable 7 hours ago | parent | next [-]

Blame the Unicode consortium for not coming up with UTF-8 first (or, really, at all). And for assuming that 65,536 code points would be enough for everyone.

So many problems could be solved with a time machine.

kstrauser 6 hours ago | parent [-]

The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992; it was made an RFC in 1998. UTF-16 came along in 1996 and was made an RFC in 2000.

The time machine would've involved Microsoft saying "it's clear now that UCS-2 was a bad idea, so let's start migrating to something genuinely better".

wvenable 2 hours ago | parent | next [-]

I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 were well established by 2000, both in Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.

It's now 2026; everything always looks different in hindsight.

kstrauser 14 minutes ago | parent [-]

I don’t remember it quite that way. Localization was a giant question, sure. Are we using C or UTF-8 for the default locale? That had lots of screaming matches. But in the network service world, I don’t remember ever hearing more than a token resistance against choosing UTF-8 as the successor to ASCII. It was a huge win, especially since ASCII text is already valid UTF-8 text. Make your browser default to parsing docs with that encoding and you can still parse all existing ASCII docs with zero changes! That was a huge, enormous selling point.
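That "zero changes" selling point is easy to verify in Python: every ASCII byte sequence is already valid UTF-8, and decoding it as UTF-8 yields identical text.

```python
# All 128 ASCII code points, as raw bytes.
ascii_bytes = bytes(range(128))

# They decode as UTF-8 without error...
decoded = ascii_bytes.decode("utf-8")

# ...and round-trip back to the exact same bytes.
assert decoded.encode("ascii") == ascii_bytes
```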

Windows is far from a niche player, to be sure. Yet it seems like literally every other OS but them was going with one encoding for everything, while they went in a totally different direction that got complaints even then. I truly believe they thought they’d win that battle and eventually everyone else would move to UTF-16 to join them. Meanwhile, every other OS vendor was like, nah, no way we’re rewriting everything from scratch to work with a not-backward compatible encoding.

gpvos 3 hours ago | parent | prev [-]

MS could easily have added proper UTF-8 support in the early 2000s instead of the late 2010s.

kstrauser 3 hours ago | parent [-]

Yep. It would've been a better landing pad than UTF-16 since they had to migrate off UCS-2 anyway.

Dwedit 3 hours ago | parent | prev [-]

It gets worse for UTF-16: Windows will let you name files using unpaired surrogates, so you can have a filename on your disk that cannot be represented in UTF-8 (nor in compliant UTF-16, for that matter). Because of that, there's yet another encoding, called WTF-8, that can represent those arbitrary invalid 16-bit values.
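The unpaired-surrogate problem can be demonstrated in Python, whose `surrogatepass` error handler produces WTF-8-style bytes for lone surrogates (a close analogue, not an official WTF-8 implementation):

```python
# A lone (unpaired) high surrogate: legal as a Windows filename
# code unit, but with no valid encoding in strict UTF-8.
lone = "\ud800"

try:
    lone.encode("utf-8")
    raise AssertionError("strict UTF-8 must reject unpaired surrogates")
except UnicodeEncodeError:
    pass  # expected: strict UTF-8 refuses the lone surrogate

# "surrogatepass" encodes the surrogate as a 3-byte sequence,
# much like WTF-8, so the invalid code unit round-trips losslessly.
wtf8ish = lone.encode("utf-8", "surrogatepass")
assert wtf8ish == b"\xed\xa0\x80"
assert wtf8ish.decode("utf-8", "surrogatepass") == lone
```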