paulsutter 7 hours ago

UTF-8 solved this completely. It works with Unicode code points of any length and on average takes up almost as little storage as ASCII.
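The variable-width property being praised here can be sketched in a few lines of Python (stdlib only): ASCII stays one byte per character, while higher code points take two to four bytes, so mostly-ASCII text stays close to ASCII size.

```python
# UTF-8 is variable-width: 1 byte for ASCII, up to 4 bytes for the
# highest code points.
samples = {
    "A": 1,    # U+0041, plain ASCII
    "é": 2,    # U+00E9, Latin-1 range
    "€": 3,    # U+20AC, Basic Multilingual Plane
    "😀": 4,   # U+1F600, outside the BMP
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```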

UTF-16 is brain dead and an embarrassment.

wvenable 7 hours ago | parent | next [-]

Blame the Unicode consortium for not coming up with UTF-8 first (or, really, at all). And for assuming that 65,536 code points would be enough for everyone.

So many problems could be solved with a time machine.

kstrauser 6 hours ago | parent [-]

The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992; it was made an RFC in 1998. UTF-16 came along in 1996 and was made an RFC in 2000.

The time machine would've involved Microsoft saying "it's clear now that UCS-2 was a bad idea, so let's start migrating to something genuinely better".

wvenable 2 hours ago | parent | next [-]

I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 were well established by 2000, both in Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.

It's now 2026; everything always looks different in hindsight.

kstrauser 14 minutes ago | parent [-]

I don’t remember it quite that way. Localization was a giant question, sure. Are we using C or UTF-8 for the default locale? That had lots of screaming matches. But in the network service world, I don’t remember ever hearing more than a token resistance against choosing UTF-8 as the successor to ASCII. It was a huge win, especially since ASCII text is already valid UTF-8 text. Make your browser default to parsing docs with that encoding and you can still parse all existing ASCII docs with zero changes! That was a huge, enormous selling point.
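That "zero changes" selling point is easy to verify in Python: every ASCII byte sequence is already valid UTF-8, and decoding it as UTF-8 yields identical text.

```python
# All 128 ASCII code points, as raw bytes.
ascii_bytes = bytes(range(128))

# They decode as UTF-8 without error...
decoded = ascii_bytes.decode("utf-8")

# ...and round-trip back to the exact same bytes.
assert decoded.encode("ascii") == ascii_bytes
```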

Windows is far from a niche player, to be sure. Yet it seems like literally every other OS but them was going with one encoding for everything, while they went in a totally different direction that got complaints even then. I truly believe they thought they’d win that battle and eventually everyone else would move to UTF-16 to join them. Meanwhile, every other OS vendor was like, nah, no way we’re rewriting everything from scratch to work with a not-backward compatible encoding.

gpvos 3 hours ago | parent | prev [-]

MS could easily have added proper UTF-8 support in the early 2000s instead of the late 2010s.

kstrauser 3 hours ago | parent [-]

Yep. It would've been a better landing pad than UTF-16 since they had to migrate off UCS-2 anyway.

Dwedit 3 hours ago | parent | prev [-]

It gets worse for UTF-16: Windows will let you name files using unpaired surrogates, so you can have a filename on your disk that cannot be represented in UTF-8 (nor in compliant UTF-16, for that matter). Because of that, there's yet another encoding, called WTF-8, that can represent those arbitrary invalid 16-bit values.
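The unpaired-surrogate problem can be demonstrated in Python, whose `surrogatepass` error handler produces WTF-8-style bytes for lone surrogates (a close analogue, not an official WTF-8 implementation):

```python
# A lone (unpaired) high surrogate: legal as a Windows filename
# code unit, but with no valid encoding in strict UTF-8.
lone = "\ud800"

try:
    lone.encode("utf-8")
    raise AssertionError("strict UTF-8 must reject unpaired surrogates")
except UnicodeEncodeError:
    pass  # expected: strict UTF-8 refuses the lone surrogate

# "surrogatepass" encodes the surrogate as a 3-byte sequence,
# much like WTF-8, so the invalid code unit round-trips losslessly.
wtf8ish = lone.encode("utf-8", "surrogatepass")
assert wtf8ish == b"\xed\xa0\x80"
assert wtf8ish.decode("utf-8", "surrogatepass") == lone
```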