| ▲ | paulsutter 7 hours ago |
| UTF-8 solved this completely. It encodes every Unicode code point and, for mostly-ASCII text, takes up almost as little storage as ASCII. UTF-16 is brain-dead and an embarrassment |
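The variable-length property behind this claim is easy to demonstrate; a small Python sketch (the sample characters are just illustrative):

```python
# UTF-8 uses 1 byte for ASCII and up to 4 bytes for the highest code
# points, so mostly-ASCII text stays close to ASCII-sized.
samples = ["A", "é", "€", "😀"]  # U+0041, U+00E9, U+20AC, U+1F600
for ch in samples:
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")

# By contrast, UTF-16 needs 2 bytes even for ASCII, and 4 bytes (a
# surrogate pair) for anything outside the Basic Multilingual Plane.
print(len("A".encode("utf-16-le")), len("😀".encode("utf-16-le")))  # 2 4
```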
|
| ▲ | wvenable 7 hours ago | parent | next [-] |
| Blame the Unicode consortium for not coming up with UTF-8 first (or, really, at all). And for assuming that 65,536 code points would be enough for everyone. So many problems could be solved with a time machine. |
| |
| ▲ | kstrauser 6 hours ago | parent [-] | | The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992, made an RFC in 1998. UTF-16 came along in 1996, made an RFC in 2000. The time machine would've involved Microsoft saying "it's clear now that UCS-2 was a bad idea, so let's start migrating to something genuinely better". | | |
| ▲ | wvenable 2 hours ago | parent | next [-] | | I don't think it was clear at the time that UTF-8 would take off. UCS-2, and then UTF-16, were well established by 2000 in both Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone. It's now 2026; everything always looks different in hindsight. | | |
| ▲ | kstrauser 14 minutes ago | parent [-] | | I don’t remember it quite that way. Localization was a giant question, sure. Are we using C or UTF-8 for the default locale? That had lots of screaming matches. But in the network service world, I don’t remember ever hearing more than token resistance against choosing UTF-8 as the successor to ASCII. It was a huge win, especially since ASCII text is already valid UTF-8 text. Make your browser default to parsing docs with that encoding and you can still parse all existing ASCII docs with zero changes! That was a huge, enormous selling point. Windows is far from a niche player, to be sure. Yet it seems like literally every other OS but theirs was going with one encoding for everything, while they went in a totally different direction that got complaints even then. I truly believe they thought they’d win that battle and eventually everyone else would move to UTF-16 to join them. Meanwhile, every other OS vendor was like, nah, no way we’re rewriting everything from scratch to work with a non-backward-compatible encoding. |
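The ASCII-compatibility point is easy to verify; a minimal Python sketch:

```python
# Every ASCII byte sequence decodes identically under ASCII and UTF-8,
# so a UTF-8 parser accepts all legacy ASCII documents unchanged.
legacy = b"GET /index.html HTTP/1.0\r\n"
assert legacy.decode("ascii") == legacy.decode("utf-8")

# Non-ASCII characters always encode to bytes with the high bit set, so
# multi-byte UTF-8 sequences can never be mistaken for 7-bit ASCII bytes.
assert "é".encode("utf-8") == b"\xc3\xa9"
assert all(b >= 0x80 for b in "é".encode("utf-8"))
```

This is the property UTF-16 lacks: encoding even plain ASCII text as UTF-16 changes every byte, which is why adopting it meant rewriting everything.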
| |
| ▲ | gpvos 3 hours ago | parent | prev [-] | | MS could easily have added proper UTF-8 support in the early 2000s instead of the late 2010s. | | |
| ▲ | kstrauser 3 hours ago | parent [-] | | Yep. It would've been a better landing pad than UTF-16 since they had to migrate off UCS-2 anyway. |
|
| ▲ | Dwedit 3 hours ago | parent | prev [-] |
| It gets worse for UTF-16: Windows will let you name files using unpaired surrogates, so now you have a filename that exists on your disk that cannot be represented in UTF-8 (nor in well-formed UTF-16, for that matter). Because of that, there's yet another encoding, called WTF-8, that can represent arbitrary ill-formed sequences of 16-bit values. |
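Python happens to expose this directly: strict UTF-8 rejects a lone surrogate, while the `surrogatepass` error handler produces the same bytes WTF-8 assigns to it. A sketch (the filename is hypothetical, for illustration only):

```python
# An unpaired high surrogate, as Windows permits in filenames.
s = "file_\ud800"

# Strict UTF-8 refuses to encode lone surrogates.
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 rejects the lone surrogate")

# The "surrogatepass" error handler yields the same byte sequence WTF-8
# uses for a lone surrogate (a "generalized UTF-8" encoding).
wtf8 = s.encode("utf-8", "surrogatepass")
print(wtf8)  # b'file_\xed\xa0\x80'

# It round-trips losslessly, which is WTF-8's whole purpose.
assert wtf8.decode("utf-8", "surrogatepass") == s
```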