amluto | 4 days ago
> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF. I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger UTF-8 code units.
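(For reference, a minimal sketch of the surrogate mechanism being described, using nothing beyond what UTF-16 itself defines; the function name is illustrative. A supplementary code point is split into two 10-bit halves, so surrogate pairs reach exactly 2^20 code points past the 2^16 of the BMP, which is where the U+10FFFF ceiling comes from:)

    def utf16_surrogate_pair(cp: int) -> tuple[int, int]:
        # Encode a supplementary code point (U+10000..U+10FFFF)
        # as a UTF-16 high/low surrogate pair.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000               # 20 bits of payload
        high = 0xD800 + (v >> 10)      # top 10 bits
        low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits
        return high, low

    print([hex(u) for u in utf16_surrogate_pair(0x10FFFF)])
    # ['0xdbff', '0xdfff'] -- the very last expressible code point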
mort96 | 4 days ago
That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8, that made the sacrifice: Unicode is limited to 21 bits due to UTF-16. The UTF-8 design trivially extends to six-byte sequences covering up to 31-bit values. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?
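(For concreteness, a sketch of that six-byte extension; the function name is ours and the code is illustrative, not a real codec. Real UTF-8 per RFC 3629 stops at four bytes / U+10FFFF:)

    def utf8_encode_original(cp: int) -> bytes:
        # Thompson's original FSS-UTF pattern: up to six bytes,
        # values up to 2**31 - 1.
        assert 0 <= cp < 2**31
        if cp < 0x80:
            return bytes([cp])
        for n, marker, limit in [
            (1, 0xC0, 0x800),         # 110xxxxx + 1 continuation byte
            (2, 0xE0, 0x10000),       # 1110xxxx + 2
            (3, 0xF0, 0x200000),      # 11110xxx + 3
            (4, 0xF8, 0x4000000),     # 111110xx + 4
            (5, 0xFC, 0x80000000),    # 1111110x + 5
        ]:
            if cp < limit:
                out = [marker | (cp >> (6 * n))]
                for i in range(n - 1, -1, -1):
                    out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
                return bytes(out)

    print(utf8_encode_original(0x7FFFFFFF).hex())  # fdbfbfbfbfbf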

throw0101d | 4 days ago
> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF.

Yes, it is 'truncated' to the "UTF-16 accessible range":

* https://datatracker.ietf.org/doc/html/rfc3629#section-3

* https://en.wikipedia.org/wiki/UTF-8#History

Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:
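(The pattern table that sentence points to, as given in the Wikipedia article linked above:)

    0x00000000 - 0x0000007F   0xxxxxxx
    0x00000080 - 0x000007FF   110xxxxx 10xxxxxx
    0x00000800 - 0x0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
    0x00010000 - 0x001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x00200000 - 0x03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x04000000 - 0x7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx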

Analemma_ | 4 days ago
It's always dangerous to stick one's neck out and say "[this many bits] ought to be enough for anybody", but I think it's very unlikely we'll ever run out of UTF-8 sequences. UTF-8 can represent about 1.1 million code points, of which we've assigned about 160,000 actual characters, plus another ~140,000 in the Private Use Areas, which won't expand. And that's after covering nearly all of the world's known writing systems: the last several Unicode updates have added a few thousand characters here and there for very obscure and/or ancient writing systems, but those won't go on forever (and things like emoji really only get a handful of new code points per update, because most new emoji are existing code points with combining characters). If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.
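(A rough tally of that headroom, using the comment's own approximation for assigned characters; the private-use figure is the exact size of the three Private Use Areas:)

    total      = 0x110000           # 1,114,112 code points, U+0000..U+10FFFF
    surrogates = 0x800              # 2,048 surrogates, never characters
    pua        = 6400 + 2 * 65534   # 137,468 private-use code points
    assigned   = 160_000            # approximate, per the comment above
    print(total - surrogates - pua - assigned)   # ~814,000 still unassigned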

cryptonector | 4 days ago
> It sacrifices the ability to encode more than 21 bits

No, UTF-8's design can encode code points of up to 31 bits. The limitation to 21 bits comes from UTF-16, and that limit was then adopted for UTF-8 too. When UTF-16 dies we'll be able to extend UTF-8 (well, compatibility with existing implementations will be a problem).
layer8 | 4 days ago
That limitation will be trivial to lift once UTF-16 compatibility can be disregarded. This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or a thousand years. Until then, we still have a lot of unassigned code points. In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.
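(Purely illustrative arithmetic for that last idea; nothing like this exists in Unicode, and the reserved ranges are hypothetical. Reserving two 1,024-code-point blocks of currently unassigned code points as second-level surrogates would, by the same math as today's D800..DFFF pairs, address another 1024 * 1024 code points per nesting level:)

    block = 1024
    extra = block * block            # 1,048,576 additional code points
    print(hex(0x10FFFF + extra))     # hypothetical new ceiling: 0x20ffff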
1oooqooq | 4 days ago
The limitation tomorrow will be today's implementations, sadly.