I once saw a good byte encoding for Unicode: in each byte, 7 bits for data and 1 for continuation/stop. Three bytes give 21 bits of data, which is enough for the whole range. ASCII compatible, at most 3 bytes per character. Very simple: the description is sufficient to implement it.
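
To show how little code the description needs, here is a minimal Python sketch. The names `encode_vlq` and `decode_vlq` are made up for illustration, and two details are assumptions the description leaves open: the most significant 7-bit group goes first, and a set high bit means "more bytes follow".

```python
def encode_vlq(code_point: int) -> bytes:
    """Encode one code point as 7-bit groups, most significant group first.

    Assumption: high bit set = "continue", high bit clear = "stop".
    """
    if not 0 <= code_point < (1 << 21):
        raise ValueError("code point does not fit in 21 bits")
    groups = [code_point & 0x7F]               # final byte: high bit clear
    code_point >>= 7
    while code_point:
        groups.append((code_point & 0x7F) | 0x80)  # earlier bytes: high bit set
        code_point >>= 7
    return bytes(reversed(groups))


def decode_vlq(data: bytes) -> list:
    """Decode a byte string made of encode_vlq outputs into code points."""
    out = []
    value = 0
    pending = False                            # True while inside a character
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        pending = True
        if not byte & 0x80:                    # stop bit: character complete
            out.append(value)
            value = 0
            pending = False
    if pending:
        raise ValueError("truncated sequence")
    return out
```

With these assumptions, `encode_vlq(0x41)` is the single byte `b"A"` and `encode_vlq(0x10FFFF).hex()` is `'c3ff7f'`, three bytes. One trade-off compared with UTF-8: the final byte of a multi-byte character looks like plain ASCII (U+00E9 encodes as `81 69`, and `0x69` is `i`), so a naive byte search for an ASCII character can match inside another character.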

rmunn 4 days ago | parent | next [-]

Probably a good idea, but when UTF-8 was designed the Unicode committee had not yet made the mistake of limiting the character range to 21 bits. (Going into why it's a mistake would make this comment longer than it's worth, so I'll only expound on it if anyone asks me to). And at this point it would be a bad idea to switch away from the format that is now, finally, used in over 99% of all documents online. The gain would be small (not zero, but small) and the cost would be immense.

int_19h 3 days ago | parent [-]

Didn't they limit the range to 21 bits because UTF-16 has that limitation?

rmunn 2 days ago | parent [-]

That is indeed why they limited it, but that was a mistake. I want to call UTF-16 a mistake all on its own, but since its 16-bit predecessor UCS-2 predated UTF-8, I can't entirely do so. But limiting the Unicode range to only what's allowed in UTF-16 was shortsighted. They should, instead, have allowed UTF-8 to continue to address 31 bits, and if the standard grew past 21 bits, then UTF-16 would be deprecated. (Going into depth would take an essay, and at this point nobody cares about hearing it, so I'll refrain.)

gpvos 2 days ago | parent [-]

I suppose it's still possible to extend to 31 bits in the future, once UTF-16 has become obsolete enough. How big is the need for it right now?

rmunn a day ago | parent | next [-]

Interestingly, in theory UTF-8 could be extended to 36 bits: the FLAC format uses an encoding similar to UTF-8 but extended to allow up to 36 bits (which takes seven bytes) to encode frame numbers: https://www.ietf.org/rfc/rfc9639.html#section-9.1.5

This means that the number coded in a FLAC frame header can go up to 2^36-1 = 68,719,476,735. Frame numbers proper are capped at 31 bits; the full 36 bits are used in variable-block-size streams, where the field holds the sample number of the frame's first sample. At a 48kHz sample rate, i.e. 48,000 samples per second, such a FLAC file can (in theory) be about 1.43 million seconds long, or roughly 16.6 days.

So if Unicode ever needs to encode 68.7 billion characters, well, extended seven-byte UTF-8 will be ready and waiting. :-D
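
For the curious, a rough Python sketch of what such a seven-byte extension could look like. It simply continues the UTF-8 bit pattern with the lead bytes `0xF8`, `0xFC`, and `0xFE`; the function name `encode_extended_utf8` is made up for illustration, and the sketch has not been checked against an actual FLAC implementation.

```python
def encode_extended_utf8(value: int) -> bytes:
    """Encode an integer below 2**36 in a UTF-8-style variable-length code.

    Sketch only: it follows the UTF-8 lead/continuation bit pattern and
    extends it to 5-, 6-, and 7-byte forms. Unlike real UTF-8 it does not
    reject surrogates or values above U+10FFFF.
    """
    if not 0 <= value < (1 << 36):
        raise ValueError("value does not fit in 36 bits")
    if value < 0x80:
        return bytes([value])                  # plain ASCII, one byte

    # n continuation bytes carry 6 bits each; the lead byte carries 6 - n bits.
    for n in range(1, 7):
        if value < (1 << (6 * n + (6 - n))):
            break

    lead_prefix = (0xFF << (7 - n)) & 0xFF     # n + 1 one-bits, then a zero
    encoded = [lead_prefix | (value >> (6 * n))]
    for i in range(n - 1, -1, -1):
        encoded.append(0x80 | ((value >> (6 * i)) & 0x3F))
    return bytes(encoded)
```

For values that are valid code points this matches ordinary UTF-8, e.g. `encode_extended_utf8(0x20AC) == "€".encode("utf-8")`. The largest value, `2**36 - 1`, comes out as `0xFE` followed by six `0xBF` bytes: the seven-byte lead byte has no room left for data bits, so all 36 bits sit in the continuation bytes.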

gpvos 15 hours ago | parent [-]

See my comment on how Perl stores up to 2^63-1 in a UTF-8-like format: https://news.ycombinator.com/item?id=45227396 .

account42 a day ago | parent | prev [-]

The problem is that now there are a bunch of UTF-8 tools that won't handle code points beyond 21 bits.

gpvos a day ago | parent [-]

Fair enough, it will take some time to weed those out.

restalis 3 days ago | parent | prev [-]

This fits your description: https://en.wikipedia.org/wiki/Variable-length_quantity