happytoexplain 4 days ago

I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

amluto 4 days ago | parent | next [-]

> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF, just under 2^21.

I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow longer UTF-8 sequences.
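
For reference, the surrogate arithmetic is just two 10-bit halves layered on top of the BMP, which is where that ceiling comes from. A minimal Python sketch:

    # Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16 surrogate pair.
    def to_surrogate_pair(cp):
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000              # 20 bits remain after subtracting the BMP offset
        high = 0xD800 + (v >> 10)     # top 10 bits -> high surrogate
        low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> low surrogate
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']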

mort96 4 days ago | parent | next [-]

That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8, that made the sacrifice: Unicode is limited to 21 bits because of UTF-16. The UTF-8 design trivially extends to 6-byte sequences covering up to 31-bit numbers. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?
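
For what it's worth, that extension is easy to sketch: a lead byte with N high bits set introduces N-1 continuation bytes, and the pattern stretches naturally to 6 bytes and 31 bits. A rough Python illustration of the layout (not something conforming encoders accept today):

    # Thompson-style UTF-8 layout, allowing up to 6 bytes / 31 bits.
    # Modern UTF-8 stops at 4 bytes and U+10FFFF; this only shows the pattern.
    def utf8_encode_extended(cp):
        if cp < 0x80:
            return bytes([cp])
        # (max value, lead-byte marker) for 2..6 byte sequences
        for nbytes, (limit, lead) in enumerate(
                [(0x7FF, 0xC0), (0xFFFF, 0xE0), (0x1FFFFF, 0xF0),
                 (0x3FFFFFF, 0xF8), (0x7FFFFFFF, 0xFC)], start=2):
            if cp <= limit:
                out = []
                for _ in range(nbytes - 1):      # continuation bytes, low bits first
                    out.append(0x80 | (cp & 0x3F))
                    cp >>= 6
                out.append(lead | cp)            # lead byte takes the leftover high bits
                return bytes(reversed(out))
        raise ValueError("needs more than 31 bits")

    print(utf8_encode_extended(0x20AC).hex())      # e282ac (regular 3-byte UTF-8)
    print(utf8_encode_extended(0x7FFFFFFF).hex())  # fdbfbfbfbfbf (6-byte form)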

MyOutfitIsVague 4 days ago | parent | next [-]

In an ideal future (read: fantasy), utf-16 gets formally deprecated and trashed, freeing the surrogate sequences and full range for utf-8.

Or utf-16 is officially considered a second class citizen, and some code points are simply out of its reach.

GuB-42 4 days ago | parent | prev [-]

Is 21 bits really a sacrifice? It's 2 million codepoints, and we currently use about a tenth of that.

Even with all Chinese characters, de-unified, all the notable historical and constructed scripts, technical symbols, and all the submitted emoji, including rejections, you are still way short of a million.

We will probably never need more than 21 bits unless we start stretching the definition of what text is.

moefh 4 days ago | parent [-]

It's not 2 million, it's a little over 1 million.

The exact number is 1112064 = 2^16 - 2048 + 16*2^16: in UTF-16, 2 bytes can encode 2^16 - 2048 code points, and 4 bytes can encode 16*2^16 (the 2048 surrogates are not counted because they can never appear by themselves, they're used purely for UTF-16 encoding).
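
The same arithmetic in Python, for anyone who wants to check:

    bmp = 2**16 - 2048          # BMP code points, minus the 2048 surrogates
    supplementary = 16 * 2**16  # planes 1-16, reachable only via surrogate pairs
    print(bmp + supplementary)  # 1112064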

chuckadams 3 days ago | parent [-]

Even with just 1 million codepoints, why did they feel the need for CJK unification? Was it so it would all fit in UCS-2 or something?

rwallace 3 days ago | parent [-]

Yes, that was exactly the reason. CJK unification happened during the few years when we were all trying to convince ourselves that 16 bits would be enough. By the time we acknowledged otherwise, it was too late.

throw0101d 4 days ago | parent | prev | next [-]

> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF, just under 2^21.

Yes, it is 'truncated' to the "UTF-16 accessible range":

* https://datatracker.ietf.org/doc/html/rfc3629#section-3

* https://en.wikipedia.org/wiki/UTF-8#History

Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:

* https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

gpvos 4 days ago | parent [-]

You could even extend UTF-8 to make 0xFE and 0xFF valid starting bytes, with 6 and 7 following bytes respectively, and get 42 bits of space. I seem to remember Perl allowed that for a while in its v-strings notation.

Edit: I just tested this, and Perl still allows it, but with an extra twist: v-notation goes up to 2^63-1. From 2^31 to 2^36-1 is encoded as FE + 6 bytes, and everything above that is encoded as FF + 12 bytes; the largest value it allows is v9223372036854775807, which is encoded as FF 80 87 BF BF BF BF BF BF BF BF BF BF. It probably doesn't allow that one extra bit because v-notation doesn't work with negative integers.
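
To make the bit layout concrete, here is a rough Python sketch of decoding the scheme as described above (my reading of the layout only, not Perl's actual implementation): FE introduces 6 continuation bytes and FF introduces 12, each contributing its low 6 bits.

    # Decode one extended value: FE + 6 continuation bytes, or FF + 12.
    # A sketch of the layout described above, not Perl's code.
    def decode_extended(data):
        lead = data[0]
        if lead == 0xFF:
            ncont = 12
        elif lead == 0xFE:
            ncont = 6
        else:
            raise ValueError("only the FE/FF forms are handled here")
        value = 0
        for b in data[1:1 + ncont]:
            assert b & 0xC0 == 0x80           # continuation bytes look like 10xxxxxx
            value = (value << 6) | (b & 0x3F)
        return value

    seq = bytes.fromhex("ff8087" + "bf" * 10)
    print(decode_extended(seq))  # 9223372036854775807 == 2**63 - 1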

Analemma_ 4 days ago | parent | prev | next [-]

It's always dangerous to stick one's neck out and say "[this many bits] ought to be enough for anybody", but I think it's very unlikely we'll ever run out of UTF-8 sequences. UTF-8 can represent about 1.1 million code points, of which we've assigned about 160,000 actual characters, plus another ~140,000 in the Private Use Area, which won't expand. And that's after covering nearly all of the world's known writing systems: the last several Unicode updates have added a few thousand characters here and there for very obscure and/or ancient writing systems, but those won't go on forever (and things like emoji really only get a handful of new code points per update, because most new emoji are existing code points with combining characters).

If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.

lyu07282 4 days ago | parent [-]

The oldest script in Unicode, Sumerian cuneiform, is ~5,200 years old. If we were to invent new scripts at the same rate, we would hit 1.1 million code points in around 31,000 years. So yeah, nothing to worry about; you are absolutely right. Unless we join some intergalactic federation of planets, although they probably already have their own encoding standards we could just adopt.
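
The back-of-envelope version, using the ~160,000 assigned characters mentioned above as the assumed rate basis:

    assigned = 160_000                    # rough count of assigned characters today
    years = 5_200                         # age of the oldest encoded script
    rate = assigned / years               # about 31 code points per year
    print((1_112_064 - assigned) / rate)  # roughly 31,000 years to fill the rest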

cryptonector 4 days ago | parent | prev | next [-]

> It sacrifices the ability to encode more than 21 bits

No, UTF-8’s design can encode code points up to 31 bits wide. The limitation to 21 bits comes from UTF-16, and was then adopted for UTF-8 too. When UTF-16 dies we'll be able to extend UTF-8 (well, compatibility will be a problem).

layer8 4 days ago | parent | prev | next [-]

That limitation will be trivial to lift once UTF-16 compatibility can be disregarded. This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or thousand years. Until then, we still have a lot of unassigned code points.

In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.

1oooqooq 4 days ago | parent | prev [-]

the limitation tomorrow will be today's implementations, sadly.

procaryote 4 days ago | parent | prev | next [-]

> I love when an entity in a position of power is willing to break things in the name of advancement.

It's less fun when things that need to keep working break because someone felt like renaming a parameter, or decided that a part of the standard library looked "untidy".

happytoexplain 4 days ago | parent [-]

I agree! And yet I lovingly sacrifice my man-hours to it when I decide to bump that major version number in my dependency manifest.

account42 2 days ago | parent | next [-]

The key words here being "I decide". I'm going to express a lot less love when someone else decides.

procaryote 4 days ago | parent | prev [-]

Or minor versions of Python...

Honestly, Python is probably one of the worst offenders here, as they combine happily making breaking changes for low-value rearranging of deck chairs with a dynamic language where you might only find out at runtime.

The fact that they've also decided to use an unconventional interpretation of minor versions shows how little they care.

chuckadams 3 days ago | parent [-]

The term "semantic versioning" didn't even exist until 2010, which is well after the birth of Python. Sure, it semi-formalized a convention from long before, but it was hardly universal.

account42 2 days ago | parent | next [-]

The ideals behind semantic versioning existed long before the marketing term.

procaryote 3 days ago | parent | prev [-]

They of course get to break their thing however much they like, but it sure sucks

cryptonector 4 days ago | parent | prev | next [-]

> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

There were apps that completely rejected non-7-bit data back in the day. Backwards compatibility wasn't the only point. The point of UTF-8 is more (IMO) that UTF-32 is too bulky, UCS-2 was insufficient, UTF-16 was an abortion, and only UTF-8 could have the right trade-offs.

mort96 4 days ago | parent | prev [-]

Yeah, I honestly don't know what I would change. Maybe replace some of the control characters with more common characters to save a tiny bit of space, if we were to go completely wild and break Unicode backward compatibility too. As a generic multi-byte character encoding format, it seems completely optimal even in isolation.