Aardwolf a day ago

Why do surrogates even exist? UTF-8 is a code to represent roughly 21-bit integers, and UTF-16 is another code that represents roughly 21-bit integers.

Somehow UTF-16 reserves some of those decoded integer values, instead of solving whatever problem it had within the encoding itself.

The fact that UTF-8 didn't need to also destroy some output integer values to work proves it's not necessary to do that

Encoding and decoded value should be separate concerns

It's as if you had an encoding of integers similar to base 10, but for some reason decided that the integer values 100 to 110 are reserved and may never be used by anyone, not even by other legitimate encodings such as regular base 10.
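For context, here is a rough sketch of the arithmetic behind the reservation being discussed: UTF-16 represents a code point above U+FFFF as two 16-bit units taken from the U+D800-U+DFFF range. The function name `utf16_surrogate_pair` is made up for illustration; the cross-check uses Python's built-in `utf-16-be` codec.

```python
def utf16_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use a surrogate pair")
    v = cp - 0x10000              # 20 bits remain after subtracting the offset
    high = 0xD800 | (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


cp = 0x1F600
high, low = utf16_surrogate_pair(cp)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = chr(cp).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")

# Python's strict UTF-8 encoder refuses a lone surrogate code point.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```

Because the pair values double as the "this is half of a pair" signal, those 2048 values cannot also stand for characters of their own, which is the trade-off the comment above objects to.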

thristian a day ago | parent | next [-]

The fact that U+D800-U+DFFF are reserved means that it's generally pretty easy to distinguish UTF-16 text from UCS-2 text - if you spot even one 16-bit value in that reserved range, it should be UTF-16.

This property is not true of UTF-8: if you get a byte string with bytes between 0x80 and 0xFF, it might be UTF-8, or it might be one of a bunch of other encodings, so you need to do a more involved check to be sure.

Granted, the presence of a value between 0xD800 and 0xDFFF does not guarantee that the text is well-formed UTF-16; that's why this "WTF-8" encoding exists. But confusion would be a whole lot more likely if the U+D800-U+DFFF range were not reserved.
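A minimal sketch of the check described above, assuming the input has already been split into 16-bit units. The function name `classify_units` and its three result strings are hypothetical, chosen only for this example.

```python
def classify_units(units: list[int]) -> str:
    """Classify a sequence of 16-bit code units by its use of surrogates."""
    saw_pair = False
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                saw_pair = True
                i += 2
                continue
            return "unpaired surrogate"
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return "unpaired surrogate"
        i += 1
    return "surrogate pairs" if saw_pair else "no surrogates"
```

With no surrogates the text reads the same as UCS-2 or UTF-16; with properly paired surrogates it is UTF-16; an unpaired surrogate is the ill-formed case that WTF-8 is meant to carry.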

Leszek a day ago | parent | prev | next [-]

Because Unicode 1.0 had already defined characters both at the start and end of the 16-bit range (https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pd...), and UCS-2/UTF-16 had to be compatible with that.

manwe150 a day ago | parent | prev [-]

UTF-8 has many similar problems with malformed sequences, such as overlong encodings. A scheme similar to this one is necessary if you want to handle arbitrary bytes as almost-UTF-8, instead of treating them as an inaccurate Latin-1, as is commonly done. (For a reference point, the Julia language's basic String type has this ability.)
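As an illustration of the same idea in another language, Python's `surrogateescape` error handler is one such scheme (it is not Julia's mechanism and not WTF-8): undecodable bytes are mapped to lone low surrogates so the original bytes can be recovered. A small sketch, where the sample bytes and the helper name `is_strict_utf8` are made up for the example:

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# b"\xc0\xaf" is an overlong encoding of "/"; b"\xff" never appears in UTF-8.
raw = b"ok \xc0\xaf \xff"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The resulting `str` contains lone surrogates, so it is not valid Unicode text on its own; it only round-trips when encoded with the same error handler.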