▲ | thristian a day ago | |
The fact that U+D800-U+DFFF are reserved means that it's generally pretty easy to distinguish UTF-16 text from UCS-2 text: if you spot even one 16-bit value in that reserved range, it should be UTF-16. This property is not true of UTF-8: if you get a byte-string with bytes between 0x80 and 0xFF, it might be UTF-8, or it might be one of a bunch of other encodings, and you need to do a more involved check to be sure. Granted, the presence of a value between 0xD800 and 0xDFFF does not guarantee that the text is well-formed UTF-16 (surrogates can show up unpaired), which is why this "WTF-8" encoding exists. But confusion would be a whole lot more likely if the U+D800-U+DFFF range were not reserved.
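To make the check concrete, here is a rough Python sketch of the kind of scan I mean. It assumes the input has already been split into 16-bit code units (integers), and the function name and the three result labels are made up for illustration, not taken from any library.

```python
def classify_code_units(units):
    """Classify a sequence of 16-bit code units by its surrogate usage.

    Returns one of three illustrative labels:
      "no-surrogates"       - nothing in 0xD800-0xDFFF; the data reads the
                              same whether you call it UCS-2 or UTF-16
      "well-formed-utf16"   - every high surrogate is followed by a low one
      "unpaired-surrogates" - a lone surrogate was found (the WTF-8 case)
    """
    units = list(units)
    saw_surrogate = False
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High (lead) surrogate: must be followed by a low surrogate.
            saw_surrogate = True
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # consume the whole pair
                continue
            return "unpaired-surrogates"
        if 0xDC00 <= u <= 0xDFFF:
            # Low (trail) surrogate with no preceding high surrogate.
            return "unpaired-surrogates"
        i += 1
    return "well-formed-utf16" if saw_surrogate else "no-surrogates"


if __name__ == "__main__":
    print(classify_code_units([0x0048, 0x0069]))  # plain BMP text
    print(classify_code_units([0xD83D, 0xDE00]))  # one surrogate pair
    print(classify_code_units([0xD800, 0x0041]))  # lone high surrogate
```

Note that this only works because the range is reserved: a single pass over the code units is enough, with no statistical guessing about which encoding the data might be in.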