agus4nas 2 hours ago
| Great write-up. Do most modern languages handle invalid surrogates gracefully, or is it still a "good luck" situation depending on the runtime? |
|
amluto 2 hours ago | parent
Modern string libraries largely use UTF-8 [0], and surrogates, regardless of whether they’re paired, are invalid in UTF-8. So, in a modern string library, like those built into most modern languages, you will not encounter surrogates except when translating between encodings.

[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.
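
One way to see that UTF-8 strictness from a JavaScript console: TextDecoder (available in browsers and Node) refuses the three bytes a lone surrogate would occupy when run in fatal mode. A minimal sketch:

    // UTF-8-shaped bytes for the lone surrogate U+D83E: ED A0 BE
    const bytes = new Uint8Array([0xed, 0xa0, 0xbe]);

    new TextDecoder("utf-8").decode(bytes);
    // "���" (lossy: each rejected byte becomes U+FFFD)

    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    // throws TypeError: the sequence is not valid UTF-8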

chuckadams an hour ago | parent

> surrogates, regardless of whether they’re paired, are invalid in UTF-8

Java did not get the memo. Since the char type is fixed at 16 bits, it uses surrogates to encode everything outside the BMP, regardless of the encoding.
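
JavaScript strings carry the same 16-bit legacy, which is easy to confirm in a console; for instance:

    // Astral characters occupy two UTF-16 code units, just as in Java
    const cowboy = "🤠"; // U+1F920, outside the BMP

    cowboy.length;                     // 2 (code units, not characters)
    cowboy.charCodeAt(0).toString(16); // "d83e" (high surrogate)
    cowboy.charCodeAt(1).toString(16); // "dd20" (low surrogate)
    [...cowboy].length;                // 1 (iteration walks code points)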
|
|
georgemandis 2 hours ago | parent
The language handled it fine. It will generally just show replacement characters (�) for combos that don't map to anything. It was really `encodeURIComponent` that didn't handle it gracefully.

If you just type this into the console (surrogate pair for the cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):

    encodeURIComponent("\uD83E\uDD20")

If you give it an invalid surrogate pair, it will throw an actual error:

    encodeURIComponent("\uDD20\uD83E")
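
For anyone who wants to handle that case, the throw is a URIError, so it can be caught like any other exception; a small sketch (the message text varies by engine):

    try {
      encodeURIComponent("\uDD20\uD83E");
    } catch (e) {
      console.log(e instanceof URIError); // true ("URI malformed" in V8)
    }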

chrismorgan an hour ago | parent

No, the language did not handle it fine. It allowed an invalid Unicode string to exist. This is basically a UTF-16 affliction: nothing that does UTF-16 validates, whereas almost everything that does UTF-8 does validate. encodeURIComponent deals with UTF-8, so of course it throws.
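
Worth noting that ES2024 added explicit well-formedness helpers for exactly this gap; a sketch, assuming an engine recent enough to ship String.prototype.isWellFormed:

    "\uD83E\uDD20".isWellFormed(); // true  (proper surrogate pair)
    "\uDD20\uD83E".isWellFormed(); // false (reversed, so both surrogates are lone)

    // toWellFormed() swaps lone surrogates for U+FFFD, after which encoding is safe
    encodeURIComponent("\uDD20\uD83E".toWellFormed()); // "%EF%BF%BD%EF%BF%BD"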

georgemandis 3 minutes ago | parent

I'm realizing `encodeURIComponent` is actually part of the ECMA spec! I thought it was something provided by the browser like `window` or `document`. I withdraw my "the language handled it fine" comment, haha.

Before I'd looked that up I was going to say: "don't allow an invalid Unicode string to exist at all" feels like a separate/bigger problem to me from "handling it fine" when one does get created. To the extent I can hand JavaScript an invalid combination of code units in a variety of other scenarios, returning a � felt fine. e.g.
    // valid
    String.fromCodePoint(0xd83e, 0xdd20)

    // invalid, but "�" is ... fine?
    String.fromCodePoint(0xdd20, 0xd83e)
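
Other paths behave the same way, substituting U+FFFD or escaping rather than throwing; a sketch, assuming a WHATWG TextEncoder is available (browsers, Node, Deno):

    const bad = String.fromCodePoint(0xdd20, 0xd83e);

    JSON.stringify(bad);
    // '"\\udd20\\ud83e"' (lone surrogates escaped, per well-formed JSON.stringify)

    new TextEncoder().encode(bad);
    // Uint8Array [239, 191, 189, 239, 191, 189], i.e. two U+FFFD replacements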