agus4nas 2 hours ago

Great write-up. Do most modern languages handle invalid surrogates gracefully, or is it still a "good luck" situation depending on the runtime?

amluto 2 hours ago | parent | next [-]

Modern string libraries largely use UTF-8 [0], and surrogates, regardless of whether they’re paired, are invalid in UTF-8. So, in a modern string library, as built into most modern languages, you will not encounter surrogates except when translating between encodings.

[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.

chuckadams an hour ago | parent [-]

> surrogates, regardless of whether they’re paired, are invalid in UTF-8

Java did not get the memo. Since the `char` type is fixed at 16 bits, strings are sequences of UTF-16 code units, everything outside the BMP is encoded as a surrogate pair, and nothing stops an unpaired surrogate from ending up in a String.

georgemandis 2 hours ago | parent | prev [-]

The language handled it fine. It will generally just show replacement characters (�) for combos that don't map to anything.

It was really `encodeURIComponent` that didn't handle it gracefully.

If you just type this into the console (surrogate pair for cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):

encodeURIComponent("\uD83E\uDD20")

If you give it an invalid surrogate pair, it will throw an actual error:

encodeURIComponent("\uDD20\uD83E")

chrismorgan an hour ago | parent [-]

No, the language did not handle it fine: it allowed an invalid Unicode string to exist. This is basically a UTF-16 affliction; nothing that does UTF-16 validates, whereas almost everything that does UTF-8 does validate. encodeURIComponent percent-encodes UTF-8, so of course it throws.

georgemandis 3 minutes ago | parent [-]

I'm realizing `encodeURIComponent` is actually part of the ECMA spec! I thought it was something provided by the browser like `window` or `document`. I withdraw my "the language handled it fine" comment, haha.

Before I'd looked that up I was going to say: "don't allow an invalid Unicode string to exist at all" feels like a separate/bigger problem to me from "handling it fine" when they do get created. To the extent I can hand JavaScript an invalid combination of code units in a variety of other scenarios, returning a � felt fine.

e.g.

// valid
String.fromCodePoint(0xd83e, 0xdd20)

// invalid, but "�" is ... fine?
String.fromCodePoint(0xdd20, 0xd83e)