| ▲ | chrismorgan an hour ago | |
A CRDT library working at the code unit level? Ouch. Of course that’s going to go wrong, it was inevitable.

As for using extended grapheme clusters, it sounds a little bit iffy: maybe possible to use correctly, maybe not, because they’re not stable over time. That style of thing has created some fascinating bugs, like (a few years ago) index corruption in PostgreSQL due to collation changes.

Unicode scalar values are technically safe: you can’t introduce invalid Unicode. But you can definitely still end up with nonsense.

> We made emoji an atomic node type.

That avoids problems for emoji, but leaves the underlying hazard untouched. I imagine it could still theoretically occur with other text, probably CJK. But probably only theoretically.

> This splits by grapheme clusters rather than code units. No orphaned surrogates, no split emoji. It's what .slice() should have been doing all along, but of course UTF-16 predates emoji by decades.

I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split. UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough). But the concept of multiple scalars contributing to one logical unit was always inevitable.
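To make the three granularities in this comment concrete, here is a minimal sketch. It is not the code from the article: the sample string and variable names are assumptions chosen for illustration. It cuts the same string by UTF-16 code unit, by Unicode scalar value, and by extended grapheme cluster (via `Intl.Segmenter`).

```javascript
// Illustrative sample (an assumption, not from the article):
// "h", "i", then THUMBS UP (U+1F44D) followed by a skin-tone modifier (U+1F3FD).
const text = "hi\u{1F44D}\u{1F3FD}";

// 1. UTF-16 code units: this is what String.prototype.slice() counts.
//    The cut lands inside the surrogate pair and leaves a lone high surrogate.
const byCodeUnit = text.slice(0, 3);

// 2. Unicode scalar values: string iteration yields whole code points,
//    so the result is always valid Unicode, but the modifier is dropped.
const byScalar = Array.from(text).slice(0, 3).join("");

// 3. Extended grapheme clusters: the emoji plus its modifier is one unit.
//    Cluster boundaries depend on the Unicode version the runtime ships with.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const byGrapheme = Array.from(segmenter.segment(text), (s) => s.segment)
  .slice(0, 3)
  .join("");

console.log(JSON.stringify(byCodeUnit)); // lone surrogate \ud83d at the end
console.log(byScalar);                   // thumbs up without the skin tone
console.log(byGrapheme);                 // the full original string
```

The middle case is the "valid but nonsense" outcome: nothing is corrupted at the encoding level, yet the emoji no longer means what the author typed. The third case depends on the segmentation rules of the runtime's Unicode data, which is the stability concern raised above.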
| ▲ | rectang 17 minutes ago | |
> I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough

Surely certain people did know, but those people weren't in a position to do anything about it.

Specifically, there were surely people who knew that the original "Unicode" (it wasn't called UCS-2 yet) was insufficient for complete expression of Asian languages, because historical Chinese place names, Japanese nicknames, and so on were not included. There were also many people who objected to Han unification, which is a different problem. But all of these objections were discarded because of the overwhelming mandate for a fixed-width encoding.
| ▲ | georgemandis 25 minutes ago | |
> I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

Yeah, I think that's fair. I didn't really think this through as I was writing it.

I'm not even so sure "ending up with nonsense" here is the worst outcome. It might be unavoidable with this approach, and if that had been the only problem this bug might have been less memorable.

The real problem, which I didn't articulate or emphasize particularly well, was that these invalid surrogate pairs were getting passed into `encodeURIComponent` somewhere deep in the stack, which was choking catastrophically on them. That was the "real" bug at the end of the day, but the invalid surrogate pairs and the way they were getting created along the way were a fun journey to untangle.
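For anyone who hasn't hit this failure mode: `encodeURIComponent` throws a `URIError` when the input contains a lone surrogate. The sketch below reproduces that and shows one possible guard. The helper names (`toWellFormedCompat`, `safeEncodeURIComponent`) are hypothetical, and replacing lone surrogates with U+FFFD is an assumption about what a fix could look like, not a description of the fix used in the article.

```javascript
// Matches a high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one. No "u" flag, so it works on UTF-16 code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Hypothetical helper: uses String.prototype.toWellFormed() (ES2024) where
// available, otherwise falls back to the regex above.
function toWellFormedCompat(str) {
  return typeof str.toWellFormed === "function"
    ? str.toWellFormed()
    : str.replace(LONE_SURROGATE, "\uFFFD");
}

// Hypothetical helper: never throws on lone surrogates, at the cost of
// replacing each one with U+FFFD (the replacement character).
function safeEncodeURIComponent(str) {
  return encodeURIComponent(toWellFormedCompat(str));
}

// A code-unit slice that cuts a surrogate pair in half.
const broken = "hi\u{1F44D}".slice(0, 3);

let threwURIError = false;
try {
  encodeURIComponent(broken);
} catch (err) {
  threwURIError = err instanceof URIError;
}

const encoded = safeEncodeURIComponent(broken);

console.log(threwURIError); // true
console.log(encoded);       // "hi%EF%BF%BD"
```

A guard like this only stops the crash. The half-emoji is still lost, so the slicing that creates the lone surrogate in the first place remains the thing to fix.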