tralarpa 7 days ago

Fascinating and annoying problem, indeed. In Java, the correct way to iterate over the code points of a string (which are Unicode scalar values as long as the string contains no unpaired surrogates) is to use the IntStream returned by String::codePoints (available since Java 8), but I bet 99.9999% of existing code iterates over 16-bit chars.
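
To make the difference concrete, here is a minimal sketch comparing UTF-16 code units with code points. The class name `CodePointIteration`, the helper `codePointsOf`, and the sample string are made up for illustration; only `String.length`, `String.codePointCount`, and `String.codePoints` are standard API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, not taken from any real code base.
final class CodePointIteration {
    // Formats every code point of s as U+XXXX, in order.
    static List<String> codePointsOf(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());                      // 3 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(codePointsOf(s));                 // [U+0061, U+1F600]
    }
}
```

A loop over `s.charAt(i)` would see the two surrogate halves as separate values, which is the bug being described.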

zahlman 6 days ago

This does not fix the problem. The emoji consists of multiple Unicode characters (in turn represented 1:1 by the integer "code point" values). There is much more to it than the problem of surrogate pairs.
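
As a rough illustration, assuming a family emoji built as a zero-width-joiner sequence (which may not be the emoji under discussion), even correct code point iteration reports several values for what a reader sees as one symbol. The class `EmojiSequence` and its helpers are hypothetical names.

```java
// Hypothetical demo class; the chosen emoji sequence is an assumption.
final class EmojiSequence {
    // U+1F468 MAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL
    static final String FAMILY =
            "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67";

    // Number of UTF-16 code units (what String.length reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(FAMILY));  // 8
        System.out.println(codePoints(FAMILY)); // 5
    }
}
```

Handling surrogate pairs brings the count from 8 down to 5, but the user still perceives a single emoji.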

ivanjermakov 6 days ago

A code point is not a grapheme cluster, and a grapheme cluster is not a character. I bet there is a "50 falsehoods about Unicode" list somewhere.
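
A small sketch of the code point versus cluster distinction, using `java.text.BreakIterator` from the standard library. The class `GraphemeCount` and method `clusters` are illustrative names. The example deliberately uses a simple combining sequence, because how well `BreakIterator` segments emoji sequences depends on the Java version.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo class for counting user-perceived characters.
final class GraphemeCount {
    // Counts the character boundaries BreakIterator finds in s.
    static int clusters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String s = "e\u0301";
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(clusters(s));                     // 1 cluster
    }
}
```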