bombela 2 hours ago
In summary, Unicode code points (characters) are 32 bit. JavaScript manipulates Unicode in UTF-16 for historical reasons: at some point before Unicode, 16 bits were deemed enough (UCS-2). UTF-16 run length encodes 32-bit Unicode codepoints into one or two code units. Splitting in the middle of a codepoint produces one invalid half string and one semantically different half string. Emojis are a sequence of Unicode codepoints producing a single grapheme. Splitting in the middle of a grapheme will produce two valid strings, but with some funky half-baked emoji. So for a text editor it makes sense to split at grapheme boundaries.
chrismorgan an hour ago | parent
> Unicode code points are 32 bit

21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21 bits, so they lopped eleven bits of potential from Unicode (and from UTF-8, so no more six-byte encoding).

> at some point before Unicode

No, in the early days of Unicode.

> run length encodes

Um… what? RLE is a data compression thing; UTF-16 has nothing to do with it.