chrismorgan | 7 days ago
> it doesn't say "codepoints" as an alternative solution. That was just my assumption …

On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8, and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)

> The problem will be the same if you have to reconstruct the grapheme clusters eventually.

In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.

> You don't want that if you e.g. have an index for fulltext search.

Your text search won’t be splitting by grapheme clusters; it’ll be doing word segmentation instead.
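The three counts being distinguished here (code points, UTF-8 code units, UTF-16 code units) can be seen directly in Python — a minimal sketch, using the fact that Python's `str` is indexed by code point while the encoded forms expose code units:

```python
# "é" written as two code points: "e" + U+0301 (combining acute accent).
# It is one grapheme cluster, but the lower-level counts all differ.
s = "e\u0301"

print(len(s))                           # 2 code points
print(len(s.encode("utf-8")))           # 3 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units

# An astral-plane character: one code point, but a surrogate pair
# (2 code units) in UTF-16 and 4 code units in UTF-8.
g = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
print(len(g), len(g.encode("utf-8")), len(g.encode("utf-16-le")) // 2)
```

Counting grapheme clusters (1 in both cases above) is the one thing the standard library doesn't do; it needs UAX #29 segmentation from a third-party library such as `regex`.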
torstenvl | 5 days ago | parent
> On the contrary, the article calls code point indexing “rather useless” in the subtitle.

No it doesn't. It says it's "rather useless" that len(str) returns the number of code points, because there's rarely a reason to store the count of code points as the string length. By contrast, storing the number of native code units is useful for storage allocation and concatenation, which are common operations.

Code point indexing is still very useful, depending on context. For example, a majority of Korean speakers (~50 million Internet users) prefer deletion by Jaso unit. Korean EGCs are whole syllables, and making someone retype a whole syllable to change one character is bad UX.
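The Jaso-unit deletion described above can be sketched in Python with Unicode normalization: a precomposed Hangul syllable decomposes (NFD) into its constituent Jamo code points, so backspace can drop just the last Jamo and recompose (NFC). The helper name `delete_jaso` is hypothetical, for illustration only; real input methods implement this inside the IME rather than on the committed string.

```python
import unicodedata

def delete_jaso(s: str) -> str:
    """Hypothetical backspace that removes one Jaso, not a whole syllable.

    Decompose the final syllable into Jamo (NFD), drop the last Jamo,
    and recompose what remains (NFC).
    """
    if not s:
        return s
    jamo = unicodedata.normalize("NFD", s[-1])
    remainder = unicodedata.normalize("NFC", jamo[:-1])
    return s[:-1] + remainder

# "한" (U+D55C) is one EGC but three Jamo: ᄒ + ᅡ + ᆫ.
# Deleting one Jaso leaves "하" instead of erasing the whole syllable.
print(delete_jaso("한"))  # 하
```

A whole-EGC backspace would instead delete `s[-1]` outright, forcing the user to retype the entire syllable — exactly the UX complaint in the comment.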