renhanxue 8 hours ago

The article has good tips, but Unicode normalization is just the tip of the iceberg. It is almost always impossible to do what your users expect without locale information (different languages and locales sort and compare the same graphemes differently). "What do we mean when we say two strings are equal" can be a surprisingly difficult question to answer. It's practical too, not philosophical.

By the way, try looking up the standardized Unicode casefolding algorithm sometime, it is a thing to behold.
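For a rough feel of what case folding buys you over plain lowercasing, here is a minimal Python sketch. It assumes the standard library's `str.casefold()` and `unicodedata.normalize()`; the helper name `caseless_equal` is made up for illustration. It is locale-independent, so it does not address the locale-specific cases mentioned above (Turkish dotted/dotless i, for example).

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using full case folding.

    Hypothetical helper: decompose (NFD), case fold, then decompose again,
    because folding can produce text that is no longer normalized.
    """
    def key(s: str) -> str:
        return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())

    return key(a) == key(b)


# Lowercasing keeps the sharp s, so this naive comparison fails.
naive = "Straße".lower() == "STRASSE".lower()

# Case folding expands the sharp s to "ss", so this comparison succeeds.
folded = caseless_equal("Straße", "STRASSE")

# Precomposed and decomposed forms of an accented letter also compare equal.
accented = caseless_equal("\u00e9", "E\u0301")
```

Here `naive` is `False` while `folded` and `accented` are both `True`. Whether that is the answer your users expect is still a separate question.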

Groxx 8 hours ago

the normalization doc is interesting too imo: https://unicode.org/reports/tr15/

in particular, the differences between NFC and NFKC are "fun", and rather meaningful in many cases. e.g. NFC says that "fi" and "ﬁ" are different and not equal, though the latter is just a ligature of the former and is literally identical in meaning. this applies to ﬃ too. half vs full width Chinese characters are also "different" under NFC. NFKC makes those examples equal though... at the cost of saying "2⁵" is equal to "25".
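a quick way to poke at this yourself, as a sketch using Python's standard `unicodedata` module (the helper `equal_under` is just a name made up for this example, and a fullwidth Latin "A" stands in for the half/full width case):

```python
import unicodedata


def equal_under(form: str, a: str, b: str) -> bool:
    """Return True if a and b are identical after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


LIGATURE_FI = "\ufb01"    # single code point: LATIN SMALL LIGATURE FI
LIGATURE_FFI = "\ufb03"   # single code point: LATIN SMALL LIGATURE FFI
FULLWIDTH_A = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A
TWO_TO_FIFTH = "2\u2075"  # "2" followed by SUPERSCRIPT FIVE

# NFC only applies canonical equivalence, so all of these stay distinct.
nfc_results = [
    equal_under("NFC", "fi", LIGATURE_FI),
    equal_under("NFC", "ffi", LIGATURE_FFI),
    equal_under("NFC", "A", FULLWIDTH_A),
    equal_under("NFC", "25", TWO_TO_FIFTH),
]

# NFKC also applies compatibility mappings, so all of these collapse together,
# including the superscript, which changes the meaning of the text.
nfkc_results = [
    equal_under("NFKC", "fi", LIGATURE_FI),
    equal_under("NFKC", "ffi", LIGATURE_FFI),
    equal_under("NFKC", "A", FULLWIDTH_A),
    equal_under("NFKC", "25", TWO_TO_FIFTH),
]
```

every entry in `nfc_results` is `False` and every entry in `nfkc_results` is `True`, so which form you pick depends on whether you can live with that last one.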

language is fun!