Unicode Footguns in Python (pythonkoans.substack.com)
33 points by meander_water 13 days ago | 11 comments
renhanxue 7 hours ago
The article has good tips, but Unicode normalization is just the tip of the iceberg. It is almost always impossible to do what your users expect without locale information (different languages and locales sort and compare the same graphemes differently). "What do we mean when we say two strings are equal" can be a surprisingly difficult question to answer, and it's a practical question, not a philosophical one. By the way, try looking up the standardized Unicode case folding algorithm sometime; it is a thing to behold.
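To make the "what does equal mean" point concrete, here is a minimal sketch using only the standard library. The helper name `loosely_equal` is made up for illustration, and NFD plus `str.casefold()` is just one reasonable reading of "caseless, normalization-insensitive" comparison. It knows nothing about locale, which is exactly where the last example falls over.

```python
import unicodedata


def loosely_equal(a: str, b: str) -> bool:
    """Locale-unaware comparison: normalize, case fold, normalize again."""
    def fold(s: str) -> str:
        # Case folding can produce text that is no longer normalized,
        # so normalize once more after folding.
        return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())
    return fold(a) == fold(b)


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(composed == decomposed)                 # False: different code points
print(loosely_equal(composed, decomposed))    # True: same after normalization
print(loosely_equal("Stra\u00dfe", "STRASSE"))  # True: ß folds to "ss"
print(loosely_equal("I", "\u0131"))           # False: a Turkish user may expect a match
```

The dotless i case is the kind of thing that cannot be fixed without knowing the user's language; default case folding deliberately leaves the Turkic mappings out.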
dhosek 7 hours ago
Grapheme count is not a useful number. Even in a monospaced font, you’ll find that the grapheme count doesn’t give you a measurement of width since emoji will usually not be the same width as other characters.
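A rough sketch of why counts and widths diverge, standard library only. `estimated_columns` is a hypothetical helper, not a real API, and it is only an approximation: it uses the East Asian Width property and skips combining marks, while actual rendering depends on the terminal and font. The standard library has no grapheme segmentation, so `len()` here counts code points.

```python
import unicodedata


def estimated_columns(text: str) -> int:
    """Approximate terminal columns: wide/fullwidth count 2, combining marks 0."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on top of the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for sample in ("abc", "日本語", "e\u0301"):
    # Columns: text, code point count, estimated display columns
    print(repr(sample), len(sample), estimated_columns(sample))
```

"abc" and "日本語" both have three code points but the estimate is 3 versus 6 columns, and "e" plus a combining accent is two code points in one column.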
OkayPhysicist 7 hours ago
Frankly, the key takeaway from most problems people run into with Unicode is that there are very, very few operations that are universally well-defined for arbitrary user-provided text. Pretty much the moment you step outside the realm of "receive, copy, save, regurgitate", you're probably going to run into edge cases.
morshu9001 7 hours ago
Unicode footguns, in Python
naIak 7 hours ago
I’m going to trigger some PTSD with this… UnicodeDecodeError
o11c 3 hours ago
I've said this before and I'll say it again: Python 3 got rid of the wrong string type. With `bytes` it was obvious that byte length was not the same as $whatever length, and that was really the only semi-common bug (and was mostly limited to English speakers who are new to programming). All other bugs come from blindly trusting `unicode`, whose bugs are far more subtle and numerous.
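For anyone who hasn't been bitten by either side of this, a small sketch of the two kinds of "length" in question. The sample string is arbitrary; the only calls used are `str.encode` and `unicodedata.normalize`.

```python
import unicodedata

text = "na\u00efve caf\u00e9"                 # "naïve café", precomposed form
decomposed = unicodedata.normalize("NFD", text)  # same text, accents split off

print(len(text), len(text.encode("utf-8")))              # 10 12
print(len(decomposed), len(decomposed.encode("utf-8")))  # 12 14
print(text == decomposed)                                # False
```

The byte-versus-character mismatch is visible on the first line; the second and third show the quieter problem, where two `str` values that look identical on screen differ in both length and equality.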