▲ | jiehong 5 days ago |
Nice! But does that work with non-ASCII characters (aka Unicode)?
▲ | llimllib 5 days ago | parent | next [-]
Kind of! This script assumes you're dealing with a byte slice, which means you've already encoded your Unicode data. If you just encode your string to bytes naïvely, it will probably-mostly still work, but it will get some combining characters wrong if they're represented differently in the two sources you're comparing (e.g., a precomposed e-with-accent code point vs. a plain e followed by a combining accent character).

If you want to be more correct, you'll normalize your UTF string[1], but note that there are four defined normalization forms, so you'll need to choose the one that is the best tradeoff for your particular application and data sources.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...
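As a minimal illustration of the combining-character case (a Python sketch using the stdlib's unicodedata module; this is not the script under discussion):

    import unicodedata

    a = "caf\u00e9"   # precomposed: é is a single code point (U+00E9)
    b = "cafe\u0301"  # decomposed: e followed by a combining acute accent (U+0301)

    print(a == b)                      # False: different code point sequences
    print(a.encode() == b.encode())    # False: different UTF-8 byte slices

    # Normalizing both sides the same way (NFC chosen here) makes them compare equal.
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    print(na == nb)                    # True
    print(na.encode() == nb.encode())  # True: identical byte slices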
▲ | codethief 5 days ago | parent | prev [-]
I suppose generalizing the approach to UTF-32 should be straightforward, but variable-length encodings like UTF-8 and UTF-16 might be more involved(?) Either way, I'm sure BurntSushi found a solution and built it into ripgrep.
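To illustrate why the variable-length case is trickier (a hedged Python sketch, not ripgrep's actual handling): in UTF-8 a single character can span several bytes, so a byte-level edit boundary can land in the middle of a character and leave an invalid fragment.

    s = "héllo"
    b = s.encode("utf-8")
    print(len(s), len(b))   # 5 code points, but 6 bytes: é takes two bytes

    fragment = b[:2]        # b'h\xc3' -- slices é in half
    try:
        fragment.decode("utf-8")
    except UnicodeDecodeError as err:
        print("invalid UTF-8 fragment:", err)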