▲ | codethief 5 days ago | |||||||
I suppose generalizing the approach to UTF-32 should be straightforward, but variable-length encodings like UTF-8 and UTF-16 might be more involved(?) Either way, I'm sure BurntSushi found a solution and built it into ripgrep. | ||||||||
▲ | burntsushi 5 days ago | parent [-] | |||||||
ripgrep always deals with UTF-8. When it sees a different encoding, like UTF-16, ripgrep first transcodes to UTF-8 and then searches. This is absolutely in part because of all of the byte oriented optimizations that are baked into ripgrep (and its regex engine). Note that I said a part. Making ripgrep (and its regex engine) work on things other than a sequence of bytes is far more difficult than just porting a bunch of SIMD algorithms. There are also many optimizations and architectural constraints in the code based on the alphabet size. That is, with 8-bit integers, its alphabet size is 256. With 16-bit integers, the alphabet size is 65,536. | ||||||||
|