▲ | jiehong 5 days ago |
Nice! But does that work with non-ASCII characters (aka Unicode)?
▲ | llimllib 5 days ago | parent | next [-]
Kind of! This script assumes you're dealing with a byte slice, which means you've already encoded your Unicode data. If you just encode your string to bytes naïvely, it will probably-mostly still work, but it will get some combining characters wrong if they're represented differently in the two sources you're comparing (e.g., a precomposed e-with-accent code point vs. a plain e followed by a combining accent character).

If you want to be more correct, you'll normalize your UTF string[1], but note that there are four defined normalization forms, so you'll need to choose the one that is the best tradeoff for your particular application and data sources.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...
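As a minimal illustration of the combining-character case (a Python sketch using the stdlib's unicodedata module; this is not the script under discussion):

    import unicodedata

    a = "caf\u00e9"   # precomposed: é is a single code point (U+00E9)
    b = "cafe\u0301"  # decomposed: e followed by a combining acute accent (U+0301)

    print(a == b)                      # False: different code point sequences
    print(a.encode() == b.encode())    # False: different UTF-8 byte slices

    # Normalizing both sides the same way (NFC chosen here) makes them compare equal.
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    print(na == nb)                    # True
    print(na.encode() == nb.encode())  # True: identical byte slices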
▲ | codethief 5 days ago | parent | prev [-]
I suppose generalizing the approach to UTF-32 should be straightforward, but variable-length encodings like UTF-8 and UTF-16 might be more involved(?) Either way, I'm sure BurntSushi found a solution and built it into ripgrep.
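To illustrate why the variable-length case is trickier (a hedged Python sketch, not ripgrep's actual handling): in UTF-8 a single character can span several bytes, so a byte-level edit boundary can land in the middle of a character and leave an invalid fragment.

    s = "héllo"
    b = s.encode("utf-8")
    print(len(s), len(b))   # 5 code points, but 6 bytes: é takes two bytes

    fragment = b[:2]        # b'h\xc3' -- slices é in half
    try:
        fragment.decode("utf-8")
    except UnicodeDecodeError as err:
        print("invalid UTF-8 fragment:", err)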