codethief 5 days ago
> If you just encoded your string to bytes naïvely

By "naïvely" I assume you mean you would just plug in UTF-8 bytestrings for haystack & needle, without adjusting the implementation? Wouldn't the code still need to take into account where characters (code points) begin and end, though, in order to prevent incorrect matches?
burntsushi 5 days ago
IDK what "encoded your string to bytes naively" means personally. There is only one way to correctly UTF-8 encode a sequence of Unicode scalar values. In any case, no, this works because UTF-8 is self-synchronizing. As long as both your needle and your haystack are valid UTF-8, the byte offsets returned by the search will always fall on a valid codepoint boundary.

In terms of getting "combining characters wrong," this is a reference to different Unicode normalization forms. To be more precise...

Consider a needle and a haystack, each represented by a sequence of Unicode scalar values (typically a sequence of unsigned 32-bit integers). Now encode them to UTF-8 (a sequence of unsigned 8-bit integers) and run a byte level search as shown by the OP here. That will behave as if you've executed the search on the sequence of Unicode scalar values. So semantically, a "substring search" is a "sequence of Unicode scalar values search."

At the semantic level, this may or may not be what you want. For example, if you always want `office` to find substrings like `oﬃce` (spelled with the `ﬃ` ligature, U+FB03) in your haystack, then this byte level search will not do what you want.

The standard approach for performing a substring search that accounts for normalization forms is to convert both the needle and haystack to the same normal form and then execute a byte level search.

(One small caveat is when the needle is an empty string. If you want to enforce correct UTF-8 boundaries, you'll need to handle that specially.)
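To make the self-synchronizing claim concrete, here's a minimal Rust sketch. The naive scanner is just a stand-in for whatever byte-level searcher you'd actually use (e.g. `memmem::find` from the `memchr` crate); the point is that it looks at raw bytes only and still can't return an offset in the middle of a codepoint:

    /// Naive byte-level substring search: byte offset of the first
    /// occurrence of `needle` in `haystack`, if any.
    fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        if needle.is_empty() {
            return Some(0);
        }
        haystack.windows(needle.len()).position(|w| w == needle)
    }

    fn main() {
        // Both strings are valid UTF-8; `ï` is the two-byte codepoint U+00EF.
        let haystack = "just encoded naïvely";
        let needle = "naïvely";

        let idx = find_bytes(haystack.as_bytes(), needle.as_bytes()).unwrap();

        // One valid UTF-8 sequence can only match inside another starting
        // at a codepoint boundary, so the offset is safe to slice on.
        assert!(haystack.is_char_boundary(idx));
        assert_eq!(&haystack[idx..idx + needle.len()], needle);
    }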
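And a sketch of the normalize-then-search approach (assuming the `unicode-normalization` crate; `find_normalized` is a made-up helper name). NFKC is used here because it folds compatibility characters like the `ﬃ` ligature; NFC or NFD would be the usual choice if you only care about canonical equivalence, i.e. combining characters:

    use unicode_normalization::UnicodeNormalization;

    /// Convert both sides to the same normal form, then do a plain
    /// byte-level search. Note the returned offset is into the
    /// *normalized* haystack, not the original one.
    fn find_normalized(haystack: &str, needle: &str) -> Option<usize> {
        let haystack: String = haystack.nfkc().collect();
        let needle: String = needle.nfkc().collect();
        haystack.find(&needle)
    }

    fn main() {
        // `oﬃce` spells "ffi" with the single ligature codepoint U+FB03,
        // so a raw byte search for `office` would miss it.
        let haystack = "back to the oﬃce";
        assert_eq!(find_normalized(haystack, "office"), Some(12));
    }

The empty-needle caveat above applies here too: an empty needle "matches" at every byte offset, including ones in the middle of a codepoint, so a searcher that reports all match positions needs to special-case it.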