Wouldn't you only need to read backwards at most 3 bytes to see if you were currently at a continuation byte? With a max multi-byte size of 4 bytes, if you don't see a multi-byte start character by then you would know it's a single-byte char.
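
To make the backward scan concrete, here is a minimal Python sketch. The function name `char_start` is made up for illustration, and it assumes the whole buffer is in memory so that stepping backwards is possible at all.

```python
def char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that contains buf[i]."""
    # Continuation bytes have the form 0b10xxxxxx. A code point has at most
    # 3 of them, so looking at buf[i] and up to 3 bytes before it is enough.
    for back in range(4):
        j = i - back
        if j < 0:
            break
        if buf[j] & 0xC0 != 0x80:
            return j  # ASCII byte or multi-byte start byte
    raise ValueError("no start byte found; input is not valid UTF-8")
```

For example, in `"a€😀".encode("utf-8")` the emoji occupies bytes 4 to 7, so `char_start(buf, 7)` returns 4.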

I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you naively slice an array of UTF-8 bytes, a UTF-8 aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.
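
As a rough illustration of that recovery, Python's built-in decoder can drop the malformed edges of a naive slice. The helper name `decode_lossy` and the sample string are just assumptions for the example.

```python
def decode_lossy(chunk: bytes) -> str:
    """Decode UTF-8, silently dropping bytes that do not form a complete code point."""
    return chunk.decode("utf-8", errors="ignore")


data = "héllo wörld".encode("utf-8")
# This slice starts on the second byte of "é" and ends on the first byte of "ö".
chunk = data[2:9]
recovered = decode_lossy(chunk)  # the stray bytes at both ends are dropped
```

Using `errors="replace"` instead would mark the damaged ends with U+FFFD rather than dropping them.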

Sharlin 4 days ago | parent [-]

It’s not always possible to read backwards.

Dylan16807 4 days ago | parent [-]

Okay, so you seek by 3 fewer bytes.

Or you accept that if you're randomly losing chunks, you might lose an extra 3 bytes.

The real problem is that seeking a few bytes won't work with EMBL. If continuation bytes store 8 payload bits, you can get into a situation where every single byte could be interpreted as a multi-byte start character and there are 2 or 3 possible messages that never converge.
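
A toy sketch of that failure mode, in Python. This is not EMBL itself but a hypothetical encoding invented for the example: the first byte alone decides the length (1 to 4 bytes), and continuation bytes may hold any value.

```python
def boundaries(stream: bytes, start: int) -> list[int]:
    """Offsets a decoder treats as character starts when it begins at `start`."""
    out = []
    i = start
    while i < len(stream):
        out.append(i)
        lead = stream[i]
        # Hypothetical length rule: every byte value is a legal lead byte,
        # and nothing marks a byte as "continuation only".
        if lead < 0x80:
            length = 1
        elif lead < 0xC0:
            length = 2
        elif lead < 0xE0:
            length = 3
        else:
            length = 4
        i += length
    return out


stream = bytes([0xC0]) * 12  # every byte looks like a 3-byte lead
parses = [boundaries(stream, s) for s in range(3)]
```

Here the three starting offsets give three parses whose boundaries never coincide, so reading further never tells the decoder which one is right.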

Sharlin 4 days ago | parent [-]

The point is that you don’t have a "seek" operation available. You are given a bytestream and aren’t told if you’re at the start, in a valid position between code points, or in the middle of a code point. UTF-8’s self-synchronizing property means that by reading a single byte you immediately know if you’re in the middle of a code point, and that by reading and discarding at most two additional bytes you’re synchronized and can start/resume decoding. That wouldn’t be possible if continuation bytes used all the bits for payload.
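
A forward-only sketch of that resynchronization in Python. The name `skip_to_boundary` is hypothetical, and the code assumes the stream is otherwise valid UTF-8 that was merely cut at an arbitrary byte.

```python
def skip_to_boundary(chunk: bytes) -> int:
    """Count the leading bytes to discard before decoding can start."""
    n = 0
    # 0b10xxxxxx marks a continuation byte. Valid UTF-8 has at most 3 in a
    # row, so at most 3 bytes are read and discarded, with no look-behind.
    while n < len(chunk) and n < 3 and chunk[n] & 0xC0 == 0x80:
        n += 1
    return n


def decode_from_anywhere(chunk: bytes) -> str:
    """Drop a partial leading code point, then decode the rest."""
    return chunk[skip_to_boundary(chunk):].decode("utf-8")
```

For instance, with `data = "😀ok".encode("utf-8")`, `decode_from_anywhere(data[1:])` discards the three leftover emoji bytes and yields `"ok"`.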

Dylan16807 3 days ago | parent [-]

Yes, the point is being able to synchronize. But it doesn't matter if it takes 1 byte or 3 bytes to synchronize. And being unable to read backwards is not a problem. (EMBL doesn't synchronize in three bytes but other encodings do.)