UTF-8 the first byte isn't just 1xxxxxxx for continuation, it's either 110xxxxx, 1110xxxx, or 11110xxx depending on how many bytes that character will take up.
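
To make the bit patterns above concrete, here is a minimal sketch that classifies a single byte by its high bits. The function name `utf8_byte_kind` and the returned labels are made up for illustration, and the check looks only at the prefix bits, so it does not reject overlong or otherwise invalid sequences.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify one byte (0..255) by its UTF-8 prefix bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx: never the first byte
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx: starts a 2-byte character
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx: starts a 3-byte character
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx: starts a 4-byte character
    return "invalid"           # 11111xxx does not occur in UTF-8


# Classify each byte of a 3-byte character (the euro sign).
kinds = [utf8_byte_kind(b) for b in "€".encode("utf-8")]
```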

necovek 4 days ago | parent [-]

Good point, I didn't pay close attention.

In a sense, it's a shame this encoding wasn't structured like UTF-8 or, the other way around, that UTF-8 wasn't structured in this more generic way.

advisedwang 4 days ago | parent [-]

ULEB128 has several deficiencies for strings:

* It's not "self-synchronizing". If you jump into the middle of a ULEB128 stream and see a byte starting with 0, you can't tell whether it is a single-byte code point or the last byte of a multi-byte code point. You have to back up to determine the meaning of the byte, which is often not viable. In fact you can't even be sure of the first 3 bytes!

* Relatedly, some code points are subsets of other code points (0AAAAAAA appears inside of 1BBBBBBB 0AAAAAAA), so you can't search for a code point in a string without checking backwards to make sure you are comparing against a whole code point.

* You can't tell the difference between a string cut at the end of a 2-byte code point and one cut part way through a 3-byte code point.
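
The search problem in the second bullet can be sketched as follows. This assumes each code point is stored as a standard ULEB128 integer (low 7 bits first, high bit set when more bytes follow); `uleb128_encode` and `encode_string` are hypothetical helper names, not part of any library.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128 (little-endian base 128)."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(low7)         # final byte: high bit clear
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Assumed string format: each code point ULEB128-encoded, concatenated."""
    return b"".join(uleb128_encode(ord(c)) for c in s)


haystack = encode_string("₀")           # U+2080 encodes as 0x80 0x41
needle = encode_string("A")             # U+0041 encodes as 0x41
false_match = needle in haystack        # naive byte search finds a bogus "A"
utf8_match = "A".encode("utf-8") in "₀".encode("utf-8")  # UTF-8: no match
```

A naive byte search reports "A" inside a string that contains only "₀", because the last byte of the two-byte encoding is identical to the one-byte encoding of "A". The same search over the UTF-8 bytes finds nothing.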

necovek 3 days ago | parent [-]

Sure, it's not perfect, and there are other issues. With UTF-8, you know exactly how many octets you need to read for the rest of the character. But issue #2 doesn't seem too bad, since you only need to look one byte backwards. At the same time, for #3, in the middle of a UTF-8 bytestream you need to look backwards as well for anything but the ASCII (7-bit) code points.
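
For comparison, looking backwards in UTF-8 could look like the sketch below. `utf8_char_start` is a made-up name, and the code assumes the input is well-formed UTF-8, in which case it steps back over at most three continuation bytes.

```python
def utf8_char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i]."""
    # Continuation bytes look like 10xxxxxx; skip back until we leave them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "a€b".encode("utf-8")         # 61 | E2 82 AC | 62
start = utf8_char_start(data, 3)     # index 3 is the last byte of "€"
```

Each byte announces by itself whether it is a continuation byte, so no context before the character's lead byte is needed.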