advisedwang 4 days ago

ULEB128 has several deficiencies for strings:

* It's not "self-synchronizing". If you jump into the middle of a ULEB128 stream and see a byte whose high bit is 0, you can't tell whether it is a single-byte code point or the last byte of a multi-byte code point. You have to back up to determine the meaning of the byte, which is often not viable. In fact you can't even be sure of the first 3 bytes!

* Relatedly, the encodings of some code points are contained in the encodings of others (0AAAAAAA appears inside of 1BBBBBBB 0AAAAAAA), so you can't search for a code point in a string without checking backwards to make sure you are comparing against a whole code point.

* You can't tell the difference between a string cut at the end of a 2-byte code point and one cut partway through a 3-byte code point.
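
To make the second point concrete, here is a rough Python sketch. It assumes each code point is ULEB128-encoded on its own, and the helper names `uleb128_encode` and `encode_string` are made up for illustration, not taken from any library:

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128 (low 7 bits first)."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Hypothetical string encoding: one ULEB128 value per code point."""
    return b"".join(uleb128_encode(ord(ch)) for ch in text)


needle = uleb128_encode(ord("A"))        # b"\x41"
haystack = encode_string(chr(0x20C1))    # b"\xc1\x41", a single code point

# A naive byte search reports a match even though "A" is not in the string.
print(needle in haystack)                # True (false positive)

# The same search over UTF-8 bytes does not match.
print("A".encode("utf-8") in chr(0x20C1).encode("utf-8"))  # False
```

U+20C1 is picked only because its final ULEB128 byte happens to equal the encoding of "A".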

necovek 3 days ago

Sure, it's not perfect, and there are other issues. With UTF-8, the first octet tells you exactly how many more octets you need to read for the rest of the character.
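
As a rough illustration of that UTF-8 property, a sketch in Python (the function name `utf8_sequence_length` is hypothetical, and it only looks at the bit pattern of the lead byte, so it does not reject every invalid lead byte such as 0xC0 or 0xF5):

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total length in bytes of a UTF-8 sequence from its lead byte."""
    if lead < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character.
    raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    # The length predicted from the first byte matches the real length.
    assert utf8_sequence_length(encoded[0]) == len(encoded)
```

With ULEB128 there is no equivalent: you only learn the length by reading until you hit a byte with the high bit clear.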

But issue #2 doesn't seem too bad, since you only need to look one byte backwards.
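
Something like this sketch, assuming the needle is itself a complete ULEB128 encoding (so a match always ends on a code point boundary) and the haystack starts on one; `find_code_point` is a made-up name:

```python
def find_code_point(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first whole-code-point match, or -1."""
    start = 0
    while True:
        i = haystack.find(needle, start)
        if i == -1:
            return -1
        # The match begins a code point only if it is at the very start or
        # the previous byte has its high bit clear (end of a code point).
        if i == 0 or (haystack[i - 1] & 0x80) == 0:
            return i
        start = i + 1  # false positive inside a longer encoding; keep going


# b"\xc1\x41" is one two-byte value; the trailing b"\x41" is a real "A".
print(find_code_point(b"\xc1\x41\x41", b"\x41"))  # 2
print(find_code_point(b"\xc1\x41", b"\x41"))      # -1
```

So the check is cheap, but it does mean a plain substring search is not enough on its own.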

At the same time, for #3, in the middle of a UTF-8 byte stream you need to look backwards as well, for anything but the ASCII (7-bit) code points.
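
For comparison, a sketch of that backwards scan in UTF-8, assuming well-formed input (`utf8_char_start` is a hypothetical helper). Continuation bytes always look like 10xxxxxx, so the scan stops after at most three steps:

```python
def utf8_char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the character containing data[index]."""
    # Continuation bytes have the form 10xxxxxx; step back until we leave them.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "a\u20acb".encode("utf-8")   # b"a\xe2\x82\xacb"
print(utf8_char_start(data, 3))     # 1 (last byte of the euro sign)
print(utf8_char_start(data, 4))     # 4 ("b" is already a start byte)
```

The difference from ULEB128 is that each byte on its own says whether it is a start or a continuation, so you know you have to back up before you do it.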