advisedwang 4 days ago
ULEB128 has several deficiencies for strings:

* It's not "self-synchronizing". If you jump into the middle of a ULEB128 stream and see a byte starting with 0, you can't tell whether it is a single-byte code point or the last byte of a multi-byte code point. You have to back up to determine the meaning of the byte, which is often not viable. In fact you can't even be sure of the first 3 bytes!

* Relatedly, the encodings of some code points are substrings of the encodings of others (0AAAAAAA appears inside 1BBBBBBB 0AAAAAAA), so you can't search for a code point in a string without checking backwards to make sure you are comparing against a whole code point.

* You can't tell the difference between a string cut at the end of a 2 byte code point and one cut part way through a 3 byte code point.
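To make the substring problem concrete, here is a minimal Python sketch. It assumes a hypothetical scheme where a string is stored as the ULEB128 encoding of each code point in turn; the helper names `uleb128` and `encode_string` are made up for illustration and do not come from any library. Under that assumption U+20AC encodes as the bytes AC 41, and 41 is also the entire encoding of "A", so a naive byte search finds a character that is not in the text:

```python
def uleb128(n: int) -> bytes:
    """Encode a non-negative integer as ULEB128: 7 bits per byte, low group first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte of this value
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Hypothetical string encoding: ULEB128 of each code point, concatenated."""
    return b"".join(uleb128(ord(ch)) for ch in text)


haystack = encode_string("\u20ac")  # EURO SIGN, U+20AC
needle = encode_string("A")         # U+0041

print(haystack.hex())               # ac41
print(needle in haystack)           # True, a false positive
print("A" in "\u20ac")              # False, the text has no "A"
```

A correct search over such a stream would have to check that the byte before each candidate match has its high bit clear (or that the match is at the start of the stream).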
necovek 3 days ago | parent
Sure, it's not perfect, and there are other issues. With UTF-8, the first byte tells you exactly how many octets you need to read for the rest of the character. But issue #2 doesn't seem too bad, since you only need to look one byte backwards. As for #3, in the middle of a UTF-8 byte stream you also need to look backwards for anything but the ASCII (7-bit) code points.
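For comparison, a rough Python sketch of what looking backwards means in UTF-8. The function names `utf8_char_start` and `utf8_char_len` are invented for this example, and the code assumes well-formed input. Continuation bytes always look like 10xxxxxx, so from any offset you step back over at most three of them to reach a lead byte, and the lead byte gives the length:

```python
def utf8_char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that covers buf[i]."""
    # Continuation bytes have the bit pattern 10xxxxxx; skip back over them.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def utf8_char_len(lead: int) -> int:
    """Return the encoded length announced by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1                  # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2                  # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                  # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                  # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


buf = "a\u20acb".encode("utf-8")  # b"a\xe2\x82\xacb"
start = utf8_char_start(buf, 3)   # offset 3 is the last byte of the euro sign
print(start, utf8_char_len(buf[start]))  # 1 3
```

So a backward scan is still needed for non-ASCII characters, but every byte identifies its own role, which is the property a ULEB128 stream lacks.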