necovek 4 days ago
Good point, I didn't pay close attention. In a sense, it's a shame this encoding wasn't structured like UTF-8 or, the other way around, that UTF-8 wasn't structured in this more generic way.
advisedwang 4 days ago
ULEB128 has several deficiencies for strings:
* It's not "self-synchronizing". If you jump into the middle of a ULEB128 stream and see a byte starting with a 0 bit, you can't tell whether it is a single-byte code point or the last byte of a multi-byte code point. You have to back up to determine the meaning of the byte, which is often not viable. In fact you can't even be sure of the first 3 bytes!
* Relatedly, the encodings of some code points are contained in the encodings of others: 0AAAAAAA appears inside (1BBBBBBB 0AAAAAAA), so you can't search for a code point in a string without checking backwards to make sure you are comparing against a whole code point.
* You can't tell the difference between a string cut at the end of a 2-byte code point and one cut partway through a 3-byte code point.
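To make the first two points concrete, here is a rough Python sketch. It assumes a hypothetical string format in which each code point is simply written as one ULEB128 integer; the helper names (`uleb128_encode`, `encode_text`, `uleb128_decode_all`, `is_utf8_continuation`) are made up for illustration and do not come from any library.

```python
def uleb128_encode(n):
    """Encode a non-negative integer as ULEB128 (7 bits per byte, low bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte of this value
            return bytes(out)


def encode_text(s):
    """Hypothetical string format: one ULEB128 integer per code point."""
    return b"".join(uleb128_encode(ord(c)) for c in s)


def uleb128_decode_all(data):
    """Decode consecutive ULEB128 values, trusting that data starts on a boundary."""
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            values.append(cur)
            cur, shift = 0, 0
    return values


def is_utf8_continuation(b):
    """UTF-8 continuation bytes always look like 10xxxxxx."""
    return (b & 0xC0) == 0x80


# Substring problem: U+2080 encodes as 0x80 0x41, and 0x41 alone is "A".
subscript_zero = encode_text("\u2080")
false_match = encode_text("A") in subscript_zero              # naive byte search
utf8_match = "A".encode("utf-8") in "\u2080".encode("utf-8")  # same search in UTF-8

# Synchronization problem: drop the first byte of a 3-byte code point (U+4E2D)
# and the remainder still decodes "cleanly", just to the wrong code point.
zhong = encode_text("\u4e2d")
mid_stream = uleb128_decode_all(zhong[1:])

print(subscript_zero, false_match, utf8_match)
print(zhong, mid_stream)
```

With UTF-8, the two bytes left over after dropping the lead byte of U+4E2D both satisfy `is_utf8_continuation`, so a reader landing there knows to skip forward to the next lead byte. The ULEB128 tail gives no such signal.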