willtemperley 24 minutes ago

Maybe someone can explain why an encoder would ever create the padding bytes allowed in LEB128. I contributed the parser for LEB128 in apple/swift-binary-parsing and I’m still none the wiser. I’m genuinely mystified.

axod a minute ago

Maybe you want to byte align some data, or pack to a certain size but keep compat. I think they're going to be rare cases, but I can see it being used.

cornstalks 9 minutes ago

It allows you to fill in padding in a buffer. For example, all data in a buffer will be interpreted by a downstream system, and someone pre-calculated the size of that buffer. Rather than encode everything twice (once to figure out the exact size needed, and a second time to actually populate the buffer) the buffer size was calculated using foreknowledge of how many values would be written to that buffer and then just pessimistically assuming all of them are max-size so writing will never fail. Another situation is when you're rewriting part of an already-encoded file. If you want to change a bit of payload then using padding bytes gives you more flexibility so you can do that without having to do any memcpy into a new buffer.

It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.
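The in-place rewrite case can be sketched as follows. This is only an illustration: `patch_uleb128_in_place` is a hypothetical helper, not something from Matroska tooling or any library, and it assumes an unsigned LEB128 field whose new value fits in the width already occupied.

```python
def patch_uleb128_in_place(buf: bytearray, offset: int, new_value: int) -> None:
    """Overwrite the unsigned LEB128 field at `offset` without changing its width.

    Hypothetical helper for illustration. Raises ValueError if the new value
    needs more bytes than the existing field occupies.
    """
    # Measure the existing field: continuation bytes have the high bit set.
    width = 1
    while buf[offset + width - 1] & 0x80:
        width += 1

    if new_value < 0 or new_value >> (7 * width):
        raise ValueError("new value does not fit in the existing field")

    # Re-encode using exactly `width` bytes; unused high groups become
    # padding bytes (0x80 continuation bytes followed by a final 0x00).
    for i in range(width):
        byte = (new_value >> (7 * i)) & 0x7F
        if i < width - 1:
            byte |= 0x80
        buf[offset + i] = byte


buf = bytearray(b"\xe5\x8e\x26\xff")   # 624485 followed by one unrelated byte
patch_uleb128_in_place(buf, 0, 1)       # the field stays three bytes wide
```

Nothing after the field moves, which is the whole point: no memcpy and no re-layout of the rest of the buffer.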

scottlamb 10 minutes ago

I can think of two reasons.

The first is what they describe here: as an attack. It's like asking why anyone would ever overflow a buffer with shellcode.

The second is that they are implementing a spec that requires appending a varint-prefixed field to a buffer but don't really care about the space optimization, don't know the field's length when they start appending it, and don't want to put the field into a second, temporary buffer or slide it down into place. https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08f47081... vs. say, my own code, which does the slide: https://github.com/scottlamb/retina/blob/6972ac4261ce7bf5b58...

esrauch 16 minutes ago

Let's say you are writing into a byte[] and have a LEB128 length prefix followed by a payload, but determining the length actually involves nontrivial encoding work. For example, you have a UTF-16 string and want to write out a UTF-8 string: you go over the characters and write them out, but the UTF-8 length is not known without doing all of that work.

If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length prefix afterwards.

But you actually don't know how many bytes the prefix will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).

If you allow overlong representations, you can reserve a few bytes, and sometimes some of them will just be effective no-op bytes. If you don't, you won't be able to.
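A minimal sketch of that reserve-then-backfill pattern, assuming an unsigned LEB128 prefix and a 4-byte reservation. The function names and the `reserve` default are made up for illustration, and the per-character loop just stands in for real streaming transcoding.

```python
def encode_uleb128_padded(value: int, width: int) -> bytes:
    """Encode `value` as unsigned LEB128 using exactly `width` bytes.

    Hypothetical helper: high groups that are not needed are emitted as
    padding (continuation bytes of 0x80, then a final 0x00).
    """
    if value < 0 or value >> (7 * width):
        raise ValueError("value does not fit in the reserved width")
    out = bytearray()
    for i in range(width):
        byte = (value >> (7 * i)) & 0x7F
        if i < width - 1:
            byte |= 0x80  # continuation bit on every byte except the last
        out.append(byte)
    return bytes(out)


def write_length_prefixed_utf8(text: str, reserve: int = 4) -> bytes:
    """Write a length prefix plus UTF-8 payload in a single pass."""
    buf = bytearray(reserve)            # placeholder for the length prefix
    for ch in text:                     # stand-in for streaming transcoding
        buf += ch.encode("utf-8")
    payload_len = len(buf) - reserve    # only known after the work is done
    buf[:reserve] = encode_uleb128_padded(payload_len, reserve)
    return bytes(buf)


encoded = write_length_prefixed_utf8("héllo")
```

Here the payload is 6 bytes, so the prefix comes out as `86 80 80 00`: one meaningful byte and three that only exist to fill the reservation. A 4-byte reservation covers lengths below 2^28.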

boricj 12 minutes ago

Laziness probably. Maybe there's an argument if you want to avoid branches and just blast the integer out in a fixed number of statements/instructions/bytes, but that sounds a bit fringe.

I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.

layer8 17 minutes ago

The issue is that non-unique encodings are an attack vector, because parsers may in practice behave differently for noncanonical (or nominally invalid) encodings.
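For a parser that wants to close this off, one possible check (a sketch for unsigned LEB128 only; `is_canonical_uleb128` is a hypothetical name) is to reject any multi-byte encoding whose final byte is zero, since that byte contributes no bits to the value.

```python
def is_canonical_uleb128(encoded: bytes) -> bool:
    """Return True if `encoded` is exactly one minimal unsigned LEB128 integer."""
    if not encoded:
        return False
    # Every byte except the last must carry the continuation bit,
    # and the last byte must not.
    if any((b & 0x80) == 0 for b in encoded[:-1]) or (encoded[-1] & 0x80):
        return False
    # A trailing 0x00 after at least one other byte is pure padding.
    return len(encoded) == 1 or encoded[-1] != 0x00


checks = [
    is_canonical_uleb128(b"\x00"),              # zero, minimal
    is_canonical_uleb128(b"\x80\x00"),          # zero, padded
    is_canonical_uleb128(b"\xe5\x8e\x26"),      # 624485, minimal
    is_canonical_uleb128(b"\x86\x80\x80\x00"),  # 6, padded to four bytes
]
```

Signed LEB128 would need a different rule, because there the redundant final byte is 0x00 or 0x7F depending on the sign.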

Chaosvex 21 minutes ago

You wouldn't. It's a strange argument that can be countered with, "maybe don't do that?"

willtemperley 18 minutes ago

So why does the spec allow it? Like a good engineer I read the spec and tested against the over-wide example encodings given.

Chaosvex 9 minutes ago

Because it's not a real standard and there is no blessed RFC for it. The DWARF spec is as close as you'll get and it says, "The integer zero is a special case, consisting of a single zero byte." So in a way, it doesn't.

Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.
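For reference, a decoder of roughly that size might look like the sketch below. It is not taken from any particular implementation; the `max_bytes` cap of 10 is an assumption that suits 64-bit values, and the function handles unsigned LEB128 only.

```python
def decode_uleb128(data: bytes, offset: int = 0, max_bytes: int = 10):
    """Decode one unsigned LEB128 integer; return (value, bytes_consumed).

    Padding bytes are accepted: they just OR zero bits into the result.
    """
    result = 0
    for i in range(max_bytes):
        if offset + i >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[offset + i]
        result |= (byte & 0x7F) << (7 * i)
        if (byte & 0x80) == 0:      # high bit clear marks the last byte
            return result, i + 1
    raise ValueError("LEB128 value longer than max_bytes")


minimal = decode_uleb128(b"\xe5\x8e\x26")   # (624485, 3)
padded = decode_uleb128(b"\x80\x00")        # (0, 2): zero with one padding byte
```

The length cap is what keeps an endless run of 0x80 bytes from being an issue; the padded and minimal forms otherwise go through the same code path.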

Edit: to clarify, I was talking about the author's argument being strange, not yours.