UncleEntity 9 days ago

> In other words, a UPB parser is actually configuration for an interpreter VM, which executes Protobuf messages as its bytecode.

This is kind of confusing: is the VM crafted at runtime to parse a single protobuf message type, and only that message type? The Second Futamura Projection, I suppose...

Or is the VM designed around generic protobuf messages, so it can parse any random message, but only if it's a protobuf message?

I've been working on the design of a similar system, but for general binary parsing (think bison/yacc for binary data), and I hadn't even considered doing data over a specialized VM vs. bytecode+data over a general VM. Honestly, since it's designed around 'maximum laziness' (it just parses/verifies and creates metadata over the input, so you only pay for decoding the bytes you actually use), and I/O overhead is way greater than VM dispatch, trying this out is probably one of those "premature optimization is the root of all evil" cases, but it's intriguing nonetheless.

haberman 3 days ago | parent | next [-]

I think I can shed some light on this, as the creator and lead of upb.

Calling a Protobuf parser an "interpreter VM" is a little bit of rhetorical flourish. It comes from some deep structural similarities between the two, which I first wrote about in an article a few years back: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...

> It may seem odd to compare interpreter loops to protobuf parsers, but the nature of the protobuf wire format makes them more similar than you might expect. The protobuf wire format is a series of tag/value pairs, where the tag contains a field number and wire type. This tag acts similarly to an interpreter opcode: it tells us what operation we need to perform to parse this field’s data. Like interpreter opcodes, protobuf field numbers can come in any order, so we have to be prepared to dispatch to any part of the code at any time.

This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.

The tricky part is that the set of "case" labels for a Protobuf parser is message-specific and defined by the fields in the schema. How do we accommodate that?

The traditional answer was to generate a function per message and use the schema's field numbers as the case labels. You can see an example of that here (in C++): https://github.com/protocolbuffers/protobuf/blob/f763a2a8608...

More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.

The trick then becomes how to make this table-driven dispatch as fast as possible, to simulate what the switch() statement would have done. That question is what I cover at length in the article mentioned above.

anonymoushn 3 days ago | parent | next [-]

Really great. I wonder: for the "types encoded as code" approach, is there any benefit to fast paths for data whose fields arrive in ascending order? For some JSON parsers with types encoded as code, I have observed a speedup from either hard-coding a known key order, or assuming keys arrive in some order and providing a fallback for when an unexpected key is encountered. For users who are stuck with protobuf forever (because of various services using it, and various data being encoded this way), the historical data could plausibly be canonicalized and written back in large chunks as it is accessed, so that one need not pay the entire cost of canonicalizing it all at once. But of course the icache concerns are still just as bad.

haberman 3 days ago | parent [-]

This kind of "expected next field" optimization has a long history in protobuf, but results are mixed.

The generated code in C++ used to check for the expected next field before falling back to the switch() (example here: https://github.com/protocolbuffers/protobuf/blob/460e7dd7c47...) but this was removed in 2016 when load tests found that it hurt more than it helped.

One tricky part of making this optimization work is making a good guess about what the next field should be. Miguel's article alludes to this:

> Each field specifies which fields to try next. This allows the compiler to perform field scheduling, by carefully deciding which order to try fields in based both on their declaration order and a rough estimation of their “hotness”, much like branch scheduling happens in a program compiler. This avoids almost all of the work of looking up the next field in the common case, because we have already pre-loaded the correct guess.

> I haven’t managed to nail down a good algorithm for this yet, but I am working on a system for implementing a type of “branch prediction” for PGO, that tries to provide better predictions for the next fields to try based on what has been seen before.

One delightful thing about the tail-call parser design is that the CPU's branch predictor effectively takes over the job of guessing. With a tail call parser, the dispatch sequence ends up looking like this:

    cmp    QWORD PTR [rdi+0x8],rsi               # Bounds check
    jbe    .fallback
    movzx  r10d,WORD PTR [rsi]                   # Load two bytes of tag
    mov    eax,r10d
    and    eax,0xf8
    mov    r9,QWORD PTR [rcx+rax*2]              # Load table data
    xor    r9,r10
    mov    rax,QWORD PTR [rcx+rax*2+0x8]         # Load field parser function
    jmp    rax                                   # Tail call to field parser

That "jmp rax" instruction is an indirect branch that can be predicted by the CPU. The CPU effectively guesses for us!

And unlike any kind of guess we might have performed ahead-of-time, the branch predictor will constantly adapt to whatever patterns it is seeing in the data. This is good, because statically guessing ahead-of-time is hard.

mananaysiempre 2 days ago | parent [-]

> This kind of "expected next field" optimization has a long history in protobuf

You could probably even trace the history of the idea all the way to Van Jacobson’s 30-instruction TCP fastpath[1]. Or to go a bit closer, I’ve found that an interpreter for a stack+accumulator VM (which, compared to the pure stack option, is prone to blowing up the bytecode count and thus dispatch cost with the constant PUSH-accumulator instructions) goes significantly faster if you change the (non-shared) dispatch from

  return impl[*pc](pc, ...);
to

  if (*pc == PUSH) {
      do_push(...); pc++;
  }
  return impl[*pc](pc, ...);
which feels somewhat analogous to the next-field optimization and avoids polluting the indirect branch predictor with the very common PUSH predictions. (It’s still slower than not having those PUSHes in the first place.)

[1] https://www.pdl.cmu.edu/mailinglists/ips/mail/msg00133.html

mananaysiempre 3 days ago | parent | prev | next [-]

> More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.

Everything old is new again, I guess—one of the more advertised changes in Microsoft COM as it was maturing (circa 1995) was that you could use data-driven marshalling with “NDR format strings” (bytecode, essentially[1,2]) instead of generating C code. Shortly after there was typelib marshalling (format-compatible but limited), and much later also WinRT’s metadata-driven marshalling (widely advertised but basically completely undocumented).

Fabrice Bellard’s nonfree ASN.1 compiler[3] is also notable for converting schemas into data rather than code, unlike most of its open-source alternatives.

I still can’t help wondering what it is, really, that makes the bytecode-VM approach advantageous. In 1995, the answer seems simple: the inevitable binary bloat was unacceptable for a system that needs to fit into single-digit megabytes of RAM; and of course bytecode as a way to reduce code footprint has plenty of prior art (microcomputer BASICs, SWEET16, Forth, P-code, even as far back as the AGC).

Nowadays, the answer doesn’t seem as straightforward. Sure, the footprint is enormous if you’re Google, but you’re not Google (you’re probably not even Cloudflare), and besides, I hope you’re not Google and can design stuff that’s actually properly adapted to static linking. Sure, the I$ pressure is significant (thinking back to Mike Pall’s explanation[4] why he didn’t find a baseline JIT to be useful), but the bytecode interpreter isn’t going to be a speed demon either.

I don’t get it. I can believe it’s true, but I don’t really feel that I get it.

[1] https://learn.microsoft.com/en-us/windows/win32/rpc/rpc-ndr-...

[2] https://gary-nebbett.blogspot.com/2020/04/rpc-ndr-engine-dce...

[3] https://bellard.org/ffasn1/

[4] https://news.ycombinator.com/item?id=14460027

anonymoushn 2 days ago | parent [-]

In some workloads, you can benefit from paying the latency cost of data dependencies instead of the latency cost of conditional control flow, but there's no general rule here. It's best to try out several options on the actual task and the actual production data distribution.

benreesman 2 days ago | parent | prev | next [-]

Regular expression engines on this model are often called VMs; certainly that's the terminology and layout I used. A big switch/computed-goto loop thing with state codes? Yeah, that's a VM.

I think it's a very illuminating way to describe it. Nicely done with the implementation as well.

pjc50 2 days ago | parent | prev [-]

> This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.

This is a very insightful design. Encoding the parser in a table is an old technique, it's what YACC uses. There's a tradeoff between using the CPU's stack+registers versus having your own stack+state in this kind of work, and people have found situations where a small VM is faster because it gets to stay in the instruction cache and benefit from branch prediction, while the less-predictable data stays in the d-cache.

tonyarkles 3 days ago | parent | prev [-]

Based on what I know about the structure of protobufs internally and without having looked deep into what UPB is doing... I'd guess it could probably be a stack machine that treats (byte)+ as opcodes. Most of the time I'd think of it as parser -> AST -> bytecode, but I think the "grammar" of protobufs would allow your parser to essentially emit terminals as they're parsed straight to the VM as instructions to execute.

UncleEntity 3 days ago | parent [-]

In the couple days since I posted my confusion (threads merged or something) I consulted the daffy robots and figured out how it all works. Also had them come up with a design document for "a specialized compiler and virtual machine architecture for parsing Protocol Buffer messages that achieves significant performance improvements through a novel compilation pipeline combining protobuf-specific AST optimization, continuation-passing style transformations, and tail call interpreter execution."

Interesting times we live in...