kentonv 7 days ago

Previous discussions:

* https://news.ycombinator.com/item?id=18188519

* https://hn.algolia.com/?q=%22Protobuffers+Are+Wrong%22

I guess I'll, once again, copy/paste the comment I made when this was first posted: https://news.ycombinator.com/item?id=18190005

--------

Hello. I didn't invent Protocol Buffers, but I did write version 2 and was responsible for open sourcing it. I believe I am the author of the "manifesto" entitled "required considered harmful" mentioned in the footnote. Note that I mostly haven't touched Protobufs since I left Google in early 2013, but I have created Cap'n Proto since then, which I imagine this guy would criticize in similar ways.

This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering. Type theory is a lot of fun to think about, but being simple and elegant from a type theory perspective does not necessarily translate to real value in real systems. Protobuf has undoubtedly, empirically proven its real value in real systems, despite its admittedly large number of warts.

The main thing that the author of this article does not seem to understand -- and, indeed, many PL theorists seem to miss -- is that the main challenge in real-world software engineering is not writing code but changing code once it is written and deployed. In general, type systems can be both helpful and harmful when it comes to changing code -- type systems are invaluable for detecting problems introduced by a change, but an overly-rigid type system can be a hindrance if it means common types of changes are difficult to make.

This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.

I don't have time to address all the author's points, so let me choose a few that I think are representative of the misunderstanding.

> Make all fields in a message required. This makes messages product types.

> Promote oneof fields to instead be standalone data types. These are coproduct types.

This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility. This has proven -- in real practice, not in theory -- to be an extremely powerful way to allow protocol evolution. It allows developers to build new features with minimal work.
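
To sketch what that looks like (hypothetical message and field names; two revisions of the same .proto file shown together):

    // Revision 1, as originally deployed:
    message Person {
      optional string name = 1;
    }

    // Revision 2: a new field gets a fresh tag number. Old binaries skip
    // tag 2 when parsing (and can preserve it as an unknown field); new
    // binaries simply see the field as unset when reading old data.
    message Person {
      optional string name = 1;
      optional string email = 2;
    }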

Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.
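
A sketch of how that plays out, with hypothetical names in proto2 syntax:

    message LogEntry {
      required string user_id = 1;  // seemed obviously required at first
      optional int64 timestamp = 2;
    }

    // Later, anonymous entries become legitimate and user_id no longer
    // makes sense. But old binaries reject any message that omits a
    // required field at parse time, so every writer is stuck filling
    // it with dummy data forever.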

The author dismisses this later on:

> What protobuffers are is permissive. They manage to not shit the bed when receiving messages from the past or from the future because they make absolutely no promises about what your data will look like. Everything is optional! But if you need it anyway, protobuffers will happily cook up and serve you something that typechecks, regardless of whether or not it's meaningful.

In real world practice, the permissiveness of Protocol Buffers has proven to be a powerful way to allow for protocols to change over time.

Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but on the basis of real-world use, not perceived elegance and simplicity.

> oneof fields can't be repeated.

(background: A "oneof" is essentially a tagged union -- a "sum type" for type theorists. A "repeated field" is an array.)
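
In schema syntax, with hypothetical field names, the two constructs look like this:

    message SearchResult {
      // A oneof: at most one of these fields is set at a time.
      oneof content {
        string url = 1;
        bytes thumbnail = 2;
      }
      // A repeated field: zero or more values, i.e. an array.
      repeated string keywords = 3;
    }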

Two things:

1. It's that way because the "oneof" pattern long predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in. Lots of protocols used this pattern before I added "oneof" to the language, and I wanted those protocols to be able to upgrade to the new construct without breaking compatibility.
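
Concretely, the upgrade looks like this (hypothetical names, with Circle and Square assumed to be defined elsewhere; two revisions of the same message shown together):

    // The old pattern: several optional fields, exactly one set by convention.
    message Shape {
      optional Circle circle = 1;
      optional Square square = 2;
    }

    // The same message after adopting oneof. The tag numbers are unchanged,
    // and oneof doesn't alter the wire encoding, so old and new binaries
    // remain compatible.
    message Shape {
      oneof kind {
        Circle circle = 1;
        Square square = 2;
      }
    }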

You might argue that this is a side-effect of a system evolving over time rather than being designed, and you'd be right. However, there is no such thing as a successful system which was designed perfectly upfront. All successful systems become successful by evolving, and thus you will always see this kind of wart in anything that works well. You should want a system that thinks about its existing users when creating new features, because once you adopt it, you'll be an existing user.

2. You actually do not want a oneof field to be repeated!

Here's the problem: Say you have your repeated "oneof" representing an array of values where each value can be one of 10 different types. For a concrete example, let's say you're writing a parser and they represent tokens (number, identifier, string, operator, etc.).

Now, at some point later on, you realize there's some additional piece of data you want to attach to every element. In our example, it could be that you now want to record the original source location (line and column number) where the token appeared.

How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.

In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
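
In schema terms, with hypothetical names, the future-proof version is:

    message Token {
      oneof value {
        double number = 1;
        string identifier = 2;
        string operator = 3;
      }
      // Added later, without breaking compatibility: old readers skip
      // it, old writers simply never set it.
      optional SourceLocation location = 4;
    }

    message SourceLocation {
      optional int32 line = 1;
      optional int32 column = 2;
    }

    message TokenStream {
      repeated Token tokens = 1;  // repeated message, each containing a oneof
    }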

The author's complaints about several other features have similar stories.

> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?

> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.

OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.

krullin 6 days ago | parent | next

To me, it seems that version-change safety and the usefulness of the generated code constitute a design tradeoff: if you mark a field as required, then the generated data structures can skip using Option/pointers, and this very common form of validation comes for free. If you disallow marking a field as required, then all fields must be checked for existence, even ones required for the system to function, which is quite a burden and leads developers to write their own types anyway as a place to put their validated data. If data is required to be present for an app to function, why can't I be given the tools to express this, and benefit from the constraints applied to the data model?
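
For example, with a hypothetical proto3 schema:

    syntax = "proto3";

    message Order {
      // The app can't function without this, but the schema can't say so:
      // an absent value just decodes as "", and every consumer has to
      // validate it by hand.
      string order_id = 1;

      // Genuinely optional; presence checking is appropriate here.
      optional string coupon_code = 2;
    }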

Most of the time when I would like to use a schema-driven, efficient data format and code generation tool, the data contract doesn't change frequently. And when it does, assuming it's a backwards-incompatible change, I think I would be happy to generate a MyDataV2 message along with a GetMyDataV2 method, allow existing clients to keep using the original version, and allow new or existing clients to adopt the new structures at their leisure. Meanwhile, everyone who shares my schema gets much more idiomatic generated code, and in the most common cases won't have to write their own data types or be stuck with a bunch of `if data.x != null {` statements.
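
A sketch of that approach, with the service and field names made up around the MyDataV2/GetMyDataV2 names above:

    syntax = "proto3";

    message GetMyDataRequest {
      string id = 1;
    }

    message MyData {
      string x = 1;
    }

    // The backwards-incompatible revision lives alongside the original.
    message MyDataV2 {
      int64 x = 1;  // e.g. the field's type changed
    }

    service MyDataService {
      rpc GetMyData(GetMyDataRequest) returns (MyData);      // existing clients
      rpc GetMyDataV2(GetMyDataRequest) returns (MyDataV2);  // new clients
    }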

Protobufs are an amazing tool, but I think there is a need for a simpler tool which supports a restricted set of use cases cleanly and allows for wider expression of data models.

palata 7 days ago | parent | prev | next

> I guess I'll, once again, copy/paste the comment I made when this was first posted

I had missed it those other times, and it's super interesting. So thank you for copy/pasting it once again :-).

instig007 6 days ago | parent | prev

> Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time

how often? as practiced by who, and where?

> 2. You actually do not want a oneof field to be repeated!

> How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.

Nice, "explain to me how you're going to implement a backward-compatible SUM in the spec-parser that doesn't have the notions needed. Ha! You can't! Told you so!"

> But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.

Not really. `oneof token` is isomorphic to `oneof (token unit)`, and going from the former to the latter doesn't require any change to the binary encoding at all, if the encoding is optimal. Going from `oneof (token unit)` to `oneof (token { linepos })`, depending on the binary encoding format you design, doesn't require changes to the parser's runtime either, as long as the parser takes into account that `unit` is isomorphic to the zero-arity product `{}`. Since both `{}` and `{ linepos }` can be represented with fixed positional addressing, you get your values back in a backward-compatible way, under one specific condition: the parser library API provides `repeated (oneof <A>)` as a non-materialised stream of values <A>, so that the exact interpretation of <A> happens at the user's calling site, according to the stated protocol spec. If the spec says `<A> = token`, then `list (repeated (oneof (token { linepos })))` is isomorphic to `list (repeated (oneof token))` in the deployed version of the protocol that knows nothing about line positions, so my endpoints can send you any of:

    * Version 0: [len][oneof_bincode][token_arr]

    * Version 1: [len][oneof_bincode_sum][token_arr][unit]

    * Version 2: [len][oneof_bincode_sumprod][token_arr][prod_arr]

    * Version 3: [len][oneof_bincode_sumprod_sparse][token_arr][presence_arr][prod_arr]

kentonv 6 days ago | parent

> how often? as practiced by who, and where?

This was my experience in Google Search infrastructure circa 2005-2010. This was a system with dozens of teams and hundreds of developers all pushing their data through a common message bus.

It happened all the damned time and caused multiple real outages (from overzealous validation), along with a lot of tech debt from having to initialize fields with dummy data because they weren't used anymore but were still required.

Reports from other large teams at Google, e.g. Gmail, indicated they had the same problems.

> Nice, "explain to me how you're going to implement a backward-compatible SUM in the spec-parser that doesn't have the notions needed. Ha! You can't! Told you so!"

Sure sure, we could expand the type system to support some way of adding a new tag to every element of the repeated oneof, implement it, teach everyone how that works, etc.

Or we could just tell people to wrap the thing in a `message`. It works fine already, and everyone already understands how to do it. No new cognitive load is created, and no time is wasted chasing theoretical purity that provides no actual real-world benefit.

instig007 4 days ago | parent

> This was my experience in Google Search infrastructure circa 2005-2010 [...]

> Reports from other large teams at google

> teach everyone how that works, etc.

> Or we could just tell people to wrap the thing in a `message`

It really sounds like a self-inflicted internal Google issue. Can you address the part where I mention the isomorphism of (oneof token) and (oneof (token {})), and clarify what exactly you think you'd have to teach other engineers to do, if your protocol's encoders and decoders took this property into account?

kentonv 4 days ago | parent

You seem to have merged the required fields issue and the oneof issue, but these are unrelated threads.

> Can you address the part where I mention the isomorphism of (oneof token) and (oneof (token {})), and clarify what exactly you think you'd have to teach other engineers to do, if your protocol's encoders and decoders took this property into account?

What you have written is not a serious proposal in terms of a working way to extend Protocol Buffers to allow repeated oneofs.

What you have written is a very complicated way of saying: "You theoretically could support extensible repeated oneofs, with the right type system and protocol design."

Yes, I know that. With a clean slate, we can do anything. But in the real world (yes I'm going to keep saying that, since you don't seem very familiar with it), you don't get to start from a clean slate every time you don't like how things have turned out.

As it stands, the product type in protobufs is `message`, the sum type is `oneof`, and the vector type is `repeated`. The way `oneof` is encoded on the wire is that exactly one of the tags appears. The way `repeated` is encoded on the wire is that the same tag appears many times. The way `message` is encoded is as a length-delimited byte blob that contains a series of tag-value pairs inside. Unfortunately, this encoding means that if we supported `repeated oneof`, it would not be extensible.
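
To make that concrete, here's a sketch with hypothetical fields (proto2 syntax; encoded bytes shown in comments):

    message Value {
      oneof kind {
        int32 num = 1;   // num=42  encodes as: 08 2A (tag 1/varint, then 42)
        string str = 2;  // str="h" encodes as: 12 01 68 (tag 2/len-delim, len 1, 'h')
      }
    }

    message Numbers {
      // [1, 2] encodes (unpacked) as: 08 01 08 02 (the same tag, twice)
      repeated int32 num = 1;
    }

    message Wrapper {
      // Encodes as: 0A <len> <Value bytes>. The nested length-delimited
      // blob gives each element its own tag namespace, which is why
      // "repeated message containing a oneof" stays extensible while a
      // bare "repeated oneof" would not.
      optional Value value = 1;
    }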

So we ban `repeated oneof`, and say "you need to write a repeated message, where the message contains a `oneof`". This isn't as pretty as people might like but it works just fine in practice and we move on to more important things.

instig007 4 days ago | parent

> What you have written is not a serious proposal in terms of a working way to extend Protocol Buffers to allow repeated oneofs.

I didn't intend to propose a solution for protobuf specifically. I explained why the author of the article had a point in calling the authors of protobuf amateurs: the existing spec led to specific parser implementations, with the respective downsides.

> But in the real world (yes I'm going to keep saying that, since you don't seem very familiar with it)

I'll have to repeat that the "real-world vs. the rest of you" talking point is the specific attitude of (ex-)Google folks that makes them look amateur or, at least, ignorant.

> This isn't as pretty as people might like but it works just fine in practice and we move on to more important things.

That doesn't explain why you didn't implement it differently; you just stated that you did. So, why didn't you implement it differently, if, as you admit a few lines above, "With a clean slate, we can do anything"?