locknitpicker 19 hours ago

> Protobufs definitely doesn’t solve the problems described. Capnproto may solve it but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t.

I'm not sure you are serious. What open problem do you have in mind? Support for persisting and deserializing optional fields? Mapping across data types? I mean, some JSON deserializers support deserializing sparse objects even to dictionaries. In .NET you can even deserialize random JSON objects to a dynamic type.

Can you be a little more specific about your assertion?

vlovich123 11 hours ago | parent [-]

The space overhead and the overhead of serialization/deserialization. Rkyv is zero-overhead: it gives you random access without a deserialization step and can even be memory-mapped.

cozzyd 4 hours ago | parent | next [-]

If you care about space, you're almost certainly going to compress your output (unless, like, you're literally storing random noise) and so you'll necessarily have overhead from that.

Unless the reason you care about space is because it's some sort of wire protocol for a slow network (like LoRaWAN or Iridium packets or a binary UART protocol), where compression probably doesn't make sense because the compression overhead is too large. But even here, just defining the data layout makes sense, I think.

This could take the form of a C struct with __attribute__((packed)), but that is fragile if you care about more than one platform. (I generally don't, so it works for me!)

amluto 9 hours ago | parent | prev [-]

The whole “zero overhead” thing is IMO a red herring. I care about a few things: stability across versions and languages, space efficiency (sometimes) and performance. I do not care about “overhead” — performance trumps overhead every time.

Your deserializer is probably running on a CPU, and that CPU probably has a very fast L1 cache and might be targeted by a compiler that can do scalar replacement of aggregates and such. A non-zero-overhead deserializer can run very quickly and result in the output being streamed efficiently from its source and ending up hot in L1 in a useful format. A zero-overhead deserializer might do messy reads in a bad order without streaming hints and run much slower.

And then you get to very large records, as in the OP, where a good on-disk layout may require thought. And, frequently, the right layout isn't even array-of-structs, which is why there are so many tools designed to query column stores like Parquet efficiently.

zadikian 6 hours ago | parent [-]

Serdes time can be significant. There are use cases for the zero-copy formats even though they use more space. Likewise, bit-packed ASN.1 is often slower than byte-aligned.