| ▲ | amluto 19 hours ago |
There are many systems that take a native data structure in your favorite language and, using some sort of reflection, make an on-disk structure that resembles it. Python pickles and Java’s serialization system are infamous examples, and rkyv is a less alarming one. I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well. Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this. (OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)
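The pickle footgun described above can be shown in a few lines of Python. This is a benign sketch (the payload just records a string into a list instead of running a shell command), but the mechanism is exactly why reflection-driven deserializers are dangerous: the bytes being read choose the callable.

```python
import pickle

ran = []

def record(msg):
    # Stand-in for the damage: a real payload could name
    # os.system and a shell command here instead.
    ran.append(msg)
    return msg

class Payload:
    def __reduce__(self):
        # Whoever writes the bytes picks the callable and its
        # arguments; whoever calls pickle.loads() runs it.
        return (record, ("attacker code ran",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # invokes record(...) during deserialization
```

The same shape of bug is behind the Java `ObjectInputStream` gadget chains and .NET's `BinaryFormatter` deprecation: memory safety doesn't help when deserialization is allowed to dispatch to attacker-chosen code.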
|
| ▲ | pjc50 15 hours ago | parent | next [-] |
| Dotnet used to have BinaryFormatter, which they had to kill for this reason, and they basically recommend that exact set of serializers as a replacement: https://learn.microsoft.com/en-us/dotnet/standard/serializat... |
|
| ▲ | hdjrudni 16 hours ago | parent | prev | next [-] |
| Don't forget PHP's serialize/unserialize, it's also sketchy. Looks like they at least put up a big warning in their docs: https://www.php.net/manual/en/function.unserialize.php Not hating on PHP, to be clear. It has its warts, but it has served me well. |
| |
| ▲ | nchmy 5 hours ago | parent [-] | | igbinary is often a good drop-in replacement for native serialize/unserialize. Faster and smaller. |
|
|
| ▲ | john01dav 4 hours ago | parent | prev | next [-] |
> even a really cool library that makes Rust do this.
The first library that comes to mind when I think of this is `serde` with `#[derive(Serialize, Deserialize)]`, but that produces output in a persistence format, which you describe as preferable to the language-native case. I usually use it with JSON. So, this seems like it may be a false dichotomy.
| |
| ▲ | amluto an hour ago | parent [-] | | Maybe a little bit. But serde works with JSON (among other formats), and you can use it to read and write JSON that interoperates with other libraries and languages just fine. Kind of like how SQLAlchemy looks kind of like you’re writing normal Python code, but it interoperates with SQL. rkyv is not like this. |
|
|
| ▲ | gz09 19 hours ago | parent | prev | next [-] |
> Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto
I think Parquet and Arrow are great formats, but ultimately they have to solve a similar problem to the one rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two). However, it is useful to point out that Parquet/Arrow on top of that solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework after all): well defined data and file format, backward compatibility, bloom filters, run length encoding, compression, indexes, interoperability between languages, etc. etc.
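As a toy illustration of one item on that list, here is a hand-rolled run-length encoder of the kind column stores like Parquet apply per column, so long runs of repeated values collapse to (value, count) pairs. A sketch only; Parquet's actual encoding is a bit-packed/RLE hybrid.

```python
def rle_encode(values):
    # Collapse consecutive repeats into [value, run_length] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    # Expand the pairs back into the original column.
    return [v for v, n in runs for _ in range(n)]

# A sorted/low-cardinality column compresses extremely well this way.
col = ["US"] * 5 + ["DE"] * 2
encoded = rle_encode(col)
```

This is also why such formats care about sort order and column layout: the encoding only wins when equal values sit next to each other.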
|
| ▲ | vlovich123 17 hours ago | parent | prev | next [-] |
Protobuf definitely doesn’t solve the problems described. Capnproto may solve them but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t. It’s like you listed a bunch of serialization technologies without grokking that the problem outlined in the post doesn’t have much to do with rkyv itself.
| |
| ▲ | 8 hours ago | parent | next [-] | | [deleted] | |
| ▲ | zadikian 6 hours ago | parent | prev | next [-] | | Pretty sure protobuf uses a per-field tag to track field presence within a message, similar to the bitmap this article describes. That does have its own overhead you could avoid if you knew all fields were present, but that's not the assumption it makes. | |
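For what it's worth, that presence scheme can be sketched by hand: on the protobuf wire, each present field is prefixed with a varint tag of `(field_number << 3) | wire_type`, and absent fields are simply never written. This toy encoder (not the real protobuf library) reproduces the canonical field-1-equals-150 example from the encoding docs:

```python
def encode_varint(n):
    # Little-endian base-128 varint: 7 payload bits per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_field(field_number, value):
    # wire type 0 = varint; absent fields just emit nothing.
    return encode_varint((field_number << 3) | 0) + encode_varint(value)

# Message with field 1 = 150 and every other field omitted.
msg = encode_field(1, 150)
```

So the per-field cost is one-plus tag bytes when present and zero bytes when absent, which is the overhead/assumption tradeoff described above.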
| ▲ | UqWBcuFx6NV4r 10 hours ago | parent | prev | next [-] | | I have zero doubt that you’re on some ‘no true Scotsman’-style “you’re not doing Real Development if you are using these technologies to solve these problems” thing. Let’s just drop that. There are myriad ‘real man webscale development’ scenarios where these are more than acceptable. | |
| ▲ | imtringued 13 hours ago | parent | prev | next [-] | | Actually, it's you who is giving that impression with an ultra vague "doesn't solve the problems described". The only problem in the blog post is efficient coding of optional fields, and all they did was introduce a bitmap. From that perspective, JSON and XML solve the optional fields problem to perfection, since an absent field costs exactly nothing. | | |
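The "absent field costs nothing" point is easy to demonstrate with stdlib `json`; the flip side, which the replies get at, is that every *present* field spells out its name as a string in every record.

```python
import json

# A record with the optional field present vs. omitted entirely.
full = json.dumps({"id": 1, "nickname": "ed"}, separators=(",", ":"))
sparse = json.dumps({"id": 1}, separators=(",", ":"))

# The sparse record spends zero bytes on the absent field, and a
# reader just sees it as missing.
decoded = json.loads(sparse)
missing = "nickname" not in decoded
```

That self-describing key-per-record overhead is exactly what schema-ful binary formats trade away.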
| ▲ | vlovich123 8 hours ago | parent [-] | | I guess you missed the part where the size of the data stored on disk and efficient deserialization are also critically important performance characteristics that neither JSON nor XML have? Capnproto doesn’t support transform on serialize - the optional fields still take up disk space unless you use the packed representation, which has some performance drawbacks. Also the generated Capnproto Rust code is quite heavy on compile times, which is probably an important consideration when compiling queries. | | |
| ▲ | zadikian 4 hours ago | parent [-] | | Indeed Capnproto is more optimized for serdes time than space usage. |
|
| |
| ▲ | locknitpicker 16 hours ago | parent | prev [-] | | > Protobufs definitely doesn’t solve the problems described. Capnproto may solve it but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t. I'm not sure you are serious. What open problem do you have in mind? Support for persisting and deserializing optional fields? Mapping across data types? I mean, some JSON deserializers support deserializing sparse objects even to dictionaries. In .NET you can even deserialize random JSON objects to a dynamic type. Can you be a little more specific about your assertion? | | |
| ▲ | vlovich123 8 hours ago | parent [-] | | The space overhead and the overhead of serialization/deserialization. Rkyv is zero overhead - it’s random access without needing to deserialize and can even be memory mapped. | | |
| ▲ | cozzyd 2 hours ago | parent | next [-] | | If you care about space, you're almost certainly going to compress your output (unless, like, you're literally storing random noise) and so you'll necessarily have overhead from that. Unless the reason you care about space is because it's some sort of wire protocol for a slow network (like LoRaWAN or Iridium packets or a binary UART protocol), where compression probably doesn't make sense because the compression overhead is too large. But even here, just defining the data layout makes sense, I think. This could take the form of a C struct with __attribute__((packed)) but that is fragile if you care about more platforms than one. (I generally don't, so that works for me!). | |
| ▲ | amluto 7 hours ago | parent | prev [-] | | The whole “zero overhead” thing is IMO a red herring. I care about a few things: stability across versions and languages, space efficiency (sometimes) and performance. I do not care about “overhead” — performance trumps overhead every time. Your deserializer is probably running on a CPU, and that CPU probably has a very fast L1 cache and might be targeted by a compiler that can do scalar replacement of aggregates and such. A non-zero-overhead deserializer can run very quickly and result in the output being streamed efficiently from its source and ending up hot in L1 in a useful format. A zero-overhead deserializer might do messy reads in a bad order without streaming hints and run much slower. And then you get to very, very large records, as in the OP, where getting a good on-disk layout may require thought. And, frequently, the right layout isn’t even array-of-structs, which is why there are so many tools designed to query column stores like Parquet efficiently. | |
| ▲ | zadikian 4 hours ago | parent [-] | | Serdes time can be significant. There are use cases for the zero copy formats even though they use more space. Likewise bit-packed asn1 is often slower than byte-aligned. |
|
|
|
|
|
| ▲ | neilyio 18 hours ago | parent | prev | next [-] |
| Delightful metaphor, I'll be looking everywhere for a chance to use that now! |
|
| ▲ | LtWorf 17 hours ago | parent | prev | next [-] |
| But if you use complicated serialisation formats you can't mmap a file into memory and use it directly. Which is quite convenient if you don't want to parse the whole file and allocate it to memory because it's too large compared to the amount of memory or time you have. |
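That access pattern can be sketched with Python's stdlib, using a made-up fixed record layout (little-endian u32 id plus f64 value, 12 bytes per record): map the file and jump straight to one record without parsing or allocating the rest.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed layout: u32 id + f64 value, no padding.
REC = struct.Struct("<Id")

# Write a file of 1000 records to act as the "too large to parse" input.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(REC.pack(i, i * 0.5))

# Map the file and read exactly one record by offset; only the pages
# actually touched get faulted in, nothing else is parsed.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        rec_id, value = REC.unpack_from(m, 742 * REC.size)
```

This is the property rkyv-style zero-copy formats generalize to nested structures; the catch, per the surrounding thread, is that the layout is then frozen into your format.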
|
| ▲ | imtringued 13 hours ago | parent | prev | next [-] |
| Fully agreed. rkyv looks like something that is hyper optimizing for a very niche case, but doesn't actually admit that it is doing so. The use case here is transient data akin to swapping in-memory data to disk. "However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot." At a first glance, it might sound like rkyv is better, after all, it has less restrictions and external schemas are annoying, but it doesn't actually solve the schema issue by having a self describing format like JSON or CBOR. You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version. |
| |
| ▲ | bombela 41 minutes ago | parent [-] | | > You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version. This seems false after reading the book, the doc, and a cursory reading of the source code. It is definitely independent of rust version. The code make use of repr(C) on struct (field order follows the source code) and every field gets its own alignment (making it independent from the C ABI alignment). The format is indeed portable. It is also versioned. The schema of the user structs is in Rust code. You can make this work across languages, but that's a lot of work and code to support. And this project appears to be in Rust for Rust. On a side note, I find the code really easy to understand and follow. In my not so humble opinion, it is carefully crafted for performance while being elegant. |
|
|
| ▲ | userbinator 16 hours ago | parent | prev [-] |
> and often performance as well
BS. Nothing can be faster than a read()/write() (or even mmap()) into a struct, because everything else would need to do more work.
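Python's `struct` module makes both sides of this exchange concrete: a native-layout format string (`@`) leaves padding and byte order to the compiler/platform, which is the fragility the replies point at, while an explicit little-endian spec (`<`) pins the on-disk layout. A small sketch:

```python
import struct

# Same logical fields (u8 flag + u32 count), two layout policies.
native = struct.calcsize("@BI")  # platform alignment: typically 8 on x86-64
packed = struct.calcsize("<BI")  # explicit little-endian, no padding: 5

# Pinning the layout makes the bytes a defined format rather than
# "whatever my compiler did today".
blob = struct.pack("<BI", 7, 1234)
fields = struct.unpack("<BI", blob)
```

The raw-struct read()/write() only stays fast *and* correct for as long as every reader agrees on that layout, which is exactly the schema problem the rest of the thread is arguing about.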
| |
| ▲ | pjc50 15 hours ago | parent [-] | | Sure, if your structure doesn't contain any pointers and you only ever want to support one endianness and you trust your compiler to fix the machine layout of the struct forever. | | |
| ▲ | zadikian 4 hours ago | parent | next [-] | | Mainly the first thing. If your struct is already serial, of course serialization will be easy. | |
| ▲ | userbinator 3 hours ago | parent | prev [-] | | ...which is true for 99.999% of the time anyway, so it's not worth worrying about. |
|
|