Putting domain separators in the IDL is interesting, but you can also avoid the problem by putting them in-band (e.g. in some kind of "type" field that is always present).
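A rough sketch of the in-band idea, assuming a JSON payload and using HMAC as a stand-in for a real signature scheme. The key, the `sign` helper and the field names are all made up for illustration:

```python
import hashlib
import hmac
import json

KEY = b"demo-key"  # hypothetical key, illustration only


def sign(msg_type: str, fields: dict) -> bytes:
    # The "type" field is written last so a caller-supplied "type" cannot override it.
    # Because it is part of the signed bytes, it acts as an in-band domain separator.
    payload = json.dumps(
        {**fields, "type": msg_type}, sort_keys=True, separators=(",", ":")
    ).encode()
    return hmac.new(KEY, payload, hashlib.sha256).digest()


# Two messages with identical fields but different declared types.
fields = {"id": 7, "ts": 1700000000}
tree_root_sig = sign("TreeRoot", fields)
key_revoke_sig = sign("KeyRevoke", fields)
```

The two tags differ even though every other field is the same, so a tag issued for one type does not validate the other.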

Tangentially, depending on what your input and data model look like, canonicalisation can take O(n log n) time (i.e. the cost of sorting your fields).

Here I describe an alternative approach that produces deterministic hashes without a distinct canonicalization step, using multiset hashing: https://www.da.vidbuchanan.co.uk/blog/signing-json.html
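To give a flavour of what "multiset hashing" means, here is a toy additive construction. It is not necessarily what the linked post uses, and `multiset_hash` and `encode_field` are names invented for this sketch. I believe plain addition modulo a power of two is weaker than the digest size suggests (generalized-birthday style attacks), so treat it as an illustration only:

```python
import hashlib

MOD = 2**256


def multiset_hash(items) -> int:
    # Hash each element on its own and add the digests modulo 2**256.
    # Addition is commutative, so element order cannot affect the result.
    total = 0
    for item in items:
        digest = hashlib.sha256(item).digest()
        total = (total + int.from_bytes(digest, "big")) % MOD
    return total


def encode_field(key: str, value: str) -> bytes:
    # Length-prefix the key so the key/value boundary is unambiguous.
    k, v = key.encode(), value.encode()
    return len(k).to_bytes(4, "big") + k + v


h1 = multiset_hash(encode_field(k, v) for k, v in [("a", "1"), ("b", "2")])
h2 = multiset_hash(encode_field(k, v) for k, v in [("b", "2"), ("a", "1")])
h3 = multiset_hash(encode_field(k, v) for k, v in [("a", "1"), ("b", "3")])
```

`h1` and `h2` come out equal without any sorting step, while `h3` differs because one value changed.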


majormajor 11 hours ago | parent [-]

I think a lot of people assume that the "name" of the type, for protos, will be preserved somewhere in the output, such that a TreeRoot couldn't be reused as a KeyRevoke. It makes sense that it isn't (you generally don't want to send that name every time), but it's non-obvious to people with an object-oriented-language background who just think "ah, different types are obviously different types." Serialization cost is also the objection I've most often seen raised against in-band type fields and the like, so having a unique identifier that gets used just for signature computation is clever.
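As I understand the idea, it looks roughly like this. The identifiers in `DOMAIN`, the key and the helper names are hypothetical, and HMAC again stands in for a real signature:

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key, illustration only

# Hypothetical fixed-length per-type identifiers. They are never sent on the wire.
DOMAIN = {"TreeRoot": b"\x00\x00\x00\x01", "KeyRevoke": b"\x00\x00\x00\x02"}


def sign(type_name: str, wire_bytes: bytes) -> bytes:
    # The identifier is prepended only when computing the tag. A fixed length
    # keeps the boundary between identifier and payload unambiguous.
    return hmac.new(KEY, DOMAIN[type_name] + wire_bytes, hashlib.sha256).digest()


def verify(type_name: str, wire_bytes: bytes, sig: bytes) -> bool:
    # The verifier supplies the type it expects, not one read from the message.
    return hmac.compare_digest(sign(type_name, wire_bytes), sig)


wire = b"\x08\x07\x10\x2a"  # stand-in bytes that both message types could decode
sig = sign("TreeRoot", wire)
ok_as_tree_root = verify("TreeRoot", wire, sig)
ok_as_key_revoke = verify("KeyRevoke", wire, sig)
```

The payload on the wire is unchanged, but the tag only verifies when the receiver expects the same type the signer used.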

What's possibly over my head, from skimming it, is how your multiset hashing avoids the "these payloads have the same shape, so one could be re-sent as the other" issue. It seems like a solution to a different problem?


kccqzy 9 hours ago | parent | next [-]

This is just a mismatch between nominal typing and structural typing. Protobuf is basically structural typing. You can serialize a message defined with one schema and deserialize the result to a message with a different schema if the two schemata are compatible enough. Almost all normal programming languages use nominal typing. If you have `struct A {int a; int b};`, it is distinct from `struct B {int a; int b};`.
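A small Python illustration of the same mismatch, using `struct.pack` as a stand-in for a wire format that does not record the type name (this is not protobuf's actual encoding, and `serialize` is a made-up helper):

```python
import struct
from dataclasses import astuple, dataclass


@dataclass
class A:
    a: int
    b: int


@dataclass
class B:
    a: int
    b: int


def serialize(msg) -> bytes:
    # Only the field values are written; the class name never reaches the bytes.
    return struct.pack("<ii", *astuple(msg))


x, y = A(1, 2), B(1, 2)
nominally_equal = x == y                   # dataclass equality also compares the class
same_bytes = serialize(x) == serialize(y)  # the serialized forms are indistinguishable
decoded = B(*struct.unpack("<ii", serialize(x)))  # bytes written from an A, read back as a B
```

In the language the two values are different types, but once serialized nothing distinguishes them, so bytes produced from an `A` decode cleanly as a `B`.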

	actionfromafar 9 hours ago | parent [-]
		C does too, as a language, but it’s fairly easy to slip up at link time or runtime. At some point the types melt away and you sit there with pointers and offsets. Again, it’s not strictly the language’s fault (I think; I’m far from a standards lawyer).


Retr0id 10 hours ago | parent | prev [-]

Multiset hashing is not related to the domain separation problem, but it is related to the broader "signing data structures" problem.

(I realise my comment reads a bit unclearly; it's basically two separate comments, split after the first paragraph.)