Signing data structures the wrong way

https://blog.foks.pub/posts/domain-separation-in-idl/

Signing data structures the right way | The FOKS Blog

Putting domain separators in the IDL is interesting but you can also avoid the problem by putting the domain separators in-band (e.g. in some kind of "type" field that is always present).

Tangentially, depending on what your input and data model look like, canonicalisation takes O(n log n) time (i.e. the cost of sorting your fields).

Here I describe an alternative approach that produces deterministic hashes without a distinct canonicalization step, using multiset hashing: https://www.da.vidbuchanan.co.uk/blog/signing-json.html

Another Way Not to Sign JSON | Blog
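
To make the order-independence idea concrete, here is a minimal sketch of an additive multiset hash. It is not the construction from the linked post: the modulus, the per-field encoding, and the names `element_hash` and `multiset_hash` are all assumptions chosen for illustration, and it only handles flat objects with scalar values. Plain addition mod 2^256 is open to generalized birthday attacks, so treat this as a demonstration of why no sort is needed, not as something to deploy.

```python
import hashlib
import json

MODULUS = 2**256  # assumed group size, for illustration only


def element_hash(key: str, value) -> int:
    # Length-prefix the key so ("ab", "c") and ("a", "bc") hash differently.
    key_bytes = key.encode("utf-8")
    value_bytes = json.dumps(value).encode("utf-8")
    data = len(key_bytes).to_bytes(8, "big") + key_bytes + value_bytes
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def multiset_hash(fields: dict) -> int:
    # Addition is commutative, so dict iteration order does not matter
    # and no canonical sort of the fields is required.
    total = 0
    for key, value in fields.items():
        total = (total + element_hash(key, value)) % MODULUS
    return total


a = {"name": "TreeRoot", "seq": 7}
b = {"seq": 7, "name": "TreeRoot"}  # same fields, different order
print(multiset_hash(a) == multiset_hash(b))
```

Each field is hashed once and combined in a single pass, so the work is linear in the number of fields.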

majormajor 6d ago

I think a lot of people assume that the "name" of the type, for protos, will be preserved somewhere in the output such that a TreeRoot couldn't be re-used as a KeyRevoke. It makes sense that it isn't - you generally don't want to send that name every time - but it's non-obvious to people with an object-oriented-language background who just think "ah, different types are obviously different types." The serialization cost objection is the one I've most often seen raised against in-band type fields and such, as well, so having a unique identifier that gets used just for signature computation is clever.

What's over my head possibly, from skimming it, about your multiset hashing is how it avoids the "these payloads have the same shape, so one could be re-sent as the other" issue? It seems like a solution to a different problem?

Retr0id 6d ago

Multiset hashing is not related to the domain separation problem, but it is related to the broader "signing data structures" problem.

(I realise my comment reads a bit unclearly, it's basically two separate comments, split after the first paragraph)

kccqzy 6d ago

This is just a mismatch between nominal typing and structural typing. Protobuf is basically structural typing. You can serialize a message defined with one schema and deserialize the result to a message with a different schema if the two schemata are compatible enough. Almost all normal programming languages use nominal typing. If you have `struct A {int a; int b};` it is distinct from `struct B {int a; int b};`.
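
A small sketch of that mismatch, assuming a toy serializer that writes only field values (as a stand-in for a schema-driven wire format; `serialize` is a made-up helper, not a protobuf API). The two classes are distinct to the language but identical on the wire:

```python
import struct
from dataclasses import dataclass


@dataclass
class TreeRoot:
    a: int
    b: int


@dataclass
class KeyRevoke:
    a: int
    b: int


def serialize(msg) -> bytes:
    # Only the field values are written; the class name never reaches the wire.
    return struct.pack(">ii", msg.a, msg.b)


root = TreeRoot(1, 2)
revoke = KeyRevoke(1, 2)

print(root == revoke)                        # nominal: different types
print(serialize(root) == serialize(revoke))  # structural: same bytes
```

A signature computed over `serialize(root)` would therefore also verify over `serialize(revoke)`, which is the confusion the article is about.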

actionfromafar 6d ago

C does too as a language, but it’s fairly easy to slip up at link time or runtime. At some point the types melt away and you sit there with pointers and offsets. Again, it’s not strictly the language’s fault (I think, I’m far from a standards lawyer).

tantalor 6d ago

Since the example was given in proto, I'll suggest a solution in proto: add a message option.

  extend google.protobuf.MessageOptions {
    optional uint64 domain_separator = 1234;
  }

  message TreeRoot { option (domain_separator) = 4567; ... }

lukev 6d ago

So, isn't this a rather long-winded way to say that a signature only extends to the scope of the message it contains?

It doesn't matter if I sign the word "yes", if you don't know what question is being asked. The signature needs to include the necessary context for the signature to be meaningful.

Lots of ways of doing that, and you definitely need to be thoughtful about redundant data and storage overhead, but the concept isn't tricky.

maxtaco 6d ago

Hi, post author here. Agree that the idea isn't tricky, but it seems like many systems still get it wrong, and there wasn't an available system that had all the necessary features. I've tried many of them over the years -- XDR, JSON, Msgpack, Protobufs. When I sat down to write FOKS using protobufs, I found myself writing down "Context Strings" in a separate text file. There was no place for them to go in the IDL. I had worked on other systems where the same strategy was employed. I got to thinking, whenever you need to write down important program details in something that isn't compiled into the program (in this case, the list of "context strings"), you are inviting potentially serious bugs due to the code and documentation drifting apart, and it means the libraries or tools are inadequate.

I think this system is nice because it gives you compile-time guarantees that you can't sign without a domain separator, and you can't reuse a domain separator by accident. Also, I like the idea of generating these things randomly, since it's faster and scales better than any other alternative I could think of. And it even scales into some world where lots of different projects are using this system and sharing the same private keys (not a very likely world, I grant you).
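
Roughly what those two guarantees could look like, sketched in Python rather than in the FOKS IDL. Python has no compile step, so a check at class-definition time stands in for the compile-time check; the decorator `signable`, the registry, and the separator values below are all hypothetical.

```python
import secrets

_REGISTRY = {}  # separator -> type name, filled in as classes are defined


def new_separator() -> int:
    # Run once when a type is first written, then paste the value into the schema.
    return secrets.randbits(64)


def signable(separator: int):
    """Class decorator: attach a separator and refuse duplicates."""
    def wrap(cls):
        if separator in _REGISTRY:
            raise ValueError(
                f"separator {separator:#x} already used by {_REGISTRY[separator]}"
            )
        _REGISTRY[separator] = cls.__name__
        cls.DOMAIN_SEPARATOR = separator
        return cls
    return wrap


@signable(0x9F3A6C1E5B7D2048)  # made-up value
class TreeRoot:
    pass


@signable(0xC4E18A07D3F95B62)  # made-up value
class KeyRevoke:
    pass


def separator_of(msg) -> int:
    # Raises AttributeError for any type that was never marked signable,
    # so a signing helper built on this cannot run without a separator.
    return type(msg).DOMAIN_SEPARATOR
```

With 64 random bits per type, accidental collisions between independently chosen separators are very unlikely, and the registry turns a copy-paste reuse within one program into an immediate error.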

cogman10 6d ago

Why not digest the type as part of the hash? This avoids the problem in the article and keeps the transmission size small.

maxtaco 6d ago

It should be possible to change the name of the type, and this happens often in practice. But type renames shouldn't break preexisting signatures. In this scheme you are free to change the type name, and preexisting signatures still verify with new code -- of course as long as you never change the domain separator, which you never should do. Also you'd need to worry about two different projects reusing the same type name. Lastly, the transmission size in this scheme remains unaffected since the domain separators do not appear in the serialized data. Rather, both sides agree on them via the protocol specification.

actionfromafar 6d ago

That’s easily addressed. We just need a global immutable registry of types, their names, an alias list and revocation list. ;-)

We can let one be managed by ICANN and the others various competing offerings on ETH.

tennysont 6d ago

They use a magic number, rather than a digest derived from the schema[1], but otherwise they do as you suggest. The magic number is given to the signing function (sender side) and the validation function (receiver side) but does not increase the size of the transmitted message.

[1] I think that's what you mean by digest, but maybe you just mean `type` = `magic number`
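
A minimal sketch of that flow, with HMAC-SHA256 standing in for a real signature scheme and a fixed 8-byte encoding of the separator. The function names and the second separator value are assumptions; 4567 just echoes the proto example elsewhere in this thread.

```python
import hashlib
import hmac

TREE_ROOT_SEP = 4567   # echoes the proto example in this thread
KEY_REVOKE_SEP = 8910  # hypothetical


def sign(key: bytes, separator: int, payload: bytes) -> bytes:
    # The separator is mixed into the MAC input but never added to the payload.
    # A fixed-width prefix keeps the separator/payload boundary unambiguous.
    prefix = separator.to_bytes(8, "big")
    return hmac.new(key, prefix + payload, hashlib.sha256).digest()


def verify(key: bytes, separator: int, payload: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sign(key, separator, payload), tag)


key = b"demo key, not for real use"
payload = b"\x00\x01\x02\x03"
tag = sign(key, TREE_ROOT_SEP, payload)
wire = payload + tag  # what actually travels: no separator bytes

print(verify(key, TREE_ROOT_SEP, payload, tag))   # receiver expects a TreeRoot
print(verify(key, KEY_REVOKE_SEP, payload, tag))  # same bytes replayed as a KeyRevoke
```

The receiver picks the separator from the type it expects at that point in the protocol, so the same bytes offered as a different type fail verification without costing any extra bytes on the wire.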

efitz 6d ago

When my data structures are messages to be sent over a network, I always start with msgId and msgLen, both fixed-width fields.

This solves the message differentiation problem explicitly, makes security and memory management easier, and reduces routing to:

switch(msg.msgId):
…
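
For illustration, one way such a header could look in Python. The field widths (2-byte msgId, 4-byte msgLen, big-endian), the id values, and the handler strings are assumptions, not taken from the comment above.

```python
import struct

HEADER = struct.Struct(">HI")  # assumed layout: 2-byte msgId, 4-byte msgLen

MSG_TREE_ROOT = 1   # hypothetical ids
MSG_KEY_REVOKE = 2


def encode(msg_id: int, body: bytes) -> bytes:
    return HEADER.pack(msg_id, len(body)) + body


def decode(frame: bytes):
    if len(frame) < HEADER.size:
        raise ValueError("short header")
    msg_id, msg_len = HEADER.unpack_from(frame)
    body = frame[HEADER.size:HEADER.size + msg_len]
    if len(body) != msg_len:
        raise ValueError("truncated body")
    return msg_id, body


def route(frame: bytes) -> str:
    msg_id, body = decode(frame)
    handlers = {
        MSG_TREE_ROOT: lambda b: f"tree root ({len(b)} bytes)",
        MSG_KEY_REVOKE: lambda b: f"key revoke ({len(b)} bytes)",
    }
    if msg_id not in handlers:
        raise ValueError(f"unknown msgId {msg_id}")
    return handlers[msg_id](body)


print(route(encode(MSG_TREE_ROOT, b"abc")))
```

For this to help with the signing problem in the article, the signature would need to cover the whole frame including the header, so that the msgId cannot be swapped after signing.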