-
Notifications
You must be signed in to change notification settings - Fork 65
simple_serialise - not possible to parse the structure without type information #94
Comments
Hi @AlexeyAkhunov, thanks for the input. I think it's fine that you cannot parse a SSZ message without already knowing its intended type. We're sending The message on the network is going to have some byte(s) to identify the message type as a block , attestation record, etc. Off the top of my head, accidentally mislabeling one of these objects and having it de-serialize as a valid, consensus-influencing object of a different type would either be impossible or similar to finding several hash collisions. I think it's important to note that there are at least two different requirements for serialization we're discussing:
What I have been discussing relates to consensus serialization. I think you're discussing p2p serialization @AlexeyAkhunov which could be distinct to consensus serialization, but I don't think there are many gains to this -- especially considering that one of the first things a node will do with some block when they receive it is hash it according its consensus serialization. There's also a serialization format that will be involved with RPC endpoints in the node and I think that should be something like protobuf. However, that's not really being discussed at this stage. |
This is true for now, but probably not in the long term. Additional types from the top of my head:
|
The alternatives we considered all have their trade-offs:
I'm not familiar with the field but network analysis tool should allow plugins, so we could create add that to a bounty/wishlist: SSZ plugins for common open-source network analysis tools. I.e. have a robust base layer and build tooling on top. |
@paulhauner Thank you for your comment! I had an impression from the Eth 2.0 call on 13.09.2018 that simple_serialise had been proposed for p2p serialisation (since p2p serialisation was on the agenda). As for consensus serialisation (serialise things before they get hashed and signed - and I assume you are saying that deserialisation will not actually happen), one of the main requirements there is to be able to avoid excessive dynamic memory allocations during the serialisation. To achieve that, serialisers might want to compute the resulting serialisation size efficiently, before starting the serialisation. Or, alternatively, if we are using sponge-based hash functions (like Keccak) that can consume streams of data as input. In the latter case, prepending the serialisation with the length prefix actually makes it harder, because you cannot start the stream until you have computed the whole serialised string. Therefore, if simple_serialise is to be used for one-way serialisation (input to hash functions, therefore no deserialisation is required), then I suggest dropping length prefixes, because that would allow streaming input to the sponge hash functions like Keccak |
@mratsim Yes, I agree: make schema mandatory serialisation/deserialisation has many problematic consequences. It is not memory efficient, as you said, it precludes generic analysis tools and libraries, it introduces potential for mis-interpretation, and makes the consideration of backwards compatibility much harder. And it also mingles syntax and semantics of messages into one, inseparable into different layers of codes/decoders |
RLP is actually good serialisation format for wire encoding. I do not agree that it is overly complicated - but I have seen some implementations that make it look complicated. I do agree that RLP is not great to generate input for hashing. Therefore, I suggest using RLP for wire messages, but for making inputs for hashes use simple serialisation streams without length prefixes (because length prefixes force us to buffer input instead of simple start hashing it). Since no-one is going to undo the hashes and deserialise, it is not important to have length prefixes |
@paulhauner Do you have the benchmark of SSZ v.s. RLP? |
Do we want to support two serializations? Does the outcome of introducing e.g. SSZ is so big to maintain both RLP and SSZ or ProtoBuf and SSZ? |
@jannikluhn good point about future data types. Thanks. @AlexeyAkhunov no worries, happy to talk about this. With regards to the ssz length prefixes, I don't see a need to remove them before we hash. I think the entire SSZ byte-array should be hashed. @hwwhww we updated the serialization_sandbox notes with RLP message sizes :) @mkalinin I am in favor of using SSZ for both p2p serialization and consensus serialization because it's pretty small and I think one of the very first things we want to do with some block/attestation record when we get it off the wire is hash it. Having two formats would require some translation. I'm totally open to discussion on this though. |
We are currently using RLP as it has already been implemented in EthereumJ. I agree that RLP is over complicated in terms of implementation and open for changing it. But there is no sense to me in supporting several serializations, it sounds even more complicated than implementing RLP. |
@mkalinin I totally agree. |
Hi @AlexeyAkhunov, I didn't make the connection between yourself and TurboGeth earlier. Knowing this, I'll give your proposal much more thought as your knowledge on the subject is more extensive than mine. |
The two arguments in favor of using the same serialization format for networking and hashing are
I'm wondering if both arguments are valid if we go with the libp2p daemon: I assume libp2p has its own transport serialization format and considering that it's implemented in Go this should also not be our bottleneck. It's also fully abstracted away from us, so it doesn't affect simplicity of the Eth2.0 client itself. So the relevant question becomes which serialization format do/should we use to communicate with the daemon? There we have of course different trade offs, in particular we don't need to care about size so much as the messages aren't transmitted via the network. @raulk Has this been specified already? |
@AlexeyAkhunov could you, please, outline your thoughts on what requirements both algorithms should satisfy? Maybe you even have some algorithms to take a look at? It would help to get why SSZ is not suitable for network serialization from your point of view and raise further discussion on alternatives. |
@jannikluhn libp2p is agnostic of the wire format of messages. The control API of the daemon does use protobuf, but that should be well encapsulated inside the binding, e.g. py-libp2p-binding. In other words, it's only used to exchange commands and control signals like: "open connection", "open stream", "new incoming stream", DHT operations, etc. between the binding and the daemon. The application itself (Eth2.0 implementations) can select whatever wire format is best for them. The daemon just pipes raw reads and writes done on the local unix socket representing the stream, into reads and writes on the underlying libp2p stream. |
Thanks for clearing that up, @raulk! Disregard my post then as we still have to decide on a wire format by ourselves. |
IMHO the network should relay consensus objects without re-encoding them. These objects can be represented as byte arrays in any serialization format. A network message defined as a protocol buffer can include a 'block' object as a blob. To solve the issue that started this discussion, it would be best to define consensus objects as an ssz (or rlp) blob with a known header including a version number, e.g. a block would be encoded as "block-v1-..." with ... being the ssz blob. |
Due to the special encoding of 'hash32' and 'address' without any prefix, it will not be possible to parse any simple_serialise messages without knowing the type. This might seem to be OK, but precludes writing of generic parsers that do not depend on the type being parsed. The consequence that any tools/libraries that look at traffic and analyse it has to be type-dependent and will have to be updated any time types change. That will create unnecessarily tight coupling between serialisation format and the tooling.
Also, list and 'type' encodings look like a length-prefixed blob, and without type information it is impossible to see how many elements/fields they have and the boundaries of these elements/fields.
RLP does not have this drawback, for example. But it has its own issues, of course.
Another concern is that serialised messages can mean different things depending on the schema. That is more of a safety/security concern - if we have software of two slightly incompatible version, we can have very hard to find bugs due to misinterpreted messages. Generally, it should be possible to have a high assurance that changing the schema will not successfully parse any messages that were produced by a different schema (or at least a schema of the previous version).
The text was updated successfully, but these errors were encountered: