Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate topology after restart if finality is lagging behind current session #6913

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

alexggh
Copy link
Contributor

@alexggh alexggh commented Dec 16, 2024

There is a small issue on restart, where if finality is lagging across session boundary and the validator restarts, then the validator won't be able to contribute anymore with assginments/approvals and gossiping for the blocks from the previous session, because after restart it builds the Topology only for the new session, so without a topology it won't be able to distribute assignments and approvals because everything in approval-distribution is gated on having a topology for the block.

The fix is to also keep track of the last finalized block and its session and if it is different from the list of encountered sessions, build its topology and send it to the rest of subsystems.

@alexggh alexggh requested review from ordian, sandreim and AndreiEres and removed request for ordian December 16, 2024 16:20
@alindima
Copy link
Contributor

How will this work for subsystems like statement-distribution and bitfield-ditribution which need to keep backing new candidates on the latest session? AFAIU, with this PR, they would receive the topologies of the old session that has some unfinalized blocks, which I expect will break them

@alexggh
Copy link
Contributor Author

alexggh commented Jan 6, 2025

How will this work for subsystems like statement-distribution and bitfield-ditribution which need to keep backing new candidates on the latest session? AFAIU, with this PR, they would receive the topologies of the old session that has some unfinalized blocks, which I expect will break them

Good catch, Thanks! In statement-distribution we keep the topology in a per-session structure, so there is no problem there, however the bitfield-distribution does keep only the last received topology and that is a problem, looking into how to fix that.

@alindima
Copy link
Contributor

alindima commented Jan 6, 2025

Maybe a better approach would be to have approval-distribution request an old topology from the gossip-support subsystem. With the current form it's also a bit confusing that we're sending NewGossipTopology for an old session

@alexggh
Copy link
Contributor Author

alexggh commented Jan 7, 2025

Maybe a better approach would be to have approval-distribution request an old topology from the gossip-support subsystem.

I've looked closer and it is not that simple because the flow right now is like:

  1. Gossip-support sends NetworkBridgeRxMessage::NewGossipTopology
  2. Network bridge rx receives this message adds some useful information and converts it to NetworkBridgeEvent::NewGossipTopology
  3. approval-distribution, statement-distribution and bitfield-distribution receive NetworkBridgeEvent::NewGossipTopology

Requesting would mean reproducing the same flow, but in reverse: approval-distribution -> network-bridge-rx -> gossip-support, where which subsystem blocks waiting for the other to respond, I don't think we should introduce this because it is easy to produce deadlocks with this type of interactions..

With the current form it's also a bit confusing that we're sending NewGossipTopology for an old session

The message has session_index on it that clearly tells you which session the topology is for, so that allow callers to determine if they need the older ones or not.

In the end I opted to ignore older topologies for bitfield-distributions, so let me know what you think about going this route.

@alindima
Copy link
Contributor

alindima commented Jan 8, 2025

Requesting would mean reproducing the same flow, but in reverse: approval-distribution -> network-bridge-rx -> gossip-support, where which subsystem blocks waiting for the other to respond, I don't think we should introduce this because it is easy to produce deadlocks with this type of interactions..

To avoid this multi-hop, we could store them with the already enriched data in network-bridge-rx.

The message has session_index on it that clearly tells you which session the topology is for, so that allow callers to determine if they need the older ones or not.

You do have the session index there but the naming is confusing. Subsystems that use this code could easily overlook this (like a new leaf update containing an old leaf).

In the end I opted to ignore older topologies for bitfield-distributions, so let me know what you think about going this route.

There is still code that is being run in bitfield-distribution even if the update_topologies did nothing.
Maybe a good compromise would be adding a new message type? So that this change is explicit and is only handled by subsystems that need it

@paritytech-workflow-stopper
Copy link

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12667016328
Failed job name: test-linux-stable-no-try-runtime

Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh
Copy link
Contributor Author

alexggh commented Jan 10, 2025

Requesting would mean reproducing the same flow, but in reverse: approval-distribution -> network-bridge-rx -> gossip-support, where which subsystem blocks waiting for the other to respond, I don't think we should introduce this because it is easy to produce deadlocks with this type of interactions..

To avoid this multi-hop, we could store them with the already enriched data in network-bridge-rx.

The message has session_index on it that clearly tells you which session the topology is for, so that allow callers to determine if they need the older ones or not.

You do have the session index there but the naming is confusing. Subsystems that use this code could easily overlook this (like a new leaf update containing an old leaf).

In the end I opted to ignore older topologies for bitfield-distributions, so let me know what you think about going this route.

There is still code that is being run in bitfield-distribution even if the update_topologies did nothing. Maybe a good compromise would be adding a new message type? So that this change is explicit and is only handled by subsystems that need it

As discussed on chat I ended doing a solution where only approval-distribution receives old topologies and no other subsystem receives them because they are not interested in them.

Signed-off-by: Alexandru Gheorghe <[email protected]>
@@ -750,7 +751,7 @@ where
// This is kept sorted, descending, by block number.
let mut live_heads: Vec<ActivatedLeaf> = Vec::with_capacity(MAX_VIEW_HEADS);
let mut finalized_number = 0;

let mut newest_session = u32::MIN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a fancy way of writing zero :P

polkadot/node/network/bridge/src/rx/tests.rs Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants