Collation fairness in backing subsystem #7114

Draft: wants to merge 27 commits into master

tdimitrov (Contributor):
#4880 implements collation fairness in the collator protocol subsystem, but it has some drawbacks:

  1. It doesn't take group rotations into account. The new backing group has no way to know whether claims within its view have already been made by the previous backing group.
  2. It doesn't take into account statements from other validators in the same backing group. If multiple collators are generating candidates via different validators, the latter don't know about each other's fetches. As a result they might fetch more advertisements than necessary.

To overcome these problems a solution was proposed in #5079 (comment). In a nutshell, the idea is to move collation tracking into the backing subsystem, since it has a better view of which candidates are being fetched and seconded. How does this solve the problems outlined above? One of the backing subsystem's goals is to keep track of all candidates that are getting backed, no matter whether they originate from the validator's own backing group (if any) or not. This naturally solves both problems. First, we track candidates across group rotations since we have a single view of the claim queue. Second, we receive statements from all backing groups and can keep the local claim queue state relatively up to date.

Implementation notes

The PR touches the collator protocol (validator side), the backing subsystem and ClaimQueueState (introduced with #4880).

ClaimQueueState

ClaimQueueState was created on the fly in #4880. This means that each time we wanted to know the state of the claim queue we built a new ClaimQueueState instance. In this PR we want to keep a single view of the claim queue for each leaf and each core and continuously update it by adding new leaves and candidates and dropping old ones. For this reason ClaimQueueState is extended to keep track of the candidate hash occupying each slot. This is needed because we want to be able to drop claims for failed fetches, invalid candidates, etc. So claimed is no longer a bool but an enum:

enum ClaimState {
	/// Unclaimed
	Free,
	/// The candidate is pending fetching or validation.
	Pending(CandidateHash),
	/// The candidate is seconded.
	Seconded(CandidateHash),
}

Note that the ordering here is not perfect. Let's say there are candidates A1 -> A2 -> A3 for a parachain, each built on top of the previous one. If we see them in the order A2, A3, A1 we will insert them in the claim queue in the same (wrong) order. This is fine because (a) we need the candidate hash only to be able to release claims and (b) the exact order of candidates is known by the prospective parachains subsystem.
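
For illustration, here is a minimal, self-contained sketch of how per-slot claim tracking could work. ParaId and CandidateHash are stand-ins for the real polkadot-primitives types, and the method names (claim_pending, mark_seconded, release) are assumptions for this sketch, not necessarily the PR's actual API.

// Stand-ins for the real polkadot-primitives types.
type ParaId = u32;
type CandidateHash = [u8; 32];

#[derive(Clone, PartialEq, Eq, Debug)]
enum ClaimState {
	/// Unclaimed
	Free,
	/// The candidate is pending fetching or validation.
	Pending(CandidateHash),
	/// The candidate is seconded.
	Seconded(CandidateHash),
}

struct ClaimQueueState {
	/// One entry per claim queue slot, in claim queue order: the para assigned
	/// to the slot and the state of the claim for it.
	slots: Vec<(ParaId, ClaimState)>,
}

impl ClaimQueueState {
	/// Reserve the first free slot assigned to `para` for a candidate which is
	/// being fetched or validated. Returns false if no free slot exists.
	fn claim_pending(&mut self, para: ParaId, candidate: CandidateHash) -> bool {
		for (assigned, state) in self.slots.iter_mut() {
			if *assigned == para && *state == ClaimState::Free {
				*state = ClaimState::Pending(candidate);
				return true;
			}
		}
		false
	}

	/// Promote a pending claim once the candidate got seconded.
	fn mark_seconded(&mut self, candidate: CandidateHash) {
		for (_, state) in self.slots.iter_mut() {
			if *state == ClaimState::Pending(candidate) {
				*state = ClaimState::Seconded(candidate);
			}
		}
	}

	/// Release the claim of a candidate whose fetch or validation failed.
	fn release(&mut self, candidate: CandidateHash) {
		for (_, state) in self.slots.iter_mut() {
			if *state == ClaimState::Pending(candidate) {
				*state = ClaimState::Free;
			}
		}
	}
}

The important property is that a Pending claim can always be released (e.g. on a failed fetch), while a Seconded claim stays occupied.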

The second change in this module is related to leaves. The claim queue has a different state for each fork and we need to be able to track this. For this reason PerLeafClaimQueueState is introduced, which wraps a HashMap with a ClaimQueueState per leaf. The non-trivial logic here is adding a new leaf (add_leaf), sketched after the list below. We need to handle three different cases:

  1. The new leaf builds on top of an old leaf. In other words, we are just extending a fork with a new element.
  2. The new leaf is a fork from a non-leaf block in another fork. In this case we are creating a new fork.
  3. The new leaf is a completely new and separate fork.

❗ All these cases are covered with unit tests but I'd appreciate extra attention from the reviewers. ❗
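
To make the three cases concrete, here is a rough, self-contained sketch of how add_leaf could distinguish them, assuming each leaf's state remembers the chain of block hashes it was built over. The types and field names are illustrative, not the PR's actual code.

use std::collections::HashMap;

type Hash = [u8; 32];

#[derive(Clone)]
struct ClaimQueueState {
	/// Block hashes this state was built over, oldest first.
	path: Vec<Hash>,
	// Claims per slot omitted for brevity.
}

#[derive(Default)]
struct PerLeafClaimQueueState {
	per_leaf: HashMap<Hash, ClaimQueueState>,
}

impl PerLeafClaimQueueState {
	fn add_leaf(&mut self, leaf: Hash, parent: Hash) {
		// Case 1: the new leaf extends an existing leaf - reuse its state.
		if let Some(mut state) = self.per_leaf.remove(&parent) {
			state.path.push(leaf);
			self.per_leaf.insert(leaf, state);
			return;
		}

		// Case 2: the parent is a non-leaf block of another fork - clone that
		// fork's state up to the parent, i.e. create a new fork.
		let forked = self
			.per_leaf
			.values()
			.find(|state| state.path.contains(&parent))
			.cloned();
		if let Some(mut state) = forked {
			let cut = state.path.iter().position(|h| *h == parent).expect("checked above") + 1;
			state.path.truncate(cut);
			state.path.push(leaf);
			self.per_leaf.insert(leaf, state);
			return;
		}

		// Case 3: a completely new, separate fork - start with fresh state.
		self.per_leaf.insert(leaf, ClaimQueueState { path: vec![leaf] });
	}
}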

backing subsystem

The backing subsystem has an instance of PerLeafClaimQueueState for each core and updates it by keeping track of new candidates. We have three different sets of candidates:

  1. Candidates which the local validator fetches from a collator.
  2. Seconded candidates from our backing group which we receive from statement distribution.
  3. Backed candidates from other backing groups.

Each set is handled differently (a small sketch follows the list below):

  1. When we receive an advertisement which we want to fetch, we first reserve a claim in the claim queue (handled by the collator protocol). When we receive the candidate we validate it. On fetch or validation failure we need to drop the claim.
  2. We import the candidate if we have a free slot in the claim queue and validate it. We drop the claim if validation fails.
  3. If there is a spot in the claim queue we claim it and import the candidate. Once claimed, these slots are not supposed to be released.
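
As an illustration only (the names below are assumptions, not the PR's types), the three sets map onto claim queue operations roughly like this:

/// Where a candidate reached the backing subsystem from.
enum CandidateSource {
	/// Fetched by this validator from a collator; the claim was reserved by
	/// the collator protocol before the fetch started.
	LocalFetch,
	/// A Seconded statement from another validator in our own backing group.
	OwnGroupStatement,
	/// A backed candidate from another backing group.
	OtherGroupBacked,
}

/// What backing does to the claim queue for a candidate from that source.
enum ClaimAction {
	/// The claim already exists; release it if fetching or validation fails.
	AlreadyClaimedReleaseOnFailure,
	/// Claim a free slot before importing; release it if validation fails.
	ClaimReleaseOnFailure,
	/// Claim a free slot before importing; never release it.
	ClaimAndKeep,
}

fn claim_action(source: CandidateSource) -> ClaimAction {
	match source {
		CandidateSource::LocalFetch => ClaimAction::AlreadyClaimedReleaseOnFailure,
		CandidateSource::OwnGroupStatement => ClaimAction::ClaimReleaseOnFailure,
		CandidateSource::OtherGroupBacked => ClaimAction::ClaimAndKeep,
	}
}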

To achieve all of this, three new messages are introduced:

  • CanClaim - returns true if a claim can be made for a specific para id. This message is needed to handle collator protocol v1 advertisements which don't contain a candidate hash. More on this in the next section.
  • PendingSlots - returns all pending claims from the claim queue.
  • DropClaims - drops the claims for one or more candidates.

All these messages are used by the collator protocol and their application will be explained there.
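
For concreteness, the new messages could look roughly like the variants below on the backing subsystem's message type. This is a hedged sketch: the exact variant payloads and the response channel types are assumptions, not the PR's actual definitions.

use futures::channel::oneshot;

type Hash = [u8; 32];
type ParaId = u32;
type CandidateHash = [u8; 32];

enum CandidateBackingMessage {
	// ... existing variants elided ...

	/// Can a claim be made for this para at the given relay parent? Used for
	/// protocol v1 advertisements where the candidate hash is not known yet.
	CanClaim(Hash, ParaId, oneshot::Sender<bool>),
	/// Return all claims currently in pending state, ordered by priority.
	PendingSlots(Hash, oneshot::Sender<Vec<(ParaId, CandidateHash)>>),
	/// Release the claims held for the given candidates (e.g. failed fetches).
	DropClaims(Vec<CandidateHash>),
}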

In a nutshell, the following changes are made in backing:

  1. Instantiate PerLeafClaimQueueState and keep it up to date. This includes adding new leaves, removing stale leaves, making claims for candidates and releasing claims for bad candidates. These should be easy to spot in the PR.
  2. Handle the new messages mentioned in the previous paragraph.

collator protocol

The most controversial changes are in this subsystem. Since the backing subsystem keeps track of the claim queue state, the collator protocol no longer builds ClaimQueueState 'on the fly'. The price to pay is additional messages exchanged with backing.
Generally speaking, the collator protocol follows these steps:

  1. An advertisement is received.
    If it is over protocol version 2+, a CanSecond message is sent to the backing subsystem. The latter gets the leaves where the candidate can be seconded and makes a claim for each leaf. If a claim can't be made, the leaf is dropped.
    If it is over protocol version 1 we can't claim a slot in the claim queue since we don't know the candidate hash. I opted not to support generic claims since they complicate the implementation and hopefully we will deprecate v1 at some point. So what we do is send CanClaim to the backing subsystem and get a confirmation that at this point there is a free spot for the candidate in the claim queue. There is a potential race here because the spot is not claimed and can be occupied by another candidate while we do the fetch. I think this risk is acceptable. Worst case we might have a few extra candidates.
  2. If we get a confirmation from backing that we can fetch the candidate, the actual fetch is initiated. If it is successful the collator protocol sends Seconded to backing and the claim is marked as Seconded. If the fetch is unsuccessful we use DropClaims to notify backing that the claim should be released.
  3. We might have more than one candidate to fetch. If so, we need to make a decision based on the claim queue. The problem is we no longer have the claim queue state locally, so we need to get it with yet another message. PendingSlots returns all claims in pending state (these are all pending fetches), ordered by priority. We check whether we have a pending candidate for this para id and fetch it (see the sketch after this list).
  • While writing this I just realised we can optimise it for some cases. If there is only one pending candidate in the queue we don't care about the claim queue state. Note that if we receive an advertisement and there is no pending fetch we immediately initiate fetching without adding the candidate to the queue. So in the happy case we don't send PendingSlots.
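
Here is a small, self-contained sketch of the fetch-ordering decision from step 3, including the optimisation above. The function name, types and the exact matching rule are assumptions for illustration; in the PR the pending claims would come back via the PendingSlots message.

type ParaId = u32;
type CandidateHash = [u8; 32];

/// Decide which of the waiting advertisements to fetch next. `pending_slots`
/// stands in for the PendingSlots round trip to backing and returns the
/// pending claims ordered by priority.
fn pick_next_fetch(
	waiting: &[(ParaId, CandidateHash)],
	pending_slots: impl FnOnce() -> Vec<(ParaId, CandidateHash)>,
) -> Option<(ParaId, CandidateHash)> {
	match waiting {
		// Nothing to fetch.
		[] => None,
		// Only one candidate is waiting - no need to consult the claim queue.
		[only] => Some(*only),
		// Otherwise ask backing for the pending claims and pick the first
		// waiting advertisement matching the highest priority pending claim.
		_ => {
			let slots = pending_slots();
			slots.iter().find_map(|(para, _)| {
				waiting.iter().find(|(wp, _)| wp == para).copied()
			})
		}
	}
}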

The collator protocol also keeps track of candidates which are blocked by backing (CanSecond has returned a BlockedByBacking error for them). These candidates have pending claims, and if their relay parent goes out of view they are dropped.

❗ Note for the reviewer: This might be abused by making a lot of unbackable advertisements which fill up the claim queue. The alternative is to release the claims for these candidates and make a new claim when they are unblocked. This will introduce additional complexity but might be worth doing. I'd love to get some feedback on this. ❗
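
For reference, a minimal sketch (with assumed names, not the PR's code) of the bookkeeping described above: blocked advertisements are grouped by relay parent and their claims are dropped via DropClaims once the relay parent leaves the view.

use std::collections::HashMap;

type Hash = [u8; 32];
type ParaId = u32;
type CandidateHash = [u8; 32];

#[derive(Default)]
struct BlockedAdvertisements {
	/// Blocked advertisements grouped by relay parent; each still holds a claim.
	by_relay_parent: HashMap<Hash, Vec<(ParaId, CandidateHash)>>,
}

impl BlockedAdvertisements {
	/// Remove everything anchored to relay parents that left the view and
	/// return the candidate hashes whose claims should be dropped via DropClaims.
	fn prune(&mut self, still_in_view: &dyn Fn(&Hash) -> bool) -> Vec<CandidateHash> {
		let mut to_drop = Vec::new();
		self.by_relay_parent.retain(|relay_parent, ads| {
			if still_in_view(relay_parent) {
				true
			} else {
				to_drop.extend(ads.iter().map(|(_, hash)| *hash));
				false
			}
		});
		to_drop
	}
}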

TODOs

Testing for this PR is still a work in progress, but I consider the functionality done.

  • Handle the TODOs in the code
  • Add support for forks in backing tests so that all current tests are green.
  • Add fairness tests in backing.
  • Fix collator tests and add cases for the new flows.

@paritytech-workflow-stopper:

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12711343478
Failed job name: build-rustdoc
