Collation fairness in backing subsystem #7114

Draft: wants to merge 27 commits into master

tdimitrov (Contributor):
#4880 implements collation fairness in the collator protocol subsystem, but it has some drawbacks:

  1. It doesn't take group rotations into account. The new backing group has no way to know whether claims within its view have already been made by the previous backing group.
  2. It doesn't take into account statements from other validators in the same backing group. If multiple collators are generating candidates via different validators, the latter don't know about each other's fetches. As a result they might fetch more advertisements than necessary.

To overcome these problems a solution was proposed in #5079 (comment). In a nutshell, the idea is to move collation tracking into the backing subsystem, since it has a better view of which candidates are being fetched and seconded. How does this solve the problems outlined above? One of the backing subsystem's goals is to keep track of all candidates that are getting backed, no matter whether they originate from the validator's own backing group (if any) or not. This naturally solves both problems. First, we track candidates across group rotations since we have a single view of the claim queue. Second, we receive statements from all backing groups and can keep the local claim queue state relatively up to date.

Implementation notes

The PR touches the collator protocol (validator side), the backing subsystem and ClaimQueueState (introduced with #4880).

ClaimQueueState

ClaimQueueState was created on the fly in #4880. This means that each time we wanted to know the state of the claim queue we built a new ClaimQueueState instance. In this PR we want to keep a single view of the claim queue for each leaf and each core and continuously update it by adding new leaves and candidates and dropping old ones. For this reason ClaimQueueState is extended to keep track of the candidate hash occupying each slot. This is needed because we want to be able to drop claims for failed fetches, invalid candidates, etc. So claimed is no longer a bool but an enum:

enum ClaimState {
	/// Unclaimed
	Free,
	/// The candidate is pending fetching or validation.
	Pending(CandidateHash),
	/// The candidate is seconded.
	Seconded(CandidateHash),
}

Note that the ordering here is not perfect. Let's say there are candidates A1 -> A2 -> A3 for a parachain, each built on top of the previous one. If we see them in the order A2, A3, A1 we will insert them in the claim queue in the same (wrong) order. This is fine because (a) we need the candidate hash only to be able to release claims and (b) the exact order of candidates is known by the prospective parachains subsystem.
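
For illustration, here is a minimal, self-contained sketch of how per-slot claim tracking could work. ParaId and CandidateHash are stand-ins for the real polkadot-primitives types, and the method names (claim_pending, mark_seconded, release) are assumptions for this sketch, not necessarily the PR's actual API.

// Stand-ins for the real polkadot-primitives types.
type ParaId = u32;
type CandidateHash = [u8; 32];

#[derive(Clone, PartialEq, Eq, Debug)]
enum ClaimState {
	/// Unclaimed
	Free,
	/// The candidate is pending fetching or validation.
	Pending(CandidateHash),
	/// The candidate is seconded.
	Seconded(CandidateHash),
}

struct ClaimQueueState {
	/// One entry per claim queue slot, in claim queue order: the para assigned
	/// to the slot and the state of the claim for it.
	slots: Vec<(ParaId, ClaimState)>,
}

impl ClaimQueueState {
	/// Reserve the first free slot assigned to `para` for a candidate which is
	/// being fetched or validated. Returns false if no free slot exists.
	fn claim_pending(&mut self, para: ParaId, candidate: CandidateHash) -> bool {
		for (assigned, state) in self.slots.iter_mut() {
			if *assigned == para && *state == ClaimState::Free {
				*state = ClaimState::Pending(candidate);
				return true;
			}
		}
		false
	}

	/// Promote a pending claim once the candidate got seconded.
	fn mark_seconded(&mut self, candidate: CandidateHash) {
		for (_, state) in self.slots.iter_mut() {
			if *state == ClaimState::Pending(candidate) {
				*state = ClaimState::Seconded(candidate);
			}
		}
	}

	/// Release the claim of a candidate whose fetch or validation failed.
	fn release(&mut self, candidate: CandidateHash) {
		for (_, state) in self.slots.iter_mut() {
			if *state == ClaimState::Pending(candidate) {
				*state = ClaimState::Free;
			}
		}
	}
}

The important property is that a Pending claim can always be released (e.g. on a failed fetch), while a Seconded claim stays occupied.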

The second change in this module is related to leaves. The claim queue has a different state for each fork and we need to be able to track this. For this reason PerLeafClaimQueueState is introduced, which wraps a HashMap with a ClaimQueueState per leaf. The non-trivial logic here is adding a new leaf (add_leaf), sketched after the list below. We need to handle three different cases:

  1. The new leaf builds on top of an old leaf. In other words, we are just extending a fork with a new element.
  2. The new leaf is a fork from a non-leaf block in another fork. In this case we are creating a new fork.
  3. The new leaf is a completely new and separate fork.

❗ All these cases are covered with unit tests but I'd appreciate extra attention from the reviewers. ❗
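
To make the three cases concrete, here is a rough, self-contained sketch of how add_leaf could distinguish them, assuming each leaf's state remembers the chain of block hashes it was built over. The types and field names are illustrative, not the PR's actual code.

use std::collections::HashMap;

type Hash = [u8; 32];

#[derive(Clone)]
struct ClaimQueueState {
	/// Block hashes this state was built over, oldest first.
	path: Vec<Hash>,
	// Claims per slot omitted for brevity.
}

#[derive(Default)]
struct PerLeafClaimQueueState {
	per_leaf: HashMap<Hash, ClaimQueueState>,
}

impl PerLeafClaimQueueState {
	fn add_leaf(&mut self, leaf: Hash, parent: Hash) {
		// Case 1: the new leaf extends an existing leaf - reuse its state.
		if let Some(mut state) = self.per_leaf.remove(&parent) {
			state.path.push(leaf);
			self.per_leaf.insert(leaf, state);
			return;
		}

		// Case 2: the parent is a non-leaf block of another fork - clone that
		// fork's state up to the parent, i.e. create a new fork.
		let forked = self
			.per_leaf
			.values()
			.find(|state| state.path.contains(&parent))
			.cloned();
		if let Some(mut state) = forked {
			let cut = state.path.iter().position(|h| *h == parent).expect("checked above") + 1;
			state.path.truncate(cut);
			state.path.push(leaf);
			self.per_leaf.insert(leaf, state);
			return;
		}

		// Case 3: a completely new, separate fork - start with fresh state.
		self.per_leaf.insert(leaf, ClaimQueueState { path: vec![leaf] });
	}
}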

backing subsystem

The backing subsystem has an instance of PerLeafClaimQueueState for each core and updates it by keeping track of new candidates. We have three different sets of candidates:

  1. Candidates which the local validator fetches from a collator.
  2. Seconded candidates from our backing group which we receive from statement distribution.
  3. Backed candidates from other backing groups.

Each set is handled differently (a small sketch follows the list below):

  1. When we receive an advertisement which we want to fetch, we first reserve a claim in the claim queue (handled by the collator protocol). When we receive the candidate we validate it. On fetch or validation failure we need to drop the claim.
  2. We import the candidate if we have a free slot in the claim queue and validate it. We drop the claim if validation fails.
  3. If there is a spot in the claim queue we claim it and import the candidate. Once claimed, these slots are not supposed to be released.
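
As an illustration only (the names below are assumptions, not the PR's types), the three sets map onto claim queue operations roughly like this:

/// Where a candidate reached the backing subsystem from.
enum CandidateSource {
	/// Fetched by this validator from a collator; the claim was reserved by
	/// the collator protocol before the fetch started.
	LocalFetch,
	/// A Seconded statement from another validator in our own backing group.
	OwnGroupStatement,
	/// A backed candidate from another backing group.
	OtherGroupBacked,
}

/// What backing does to the claim queue for a candidate from that source.
enum ClaimAction {
	/// The claim already exists; release it if fetching or validation fails.
	AlreadyClaimedReleaseOnFailure,
	/// Claim a free slot before importing; release it if validation fails.
	ClaimReleaseOnFailure,
	/// Claim a free slot before importing; never release it.
	ClaimAndKeep,
}

fn claim_action(source: CandidateSource) -> ClaimAction {
	match source {
		CandidateSource::LocalFetch => ClaimAction::AlreadyClaimedReleaseOnFailure,
		CandidateSource::OwnGroupStatement => ClaimAction::ClaimReleaseOnFailure,
		CandidateSource::OtherGroupBacked => ClaimAction::ClaimAndKeep,
	}
}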

To achieve all of this, three new messages are introduced:

  • CanClaim - returns true if a claim can be made for a specific para id. This message is needed to handle collator protocol v1 advertisements which don't contain a candidate hash. More on this in the next section.
  • PendingSlots - returns all pending claims from the claim queue.
  • DropClaims - drops the claims for one or more candidates.

All these messages are used by the collator protocol and their application will be explained there.
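
For concreteness, the new messages could look roughly like the variants below on the backing subsystem's message type. This is a hedged sketch: the exact variant payloads and the response channel types are assumptions, not the PR's actual definitions.

use futures::channel::oneshot;

type Hash = [u8; 32];
type ParaId = u32;
type CandidateHash = [u8; 32];

enum CandidateBackingMessage {
	// ... existing variants elided ...

	/// Can a claim be made for this para at the given relay parent? Used for
	/// protocol v1 advertisements where the candidate hash is not known yet.
	CanClaim(Hash, ParaId, oneshot::Sender<bool>),
	/// Return all claims currently in pending state, ordered by priority.
	PendingSlots(Hash, oneshot::Sender<Vec<(ParaId, CandidateHash)>>),
	/// Release the claims held for the given candidates (e.g. failed fetches).
	DropClaims(Vec<CandidateHash>),
}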

In a nutshell, the following changes are made in backing:

  1. Instantiate PerLeafClaimQueueState and keep it up to date. This includes adding new leaves, removing stale leaves, making claims for candidates and releasing claims for bad candidates. These should be easy to spot in the PR.
  2. Handle the new messages mentioned in the previous paragraph.

collator protocol

The most controversial changes are in this subsystem. Since the backing subsystem keeps track of the claim queue state, the collator protocol no longer builds ClaimQueueState 'on the fly'. The price to pay is additional messages exchanged with backing.
Generally speaking, the collator protocol follows these steps:

  1. An advertisement is received.
    If it is over protocol version 2+, a CanSecond message is sent to the backing subsystem. The latter gets the leaves where the candidate can be seconded and makes a claim for each leaf. If a claim can't be made, the leaf is dropped.
    If it is over protocol version 1 we can't claim a slot in the claim queue since we don't know the candidate hash. I opted not to support generic claims since they complicate the implementation and hopefully we will deprecate v1 at some point. So what we do is send CanClaim to the backing subsystem and get a confirmation that at this point there is a free spot for the candidate in the claim queue. There is a potential race here because the spot is not claimed and can be occupied by another candidate while we do the fetch. I think this risk is acceptable. Worst case we might have a few extra candidates.
  2. If we get a confirmation from backing that we can fetch the candidate, the actual fetch is initiated. If it is successful the collator protocol sends Seconded to backing and the claim is marked as Seconded. If the fetch is unsuccessful we use DropClaims to notify backing that the claim should be released.
  3. We might have more than one candidate to fetch. If so, we need to make a decision based on the claim queue. The problem is we no longer have the claim queue state locally, so we need to get it with yet another message. PendingSlots returns all claims in pending state (these are all pending fetches), ordered by priority. We check whether we have a pending candidate for this para id and fetch it (see the sketch after this list).
  • While writing this I just realised we can optimise it for some cases. If there is only one pending candidate in the queue we don't care about the claim queue state. Note that if we receive an advertisement and there is no pending fetch we immediately initiate fetching without adding the candidate to the queue. So in the happy case we don't send PendingSlots.
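
Here is a small, self-contained sketch of the fetch-ordering decision from step 3, including the optimisation above. The function name, types and the exact matching rule are assumptions for illustration; in the PR the pending claims would come back via the PendingSlots message.

type ParaId = u32;
type CandidateHash = [u8; 32];

/// Decide which of the waiting advertisements to fetch next. `pending_slots`
/// stands in for the PendingSlots round trip to backing and returns the
/// pending claims ordered by priority.
fn pick_next_fetch(
	waiting: &[(ParaId, CandidateHash)],
	pending_slots: impl FnOnce() -> Vec<(ParaId, CandidateHash)>,
) -> Option<(ParaId, CandidateHash)> {
	match waiting {
		// Nothing to fetch.
		[] => None,
		// Only one candidate is waiting - no need to consult the claim queue.
		[only] => Some(*only),
		// Otherwise ask backing for the pending claims and pick the first
		// waiting advertisement matching the highest priority pending claim.
		_ => {
			let slots = pending_slots();
			slots.iter().find_map(|(para, _)| {
				waiting.iter().find(|(wp, _)| wp == para).copied()
			})
		}
	}
}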

The collator protocol also keeps track of candidates which are blocked by backing (CanSecond has returned a BlockedByBacking error for them). These candidates have pending claims, and if their relay parent goes out of view they are dropped.

❗ Note for the reviewer: This might be abused by making a lot of unbackable advertisements which fill up the claim queue. The alternative is to release the claims for these candidates and make a new claim when they are unblocked. This will introduce additional complexity but might be worth doing. I'd love to get some feedback on this. ❗
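
For reference, a minimal sketch (with assumed names, not the PR's code) of the bookkeeping described above: blocked advertisements are grouped by relay parent and their claims are dropped via DropClaims once the relay parent leaves the view.

use std::collections::HashMap;

type Hash = [u8; 32];
type ParaId = u32;
type CandidateHash = [u8; 32];

#[derive(Default)]
struct BlockedAdvertisements {
	/// Blocked advertisements grouped by relay parent; each still holds a claim.
	by_relay_parent: HashMap<Hash, Vec<(ParaId, CandidateHash)>>,
}

impl BlockedAdvertisements {
	/// Remove everything anchored to relay parents that left the view and
	/// return the candidate hashes whose claims should be dropped via DropClaims.
	fn prune(&mut self, still_in_view: &dyn Fn(&Hash) -> bool) -> Vec<CandidateHash> {
		let mut to_drop = Vec::new();
		self.by_relay_parent.retain(|relay_parent, ads| {
			if still_in_view(relay_parent) {
				true
			} else {
				to_drop.extend(ads.iter().map(|(_, hash)| *hash));
				false
			}
		});
		to_drop
	}
}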

TODOs

Testing for this PR is still a work in progress, but I consider the functionality done.

  • Handle the TODOs in the code
  • Add support for forks in backing tests so that all current tests are green.
  • Add fairness tests in backing.
  • Fix collator tests and add cases for the new flows.

@paritytech-workflow-stopper:

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12711343478
Failed job name: build-rustdoc
