FRC: Retrieval Checking Requirements #1089

Open
wants to merge 9 commits into
base: master

@bajtos bajtos commented Dec 4, 2024

When we set out to build Spark, a protocol for testing whether the payload of Filecoin deals can be retrieved, we designed it around how Boost worked at the time (mid-2023). Soon after FIL+ allocator compliance started using the Spark retrieval success score (Spark RSR) in mid-2024, we learned that Venus Droplet, an alternative miner software, is implemented slightly differently and requires tweaks to support Spark. Things have evolved quite a bit since then. We need to overhaul most of the Spark protocol to support Direct Data Onboarding deals, and we will need all miner software projects (Boost, Curio, Venus) to accommodate the new requirements imposed by the upcoming Spark v2 release.

This FRC has the following goals:

  1. Document the retrieval process based on IPFS/IPLD.
  2. Specify what Spark needs from miner software.
  3. Collaborate with the community to tweak the requirements to work well for all parties involved.
  4. Let this spec and the building blocks like IPNI Reverse Index empower other builders to design & implement their own retrieval-checking networks as alternatives to Spark.

Discussion

#1086

Progress

  • Simple Summary
  • Abstract
  • Change Motivation
  • Specification
  • Design Rationale
  • Backwards Compatibility
  • Test Cases
  • Security Considerations
  • Incentive Considerations
  • Product Considerations
  • Implementation
  • TODO

Author

bajtos commented Dec 4, 2024

Tagging @steven004 @LexLuthr @magik6k @masih @willscott @juliangruber @patrickwoodhead for visibility.


#### Link on-chain MinerId and IPNI provider identity

Storage providers are requires to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).

This requirement cannot be fulfilled in Curio. We no longer have the concept of a minerID <> unique PeerID binding. IPNI must be extended to support other key types, such as the worker key, for signing ads.

Author

I am aware of that; see the note in the text below this paragraph.

> [!NOTE]
> This is open to extensions in the future; we can support more than one form of linking
> index providers to Filecoin miners. See e.g. [ipni/specs#33](https://github.com/ipni/specs/issues/33).

From my point of view, I prefer not to block progress on this FRC until the Curio team figures out how to extend IPNI to support other key types. Instead, I'd like this FRC to document the solution that works with Boost & Venus now and then enhance it with the new mechanism Curio needs once that new solution is agreed on.

Contributor

I am probably mistaken here, but Droplet (Venus' Boost) supports multiple minerIDs being associated with a single PeerID (see docs). Does that mean that if I am using Droplet, I need to limit myself to a 1:1 relationship to meet this requirement?

Author

Great call, @lanzafame! I am still learning how Venus Droplet works and what features it offers.

Based on the docs you linked to, I believe you can have multiple minerIDs associated with a single Droplet PeerID and still meet this requirement.

In Spark, we need the PeerID returned by `Filecoin.StateMinerInfo` to match the PeerID used in IPNI advertisements. Spark does not check whether that PeerID is unique or shared by multiple miners.
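
The matching rule above can be sketched as follows. This is a minimal illustration, not Spark's actual code: both helper functions are hypothetical, and the field names `PeerId` (in the `Filecoin.StateMinerInfo` result) and `AddrInfo.ID` (in the IPNI provider record) are assumptions about the response shapes.

```python
import json


def state_miner_info_request(miner_id: str) -> str:
    """Build a JSON-RPC 2.0 request body for Filecoin.StateMinerInfo.

    The second parameter is the tipset key; null asks for the chain head.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "Filecoin.StateMinerInfo",
        "params": [miner_id, None],
    })


def peer_ids_match(miner_info: dict, ipni_provider: dict) -> bool:
    """Return True when the on-chain PeerID equals the IPNI provider PeerID.

    Uniqueness is deliberately not checked: several miners may share one
    PeerID and each of them still passes.
    """
    on_chain = miner_info.get("PeerId")  # assumed field name
    advertised = (ipni_provider.get("AddrInfo") or {}).get("ID")  # assumed
    return on_chain is not None and on_chain == advertised
```

Because only equality is tested, two minerIDs that report the same PeerID both satisfy the check, which matches the Droplet setup discussed above.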

@bajtos bajtos marked this pull request as ready for review December 18, 2024 12:56
Signed-off-by: Miroslav Bajtoš <[email protected]>
Member

@jsoares jsoares left a comment

Left a few editorial comments. I do not know enough about the specific topic to be able to opine on a technical level. I also found the explanation somewhat unclear, but that could be a consequence of my lack of knowledge, so not holding that against the draft.

Others will be better suited to provide a full review.

Comment on lines +37 to +43
When we set out to build [Spark](https://filspark.com), a protocol for testing whether _payload_ of Filecoin deals can be retrieved back, we designed it based on how [Boost](https://github.com/filecoin-project/boost) worked at that time (mid-2023). Soon after FIL+ allocator compliance started to use Spark retrieval success score (Spark RSR) in mid-2024, we learned that [Venus](https://github.com/filecoin-project/venus) [Droplet](https://github.com/ipfs-force-community/droplet), an alternative miner software, is implemented slightly differently and requires tweaks to support Spark. Things evolved quite a bit since then. We need to overhaul most of the Spark protocol to support Direct Data Onboarding deals. We will need all miner software projects (Boost, Curio, Venus) to accommodate the new requirements imposed by the upcoming Spark v2 release.

This FRC has the following goals:
1. Document the retrieval process based on IPFS/IPLD.
2. Specify what Spark needs from miner software.
3. Collaborate with the community to tweak the requirements to work well for all parties involved.
4. Let this spec and the building blocks like [IPNI Reverse Index](https://github.com/filecoin-project/devgrants/issues/1781) empower other builders to design & implement their own retrieval-checking networks as alternatives to Spark.
Member

This reads more like motivation than an abstract. It'd be useful for the abstract to summarise the actual requirements/spec.

3. Map `(PieceCID, PieceSize)` to IPNI `ContextID` value.
4. Query IPNI reverse index for a sample of payload blocks advertised by `ProviderID` with
`ContextID` (see the [proposed API
spec](https://github.com/ipni/xedni/blob/526f90f5a6001cb50b52e6376f8877163f8018af/openapi.yaml)).
Member

Should this be in the FRC or is it out of scope? The link is fine, but I am trying to understand whether we see it as central.


#### Link on-chain MinerId and IPNI provider identity

Storage providers are requires to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).
Member

Suggested change
Storage providers are requires to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).
Storage providers are required to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).

IPNI provider status ([query](https://cid.contact/providers/12D3KooWPNbkEgjdBNeaCGpsgCrPRETe4uBZf1ShFXStobdN18ys)):
Member

Not a huge fan of these arbitrary links on a document that's intended to be frozen for a long time.

Comment on lines +268 to +274
1. It's inefficient.

1. Each retrieval check requires two requests - one to download ~8MB chunk of a piece, the second one to download the payload block found in that chunk.

1. Spark typically repeats every retrieval check 40-100 times. Scanning CAR byte range 40-100 times does not bring enough value to justify the network bandwidth & CPU cost.

1. It's not clear how can retrieval checkers discover the address where the SP serves piece retrievals.
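
To put rough numbers on the bandwidth argument above, here is a small back-of-the-envelope calculation. It uses only the figures quoted in the text (a chunk of about 8 MB and 40 to 100 repetitions); the function name is hypothetical and the payload-block download is left out because its size is not stated.

```python
CHUNK_BYTES = 8 * 1000 * 1000  # "~8MB chunk of a piece" from the text


def scan_bandwidth_bytes(repeats: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Bytes downloaded for the CAR byte-range scans of one repeated check."""
    return repeats * chunk_bytes


low = scan_bandwidth_bytes(40)    # 320_000_000 bytes, about 320 MB
high = scan_bandwidth_bytes(100)  # 800_000_000 bytes, about 800 MB
```
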
Member

The 1 numbered list renders fine, but is not great for reading in raw md.

Comment on lines +288 to +290
[Retrieval Checking Requirements](#retrieval-checking-requirements) introduce the following breaking changes:
- Miner software must construct IPNI `ContextID` values in a specific way.
- Because such ContextIDs are scoped per piece (not per deal), miner software must de-duplicate advertisements for deals storing the same piece.
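
A minimal sketch of the de-duplication step, assuming miner software can list its active deals with their piece information. The `Deal` type and `pieces_to_advertise` helper are hypothetical names for illustration; the actual `ContextID` encoding is defined in the FRC and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deal:
    deal_id: int
    piece_cid: str
    piece_size: int


def pieces_to_advertise(deals: list[Deal]) -> list[tuple[str, int]]:
    """Return each (PieceCID, PieceSize) pair once, in first-seen order.

    Each returned pair corresponds to a single IPNI advertisement, no
    matter how many deals store that piece.
    """
    seen: set[tuple[str, int]] = set()
    unique: list[tuple[str, int]] = []
    for deal in deals:
        key = (deal.piece_cid, deal.piece_size)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With per-deal ContextIDs, two deals for the same piece would produce two advertisements; with the per-piece key above they collapse into one.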
Member

Just to be clear, and given this is an FRC, what are we breaking exactly?

Comment on lines +24 to +31
To make Filecoin a usable data storage offering, we need the content to be retrievable. It's difficult to improve what you don't measure; therefore, we need to measure the quality of retrieval service provided by each storage provider. To allow 3rd-party networks like [Spark](https://filspark.com) to sample active deals from the on-chain activity and check whether the SP is serving retrievals for the stored content, we need SPs to meet the following requirements:

1. Link on-chain MinerId and IPNI provider identity ([spec](#link-on-chain-minerid-and-ipni-provider-identity)).
2. Provide retrieval service using the [IPFS Trustless HTTP Gateway protocol](https://specs.ipfs.tech/http-gateways/trustless-gateway/).
3. Advertise retrievals to IPNI.
4. In IPNI advertisements, construct the `ContextID` field from `(PieceCID, PieceSize)` ([spec](#construct-ipni-contextid-from-piececid-piecesize))

Meeting these requirements needs support in software implementations like Boost, Curio & Venus Droplet but potentially also updates in settings configured by the individual SPs.
Member

Suggested change
To make Filecoin a usable data storage offering, we need the content to be retrievable. It's difficult to improve what you don't measure; therefore, we need to measure the quality of retrieval service provided by each storage provider. To allow 3rd-party networks like [Spark](https://filspark.com) to sample active deals from the on-chain activity and check whether the SP is serving retrievals for the stored content, we need SPs to meet the following requirements:
1. Link on-chain MinerId and IPNI provider identity ([spec](#link-on-chain-minerid-and-ipni-provider-identity)).
2. Provide retrieval service using the [IPFS Trustless HTTP Gateway protocol](https://specs.ipfs.tech/http-gateways/trustless-gateway/).
3. Advertise retrievals to IPNI.
4. In IPNI advertisements, construct the `ContextID` field from `(PieceCID, PieceSize)` ([spec](#construct-ipni-contextid-from-piececid-piecesize))
Meeting these requirements needs support in software implementations like Boost, Curio & Venus Droplet but potentially also updates in settings configured by the individual SPs.
To make Filecoin a usable data storage offering, we need the content to be retrievable. It's difficult to improve what you don't measure; therefore, we need to measure the quality of retrieval service provided by each storage provider. This FRC outlines requirements that SPs and their software stacks should meet to allow 3rd-party networks to sample active deals from the on-chain activity and check whether the SP is serving retrievals for the stored content.

The goal here is not to go into technical detail. I left a non-binding suggestion; something along these lines would be preferable.

The content in 26-29 would potentially be a good fit for the abstract; see comment below.


### Retrieval Requirements

1. Whenever a deal is activated, the SP MUST advertise all IPFS/IPLD payload block CIDs found in the Piece to IPNI. See the [IPNI Specification](https://github.com/ipni/specs/blob/main/IPNI.md) and [IPNI HTTP Provider](https://github.com/ipni/specs/blob/main/IPNI_HTTP_PROVIDER.md) for technical details.
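
A retrieval checker could confirm this requirement by looking up a payload CID in IPNI and checking that the SP appears among the providers. The sketch below only parses an already-fetched lookup response; the `MultihashResults` / `ProviderResults` / `Provider.ID` shape is an assumption about the IPNI find response, and both function names are hypothetical.

```python
def providers_for(find_response: dict) -> set[str]:
    """Collect the provider PeerIDs listed in an IPNI lookup response."""
    ids: set[str] = set()
    for result in find_response.get("MultihashResults") or []:
        for entry in result.get("ProviderResults") or []:
            peer_id = (entry.get("Provider") or {}).get("ID")
            if peer_id:
                ids.add(peer_id)
    return ids


def is_advertised_by(find_response: dict, provider_id: str) -> bool:
    """True when provider_id advertised the looked-up payload CID."""
    return provider_id in providers_for(find_response)
```
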
Contributor

Can a client opt out of their deal payload being indexed?

Author

bajtos commented Jan 23, 2025

Thank you for the feedback! I'll take a look and respond to your comments (early) next week.

6 participants