Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport/manager: Refill dial addresses after a dial failure to make dialing more robust #232

Open
Tracked by #307
lexnv opened this issue Sep 3, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@lexnv
Copy link
Collaborator

lexnv commented Sep 3, 2024

Litep2p supports dialing a peer with a maximum of 8 parallel dial addresses.

let mut records: HashMap<_, _> = addresses
.take(limit)
.into_iter()
.map(|record| (record.address().clone(), record))
.collect();

At the moment, 8 or less addresses are extracted from the available addresses of a peer.
Then, each protocol is provided a list of addresses and the peer enters an Opening state (multiple dial requests):

if !tcp.is_empty() {
self.transports
.get_mut(&SupportedTransport::Tcp)
.expect("transport to be supported")
.open(connection_id, tcp)?;
}
#[cfg(feature = "quic")]
if !quic.is_empty() {
self.transports
.get_mut(&SupportedTransport::Quic)
.expect("transport to be supported")
.open(connection_id, quic)?;
}
#[cfg(feature = "websocket")]
if !websocket.is_empty() {
self.transports
.get_mut(&SupportedTransport::WebSocket)
.expect("transport to be supported")
.open(connection_id, websocket)?;
}

This is problematic when the dial fails on initially the provided addresses, although the peer contains more addresses that are rechable.

To make the dialing more robust, refill dial addresses if any, after dial failures:

  • option A: Refactor transport trait to modify the open() method to return a future

    • with this option, we have complete control over what individual protocols are doing when it comes to dialing
    • an advantage of this is that we can "backpressure" the amount of global dials that happen in litep2p
    • another advantage is that we can refill failed dials immediately with another future
    • anecdata: my system's ability to dial slows down after around 250-300 parallel dials
  • option B: Refactor the transport manager to refill the dial addresses

    • this is similar to option A, without refactoring the transport trait
    • this may be easier to implement, although dials can only be refilled after a transport (TCP) fails to dial all provided addresses

I tend to lean towards option A for the long term benefits

@lexnv lexnv added the enhancement New feature or request label Sep 3, 2024
@dmitry-markin
Copy link
Collaborator

Not sure if this should be of high priority. If it happens 8 first addresses of a peer are unreachable, there is something odd with discovery. Under normal circumstances peers should not provide that many external addresses, and the list should be ordered in a way most reliable addresses are returned first (i.e., --public-addr).

@lexnv
Copy link
Collaborator Author

lexnv commented Sep 5, 2024

Yep indeed, I think #209 should be higher on our list. I think this was one of the most important issues that came about after the kad performance investigation that may improve things :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants