Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[redis-rs][core] Move connection refresh to the background #2915

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

GilboaAWS
Copy link
Collaborator

Issue link

This Pull Request is linked to issue (URL): [#2910]

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one issue.
  • Commit message has a detailed description of what changed and why.
  • Tests are added or updated.
  • CHANGELOG.md and documentation files are updated.
  • Destination branch is correct - main or release
  • Create merge commit if merging release branch into main, squash otherwise.

@GilboaAWS GilboaAWS requested a review from a team as a code owner January 5, 2025 09:55
@GilboaAWS GilboaAWS requested review from barshaul and ikolomi January 5, 2025 10:04
Copy link
Collaborator

@ikolomi ikolomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a substantial change in the core mechanics. It must include tests that fail w/o this change to prove it necessity

@nihohit
Copy link
Contributor

nihohit commented Jan 6, 2025

instead of splitting your state between multiple maps and leaving room for error in reading the wrong map, consider just use an enum representing each connection state (Usable, Reconnecting, etc.) in the main connection map.

See

for an example on how to do this.

Copy link
Collaborator

@barshaul barshaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First comments (stopped before fn update_refreshed_connection).
Please see all inline comments and see how it behaves with cases where refresh connections is being called only with management/user/both. Also, maybe I haven't got to it yet - but the refresh task should be bounded to the lifetime of the clusterNode / address, so it would be cancelled when refresh_slots removes it from the topology

// Holds all the failed addresses that started a refresh task.
pub(crate) refresh_addresses_started: DashSet<String>,
// Follow the refresh ops on the connections
pub(crate) refresh_operations: DashMap<String, RefreshState<Connection>>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we talked - instead of using RefreshState, use a ConnectionState that internally holds the operation / connection (same as is reconnectingConnection). For example -

enum ConnectionState<Connection> {
     Connected(ConnectionDetails<Connection>>),
     Reconnecting(JoinHandle<()>),
}

pub struct ClusterNode<Connection> {
    pub user_connection: ConnectionState<Connection>,
    pub management_connection: Option<ConnectionState<Connection>>,
}

or

enum ConnectionState {
     Connected(ClusterNode<Connection>>),
     Reconnecting(handle),
}

connection_map: DashMap<String, ConnectionState<Connection>>,

the refresh_connections can be called only for user/management connection or for both, so you should make sure this solution covers all cases

// Follow the refresh ops on the connections
pub(crate) refresh_operations: DashMap<String, RefreshState<Connection>>,
// Holds all the refreshed addresses that are ready to be inserted into the connection_map
pub(crate) refresh_addresses_done: DashSet<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a good reason for using DashSet or DashMap, these structs are only going to be used from a single point without concurrency

@@ -139,6 +146,13 @@ pub(crate) struct ConnectionsContainer<Connection> {
pub(crate) slot_map: SlotMap,
read_from_replica_strategy: ReadFromReplicaStrategy,
topology_hash: TopologyHash,

// Holds all the failed addresses that started a refresh task.
pub(crate) refresh_addresses_started: DashSet<String>,
Copy link
Collaborator

@barshaul barshaul Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. You can use the completed addresses to hold the newly created cluster node.
  2. Instead of throwing all refresh structs directly into the connection container we might be better by wrapping all updates with some struct, so later on we'll be able to add refresh slots or other updates there too.
    For example:
struct RefreshUpdates {
    in_progress_addresses: HashSet<String>, 
    completed_addresses: HashMap<String, ClusterNode<Connection>>, 
     // refreshed_slot_map: Option<SlotMap>,
}

impl RefreshUpdates {
    pub(crate) fn in_progress(&self) -> &HashSet<String> {
        &self.in_progress_addresses
    }

    pub(crate) fn add_refresh(&mut self, address: String) {
        self.in_progress_addresses.insert(address);
    }

    pub(crate) fn complete_refresh(&mut self, address: String, node: ClusterNode<Connection>) {
        if !self.in_progress_addresses.remove(&address) {
            warn!("Attempted to complete refresh for an address not in progress: {}", address);
        }
        self.completed_addresses.insert(address, node);
    }
    
    pub(crate) fn completed_addresses(&mut self) -> HashMap<String, ClusterNode<Connection>> {
        std::mem::take(&mut self.completed_addresses)
    }
}

None
};
for address in addresses {
if refresh_ops_map.contains_key(&address) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets have an API in the connection_container for "is_reconnecting(address)".
And you can use filter and for each:

        addresses
            .into_iter()
            .filter(|address| !connections_container.is_reconnecting(address))
            .for_each(|address| {
                let handle = tokio::spawn(async move {
                    info!("Refreshing connection task to {:?} started", address);
                    ...
                });
            });

"Refreshing connection task to {:?} started",
address_clone_for_task
);
let _ = async {
Copy link
Collaborator

@barshaul barshaul Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you wrap it with async block again?

}
debug!("refresh connections completed");
debug!("refresh connection tasts initiated");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
debug!("refresh connection tasts initiated");
debug!("refresh connection tasks initiated");

@@ -1798,9 +1858,8 @@ where
if !failed_connections.is_empty() {
Self::refresh_connections(
inner,
failed_connections,
failed_connections.into_iter().collect::<HashSet<String>>(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it creates an unnecessary copy. instead you can change failed_connections to be a set to begin with

Signed-off-by: GilboaAWS <[email protected]>
This reverts commit 4e6535f.

Signed-off-by: GilboaAWS <[email protected]>
…ils because all connection are dropped and the refresh task take longer than the request timeout.

Signed-off-by: GilboaAWS <[email protected]>
@GilboaAWS GilboaAWS marked this pull request as draft January 7, 2025 11:48
@GilboaAWS GilboaAWS self-assigned this Jan 7, 2025
…eturned the refresh_connection logic of sending the connection but without removing it from the connection_map

Signed-off-by: GilboaAWS <[email protected]>
Signed-off-by: GilboaAWS <[email protected]>
…tion before redirecting to a random node

Signed-off-by: GilboaAWS <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants