An authservice node will issue a peek request to other nodes if it can't find a record. This ensures we cover the case of a user inserting an auth record in one region and retrieving it in another (e.g. insert on ap1, retrieve on us1, as the service is global), or the case where the insert request goes to node (a) behind a load balancer/proxy but the retrieval request goes to node (b).
We do a "broadcasted get" sending a peek request to all nodes and the first one to return a record wins, with the rest being canceled.
However, we noticed a few things:
Logs show the peek context cancellation error for requests we canceled intentionally.
The as_badgerauth_peer_down metric seems to be reported when peeks occur. Connections between nodes appear to be fine in production now that we solved https://github.com/storj/infra/issues/3086. Artur thinks it's probably "Context cancellation closes the connection" (drpc#37), i.e. the way we cancel peek requests is causing redials to other nodes in the DRPC connection pool.
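For reference, here is a rough Go sketch of the suspected interaction. The `Pool`, `Conn`, and `markPeerDown` names are hypothetical stand-ins, not the actual drpc or authservice API: when a losing peek's context is cancelled, the connection it was using gets closed and dropped from the pool, so the next peek to that peer has to redial and can briefly count the peer as down.

```go
package authpeek

import "context"

// Conn stands in for a pooled drpc connection.
type Conn interface {
	Invoke(ctx context.Context, rpc string, in, out interface{}) error
	Close() error
}

// Pool stands in for the per-peer connection pool.
type Pool struct {
	conns map[string]Conn
	dial  func(ctx context.Context, addr string) (Conn, error)
}

// markPeerDown stands in for whatever increments as_badgerauth_peer_down.
func markPeerDown(addr string) {}

func (p *Pool) peek(ctx context.Context, addr string, key []byte) ([]byte, error) {
	conn, ok := p.conns[addr]
	if !ok {
		// No usable connection: redial. This is the path where a healthy peer
		// can still end up being counted as down.
		var err error
		conn, err = p.dial(ctx, addr)
		if err != nil {
			markPeerDown(addr)
			return nil, err
		}
		p.conns[addr] = conn
	}

	var out []byte
	err := conn.Invoke(ctx, "Peek", key, &out)
	if ctx.Err() != nil {
		// This peek lost the race and was cancelled on purpose, but the
		// cancellation also invalidates the underlying connection, so it gets
		// closed and dropped instead of being reused, forcing a redial on the
		// next peek to this peer.
		conn.Close()
		delete(p.conns, addr)
		return nil, ctx.Err()
	}
	return out, err
}
```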
Acceptance criteria:
Logs do not show intentional peek cancellation errors.
The as_badgerauth_peer_down metric is not reported when we cancel peek requests.
Figuring out a better strategy in DRPC for connection reuse in the presence of cancellation seems to be more and more important, and this seems like a great example problem to work on because we control all the components for debugging and inspection and it happens frequently. I think DRPC is currently a little too eager to close connections and could maybe delay the decision to close until right before the connection is about to be used.
Happy to help out with this problem if it can be solved by improving DRPC cancellation 😄
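A minimal sketch of that idea, reusing the illustrative `Conn` interface from the sketch above (again, not the actual drpc pool): a cancelled peek only flags the connection as suspect, and the close/redial decision happens lazily on the next use.

```go
package authpeek

import "context"

// lazyConn wraps a pooled connection together with a flag that a cancelled
// RPC may have broken it.
type lazyConn struct {
	conn    Conn
	suspect bool
}

type lazyPool struct {
	conns map[string]*lazyConn
	dial  func(ctx context.Context, addr string) (Conn, error)
}

// markSuspect is what a cancelled peek would call instead of closing the
// connection outright.
func (p *lazyPool) markSuspect(addr string) {
	if lc, ok := p.conns[addr]; ok {
		lc.suspect = true
	}
}

// get defers the close/redial decision until the connection is about to be
// used again, when it can be cheaply checked (a hypothetical Ping RPC stands
// in for that check).
func (p *lazyPool) get(ctx context.Context, addr string) (Conn, error) {
	if lc, ok := p.conns[addr]; ok {
		if !lc.suspect {
			return lc.conn, nil
		}
		if err := lc.conn.Invoke(ctx, "Ping", nil, nil); err == nil {
			lc.suspect = false // still healthy: keep reusing it
			return lc.conn, nil
		}
		lc.conn.Close()
		delete(p.conns, addr)
	}
	conn, err := p.dial(ctx, addr)
	if err != nil {
		return nil, err
	}
	p.conns[addr] = &lazyConn{conn: conn}
	return conn, nil
}
```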