[QUERY] #43677

Open
gauravsindhwani opened this issue Jan 2, 2025 · 3 comments
Labels
Client: This issue points to a problem in the data-plane of the library.
Cosmos
customer-reported: Issues that are reported by GitHub users external to the Azure organization.
needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team.
question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
Service Attention: Workflow: This issue is responsible by Azure service team.

Comments

gauravsindhwani commented Jan 2, 2025

Query/Question
Dear team,
We have a multi-write Cosmos setup across 3 regions. One of our services performs only read operations on the Cosmos DB and connects via Direct mode. During a recent outage the South Central US region was unreachable for a long time, and our expectation was that the SDK would take care of failing over to a secondary region. In our setup we provide more than one entry in preferredLocations, and enableEndpointDiscovery is set to true (a configuration sketch follows the diagnostics below). However, the failover did not happen and we continued to see request failures; the following Cosmos diagnostics appeared in our logs:

errorMessage:"Failed to find items, reason: {"innerErrorMessage":null,"cosmosDiagnostics":{"userAgent":"azsdk-java-cosmos/4.61.0 Linux/6.5.0-1025-azure JRE/17.0.11.0.101 tdc-read-service-azsdk-java-cosmos/4.61.0 Linux/6.5.0-1025-azure JRE/17.0.11.0.101:South Central US","activityId":"a8cc8943-c4d5-4a82-8a2c-f22ea35a6d53","requestLatencyInMs":16524,"requestStartTimeUTC":"2024-12-26T23:57:45.399215480Z","requestEndTimeUTC":"2024-12-26T23:58:01.923240249Z","responseStatisticsList":[],


"supplementalResponseStatisticsList":[],"addressResolutionStatistics":{"327507ea-f818-4682-8439-983b6a931018":{"startTimeUTC":"2024-12-26T23:57:45.399356837Z","endTimeUTC":"2024-12-26T23:57:45.907925093Z",

"targetEndpoint":"https://*****-southcentralus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FnjNSAA%3D%3D%2Fcolls%2FnjNSALZEqc0%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=241",

"exceptionMessage":"io.netty.handler.timeout.ReadTimeoutException",
"forceRefresh":false,
"forceCollectionRoutingMapRefresh":false,
"inflightRequest":false},

"addc800f-b00e-414f-a221-07031a7e6c39":{"startTimeUTC":"2024-12-26T23:57:45.908113Z","endTimeUTC":"2024-12-26T23:57:50.908545216Z","

targetEndpoint":"https://****-southcentralus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FnjNSAA%3D%3D%2Fcolls%2FnjNSALZEqc0%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=241",

"exceptionMessage":"io.netty.handler.timeout.ReadTimeoutException","forceRefresh":false,"forceCollectionRoutingMapRefresh":false,"inflightRequest":false},"295c3cac-d503-459c-bf5c-aa39b5d580a7":{"startTimeUTC":"2024-12-26T23:57:51.908818850Z","endTimeUTC":"2024-12-26T23:58:01.923104238Z","targetEndpoint":"https://*****-southcentralus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FnjNSAA%3D%3D%2Fcolls%2FnjNSALZEqc0%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=241","exceptionMessage":"io.netty.handler.timeout.ReadTimeoutException","forceRefresh":false,"forceCollectionRoutingMapRefresh":false,"inflightRequest":false}},

"regionsContacted":["south central us"],

"retryContext":{"statusAndSubStatusCodes":[[408,10002]],

"retryCount":1,"retryLatency":0},

"metadataDiagnosticsContext":{"metadataDiagnosticList":null},"serializationDiagnosticsContext":{"serializationDiagnosticsList":null},"

gatewayStatisticsList":[{"sessionToken":null,

"operationType":"Query","resourceType":"Document",

"statusCode":408,"subStatusCode":10002,

"requestCharge":0.0,"requestTimeline":null,

"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"exceptionResponseHeaders":"{x-ms-substatus=10002}"},{"sessionToken":null,"operationType":"Query","resourceType":"Document","statusCode":408,"subStatusCode":10002,"requestCharge":0.0,"requestTimeline":null,"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"exceptionResponseHeaders":"{x-ms-substatus=10002}"},{"sessionToken":null,"operationType":"Query","resourceType":"Document","statusCode":408,"subStatusCode":10002,"requestCharge":0.0,"requestTimeline":null,"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"exceptionResponseHeaders":"{x-ms-substatus=10002}"}],"samplingRateSnapshot":1.0,"bloomFilterInsertionCountSnapshot":0,"systemInformation":{"usedMemory":"1728512 KB","availableMemory":"3514368 KB","systemCpuLoad":"(2024-12-26T23:57:33.160185621Z 11.9%), (2024-12-26T23:57:38.160172947Z 11.8%), (2024-12-26T23:57:43.160175609Z 9.8%), (2024-12-26T23:57:48.160200653Z 14.6%), (2024-12-26T23:57:53.160199819Z 53.8%), (2024-12-26T23:57:58.160170415Z 10.0%)",

"availableProcessors":4},

"clientCfgs":{"id":1,"machineId":"vmId_98da20cb-0728-401f-b766-29a691fe2270","connectionMode":"DIRECT","numberOfClients":1,"excrgns":"[]",

"clientEndpoints":{"https://*****.documents.azure.com:443/":1},"connCfg":{"rntbd":"(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT3H, mcpe:400, mrpc:40, cer:true)","gw":"(cps:1000, nrto:PT1M, icto:PT1M, p:false)","other":"(ed: true, cs: false, rv: true)"},"consistencyCfg":"(consistency: Session, mm: true, prgns: [southcentralus,westus2])","proactiveInitCfg":"","e2ePolicyCfg":"{e2eto=PT15S, as=}",

"sessionRetryCfg":"(rsh:REMOTE_REGION_PREFERRED, minrrt:PT0.5S, maxrrc:1)"}}}"

Browsing through the SDK code I came across this check:

Exceptions.isSubStatusCode(clientException, HttpConstants.SubStatusCodes.GATEWAY_ENDPOINT_READ_TIMEOUT)

It is not clear from the code how the failover to an alternate region happens in the case of GATEWAY_ENDPOINT_READ_TIMEOUT, which is the sub-status code that appeared in our logs. I am not even sure a retry was attempted.
Can you please advise how we should set up the Cosmos SDK so that gateway read-timeout failures are handled by failing over to an alternate preferred location? I see there is different handling when the gateway endpoint is unavailable, but we did not see that error.
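
For reference, this is roughly how we classify the failure on our side today (a minimal sketch; 408/10002 is the status/sub-status pair seen in the diagnostics above, and the helper name is just illustrative):

```java
import com.azure.cosmos.CosmosException;

public final class FailureClassifier {
    // Sketch: recognise the gateway endpoint read timeout seen in the diagnostics
    // above (statusCode 408 with subStatusCode 10002).
    public static boolean isGatewayEndpointReadTimeout(Throwable t) {
        if (t instanceof CosmosException) {
            CosmosException ce = (CosmosException) t;
            return ce.getStatusCode() == 408 && ce.getSubStatusCode() == 10002;
        }
        return false;
    }
}
```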

Why is this not a Bug or a Feature Request?
It is not yet clear whether the code is behaving as expected or needs a fix.

Setup (please complete the following information if applicable):

  • Library/Libraries: com.azure:azure-cosmos:4.61.0

github-actions bot commented Jan 2, 2025

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar @pjohari-ms @TheovanKraay.

TheovanKraay (Member) commented:

By default, client-level "failover" with respect to preferred regions will only occur if there is a total regional outage, that is, where no partitions are available in the region and the regional endpoint itself is not even reachable. "Partial" regional outages, on the other hand, result in cross-region retries (for errors that are retriable); in the case of timeouts this can effectively look like an outage from the client perspective, even though preferred regions have been set, because requests are continually re-routed to the failing partition. Handling a broader spectrum of scenarios, improving high availability and/or tail latency, and mitigating transient errors or partial outages where the region as a whole is still reachable may require different strategies.

The Java SDK does provide two advanced strategies out of the box: a threshold-based availability strategy and a partition-level circuit breaker. These strategies can mitigate or eliminate errors even in partial-outage scenarios, but they are opt-in configurations because they have inherent trade-offs that need to be considered.
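
For illustration, a minimal sketch of opting into the threshold-based availability strategy at the client level (the durations, endpoint, and key are placeholders to tune for your workload; the partition-level circuit breaker is enabled separately via SDK configuration, so check the SDK documentation for your version for the exact setting):

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosEndToEndOperationLatencyPolicyConfig;
import com.azure.cosmos.CosmosEndToEndOperationLatencyPolicyConfigBuilder;
import com.azure.cosmos.ThresholdBasedAvailabilityStrategy;

import java.time.Duration;
import java.util.Arrays;

public class AvailabilityStrategyExample {
    public static void main(String[] args) {
        // End-to-end operation timeout plus a threshold-based availability strategy:
        // if the first preferred region has not responded within the threshold,
        // a parallel (hedged) request is sent to the next preferred region.
        CosmosEndToEndOperationLatencyPolicyConfig e2ePolicy =
            new CosmosEndToEndOperationLatencyPolicyConfigBuilder(Duration.ofSeconds(5))
                .availabilityStrategy(new ThresholdBasedAvailabilityStrategy(
                    Duration.ofMillis(500),   // threshold before hedging (placeholder)
                    Duration.ofMillis(100)))  // step between additional hedged requests (placeholder)
                .build();

        CosmosClient client = new CosmosClientBuilder()
            .endpoint("<account-endpoint>")   // placeholder
            .key("<account-key>")             // placeholder
            .directMode()
            .preferredRegions(Arrays.asList("South Central US", "West US 2"))
            .endToEndOperationLatencyPolicyConfig(e2ePolicy)
            .buildClient();
    }
}
```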

gauravsindhwani (Author) commented:

@TheovanKraay Thank you for your reply and suggestions; we will definitely try them. But I want to understand the SDK design approach a little more. As you said, the current behaviour of the SDK is that even though the gateway endpoint is consistently timing out, as was the case during our outage, the SDK does not fail over. Is there a limitation on the SDK side that prevents building a failover strategy for this scenario? I ask because, from the SDK client's point of view, a Cosmos region is unreachable and the SDK would be expected to fail over; understanding sub-status codes and then interpreting the SDK's behaviour (for example, that in this case it will not fail over) should ideally be out of scope for end customers. If there is no such limitation, we can raise a feature request to fail over to a secondary region when the gateway is persistently timing out.
