[QUERY] #43677

Open
gauravsindhwani opened this issue Jan 2, 2025 · 3 comments
Labels
Client: This issue points to a problem in the data-plane of the library.
Cosmos
customer-reported: Issues that are reported by GitHub users external to the Azure organization.
needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team.
question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
Service Attention: Workflow: This issue is responsible by Azure service team.

Comments

gauravsindhwani commented Jan 2, 2025

Query/Question
Dear team,
We have a multi-write Cosmos setup across 3 regions. One of our services performs only read operations on the Cosmos DB and connects via Direct mode. During a recent outage the South Central US region was unreachable for a long time, and our expectation was that the SDK would take care of failing over to a secondary region. In our setup we provide more than one entry in preferredLocations, and enableEndpointDiscovery is set to true (a configuration sketch follows the diagnostics below). However, the failover did not happen and we continued to see request failures; the following Cosmos diagnostics appeared in our logs:

errorMessage:"Failed to find items, reason: {"innerErrorMessage":null,"cosmosDiagnostics":{"userAgent":"azsdk-java-cosmos/4.61.0 Linux/6.5.0-1025-azure JRE/17.0.11.0.101 tdc-read-service-azsdk-java-cosmos/4.61.0 Linux/6.5.0-1025-azure JRE/17.0.11.0.101:South Central US","activityId":"a8cc8943-c4d5-4a82-8a2c-f22ea35a6d53","requestLatencyInMs":16524,"requestStartTimeUTC":"2024-12-26T23:57:45.399215480Z","requestEndTimeUTC":"2024-12-26T23:58:01.923240249Z","responseStatisticsList":[],


"supplementalResponseStatisticsList":[],"addressResolutionStatistics":{"327507ea-f818-4682-8439-983b6a931018":{"startTimeUTC":"2024-12-26T23:57:45.399356837Z","endTimeUTC":"2024-12-26T23:57:45.907925093Z",

"targetEndpoint":"https://*****-southcentralus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FnjNSAA%3D%3D%2Fcolls%2FnjNSALZEqc0%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=241",

"exceptionMessage":"io.netty.handler.timeout.ReadTimeoutException",
"forceRefresh":false,
"forceCollectionRoutingMapRefresh":false,
"inflightRequest":false},

"addc800f-b00e-414f-a221-07031a7e6c39":{"startTimeUTC":"2024-12-26T23:57:45.908113Z","endTimeUTC":"2024-12-26T23:57:50.908545216Z","

targetEndpoint":"https://****-southcentralus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FnjNSAA%3D%3D%2Fcolls%2FnjNSALZEqc0%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=241",

"exceptionMessage":"io.netty.handler.timeout.ReadTimeoutException","forceRefresh":false,"forceCollectionRoutingMapRefresh":false,"inflightRequest":false},"295c3cac-d503-459c-bf5c-aa39b5d580a7":{"startTimeUTC":"2024-12-26T23:57:51.908818850Z","endTimeUTC":"2024-12-26T23:58:01.923104238Z","targetEndpoint":"https://*****-southcentralus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FnjNSAA%3D%3D%2Fcolls%2FnjNSALZEqc0%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=241","exceptionMessage":"io.netty.handler.timeout.ReadTimeoutException","forceRefresh":false,"forceCollectionRoutingMapRefresh":false,"inflightRequest":false}},

"regionsContacted":["south central us"],

"retryContext":{"statusAndSubStatusCodes":[[408,10002]],

"retryCount":1,"retryLatency":0},

"metadataDiagnosticsContext":{"metadataDiagnosticList":null},"serializationDiagnosticsContext":{"serializationDiagnosticsList":null},"

gatewayStatisticsList":[{"sessionToken":null,

"operationType":"Query","resourceType":"Document",

"statusCode":408,"subStatusCode":10002,

"requestCharge":0.0,"requestTimeline":null,

"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"exceptionResponseHeaders":"{x-ms-substatus=10002}"},{"sessionToken":null,"operationType":"Query","resourceType":"Document","statusCode":408,"subStatusCode":10002,"requestCharge":0.0,"requestTimeline":null,"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"exceptionResponseHeaders":"{x-ms-substatus=10002}"},{"sessionToken":null,"operationType":"Query","resourceType":"Document","statusCode":408,"subStatusCode":10002,"requestCharge":0.0,"requestTimeline":null,"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"exceptionResponseHeaders":"{x-ms-substatus=10002}"}],"samplingRateSnapshot":1.0,"bloomFilterInsertionCountSnapshot":0,"systemInformation":{"usedMemory":"1728512 KB","availableMemory":"3514368 KB","systemCpuLoad":"(2024-12-26T23:57:33.160185621Z 11.9%), (2024-12-26T23:57:38.160172947Z 11.8%), (2024-12-26T23:57:43.160175609Z 9.8%), (2024-12-26T23:57:48.160200653Z 14.6%), (2024-12-26T23:57:53.160199819Z 53.8%), (2024-12-26T23:57:58.160170415Z 10.0%)",

"availableProcessors":4},

"clientCfgs":{"id":1,"machineId":"vmId_98da20cb-0728-401f-b766-29a691fe2270","connectionMode":"DIRECT","numberOfClients":1,"excrgns":"[]",

"clientEndpoints":{"https://*****.documents.azure.com:443/":1},"connCfg":{"rntbd":"(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT3H, mcpe:400, mrpc:40, cer:true)","gw":"(cps:1000, nrto:PT1M, icto:PT1M, p:false)","other":"(ed: true, cs: false, rv: true)"},"consistencyCfg":"(consistency: Session, mm: true, prgns: [southcentralus,westus2])","proactiveInitCfg":"","e2ePolicyCfg":"{e2eto=PT15S, as=}",

"sessionRetryCfg":"(rsh:REMOTE_REGION_PREFERRED, minrrt:PT0.5S, maxrrc:1)"}}}"

Browsing through the SDK code I came across this check:

Exceptions.isSubStatusCode(clientException, HttpConstants.SubStatusCodes.GATEWAY_ENDPOINT_READ_TIMEOUT)

It is not clear from the code how the failover to an alternate region happens in the case of GATEWAY_ENDPOINT_READ_TIMEOUT, which is the sub-status code that appeared in our logs. I am not even sure a retry was attempted.
Can you please advise how we should set up the Cosmos SDK so that gateway read-timeout failures are handled by failing over to an alternate preferred location? I see there is different handling when the gateway endpoint is unavailable, but we did not see that error.
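
For reference, this is roughly how we classify the failure on our side today (a minimal sketch; 408/10002 is the status/sub-status pair seen in the diagnostics above, and the helper name is just illustrative):

```java
import com.azure.cosmos.CosmosException;

public final class FailureClassifier {
    // Sketch: recognise the gateway endpoint read timeout seen in the diagnostics
    // above (statusCode 408 with subStatusCode 10002).
    public static boolean isGatewayEndpointReadTimeout(Throwable t) {
        if (t instanceof CosmosException) {
            CosmosException ce = (CosmosException) t;
            return ce.getStatusCode() == 408 && ce.getSubStatusCode() == 10002;
        }
        return false;
    }
}
```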

Why is this not a Bug or a Feature Request?
It is not yet clear whether the code is behaving as expected or needs a fix.

Setup (please complete the following information if applicable):

  • Library/Libraries: com.azure:azure-cosmos:4.61.0

github-actions bot commented Jan 2, 2025

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar @pjohari-ms @TheovanKraay.

TheovanKraay (Member) commented:

By default, client-level "failover" with respect to preferred regions will only occur if there is a total regional outage, that is, where no partitions are available in the region and the regional endpoint itself is not even reachable. "Partial" regional outages, on the other hand, result in cross-region retries (for errors that are retriable); in the case of timeouts this can effectively look like an outage from the client perspective, even though preferred regions have been set, because requests are continually re-routed to the failing partition. Handling a broader spectrum of scenarios, improving high availability and/or tail latency, and mitigating transient errors or partial outages where the region as a whole is still reachable may require different strategies.

The Java SDK does provide two advanced strategies out of the box: a threshold-based availability strategy and a partition-level circuit breaker. These strategies can mitigate or eliminate errors even in partial-outage scenarios, but they are opt-in configurations because they have inherent trade-offs that need to be considered.
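
For illustration, a minimal sketch of opting into the threshold-based availability strategy at the client level (the durations, endpoint, and key are placeholders to tune for your workload; the partition-level circuit breaker is enabled separately via SDK configuration, so check the SDK documentation for your version for the exact setting):

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosEndToEndOperationLatencyPolicyConfig;
import com.azure.cosmos.CosmosEndToEndOperationLatencyPolicyConfigBuilder;
import com.azure.cosmos.ThresholdBasedAvailabilityStrategy;

import java.time.Duration;
import java.util.Arrays;

public class AvailabilityStrategyExample {
    public static void main(String[] args) {
        // End-to-end operation timeout plus a threshold-based availability strategy:
        // if the first preferred region has not responded within the threshold,
        // a parallel (hedged) request is sent to the next preferred region.
        CosmosEndToEndOperationLatencyPolicyConfig e2ePolicy =
            new CosmosEndToEndOperationLatencyPolicyConfigBuilder(Duration.ofSeconds(5))
                .availabilityStrategy(new ThresholdBasedAvailabilityStrategy(
                    Duration.ofMillis(500),   // threshold before hedging (placeholder)
                    Duration.ofMillis(100)))  // step between additional hedged requests (placeholder)
                .build();

        CosmosClient client = new CosmosClientBuilder()
            .endpoint("<account-endpoint>")   // placeholder
            .key("<account-key>")             // placeholder
            .directMode()
            .preferredRegions(Arrays.asList("South Central US", "West US 2"))
            .endToEndOperationLatencyPolicyConfig(e2ePolicy)
            .buildClient();
    }
}
```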

gauravsindhwani (Author) commented:

@TheovanKraay Thank you for your reply and suggestions; we will definitely try them. But I want to understand the SDK design approach a little more. As you said, the current behaviour of the SDK is that even though the gateway endpoint is consistently timing out, as was the case during our outage, the SDK does not fail over. Is there a limitation on the SDK side that prevents building a failover strategy for this scenario? I ask because, from the SDK client's point of view, a Cosmos region is unreachable and the SDK would be expected to fail over; understanding sub-status codes and then interpreting the SDK's behaviour (for example, that in this case it will not fail over) should ideally be out of scope for end customers. If there is no such limitation, we can raise a feature request to fail over to a secondary region when the gateway is persistently timing out.
