Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Internal] Diagnostics: Adds Merge API that combines several CosmosTraceDiagnostics Instances #4175

Closed
wants to merge 16 commits into from

Conversation

NaluTripician
Copy link
Contributor

@NaluTripician NaluTripician commented Nov 7, 2023

Pull Request Template

Description

This PR adds a constructor for CosmosTraceDiagnostics that takes in a list of CosmosTraceDiagnostics and merges them into one. This feature will be used for when parallel request hedging is added to the SDK as a way to have the full context for a request that sends out multiple parallel requests.

The Merged Diagnostics will be broken down into the following:

  • A summary of all the requests made + status codes (this exists in the normal diagnostics)
  • A name that will make it clear that this is a multi request trace + a GUID that wraps all requests
  • Start Time
  • Duration for all requests
  • Data
    • Client configuration
    • Total request charge
  • Children containing full diagnostics strings minus client configuration for each request
  • Multi Request Trace Reason (Parallel Request Hedging for example)

Other things that can be added to data

  • All regions contacted

Example of Merged Diagnostics in Current form as a proof of concept (here there are two merged read requests to the emulator):

{
  "Summary": {
    "DirectCalls": {
      "(200, 0)": 2
    }
  },
  "name": "Multi-request Trace Instance: 806ef11f-4051-4655-b516-fc1ab63b319d",
  "start datetime": "2024-01-11T22:36:55.501Z",
  "duration in milliseconds": 62.3775,
  "data": {
    "Client Configuration": {
      "Client Created Time Utc": "2024-01-11T22:36:53.6973513Z",
      "MachineId": "hashedMachineName:b06bba98-6c97-3610-9de4-1b586eee7a34",
      "NumberOfClientsCreated": 1,
      "NumberOfActiveClients": 1,
      "ConnectionMode": "Direct",
      "User Agent": "cosmos-netstandard-sdk/3.37.1|1|X64|Microsoft Windows 10.0.22621|.NET 6.0.26|N|F 00000010|",
      "ConnectionConfig": {
        "gw": "(cps:50, urto:6, p:False, httpf: True)",
        "rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
        "other": "(ed:False, be:False)"
      },
      "ConsistencyConfig": "(consistency: NotSet, prgns:[], apprgn: )",
      "ProcessorCount": 12
    },
    "totalRequestCharge": 2
  },
  "children": [
    {
      "name": "ReadItemAsync",
      "duration in milliseconds": 54.6068,
      "children": [
        {
          "name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
          "duration in milliseconds": 53.285,
          "children": [
            {
              "name": "Get Collection Cache",
              "duration in milliseconds": 0.0006
            },
            {
              "name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
              "duration in milliseconds": 53.2013,
              "data": {
                "System Info": {
                  "systemHistory": [
                    {
                      "dateUtc": "2024-01-11T22:36:53.8545006Z",
                      "cpu": 1.651,
                      "memory": 40170952,
                      "threadInfo": {
                        "isThreadStarving": "no info",
                        "availableThreads": 32764,
                        "minThreads": 12,
                        "maxThreads": 32767
                      },
                      "numberOfOpenTcpConnection": 0
                    }
                  ]
                }
              },
              "children": [
                {
                  "name": "Microsoft.Azure.Cosmos.Handlers.TelemetryHandler",
                  "duration in milliseconds": 53.1907,
                  "children": [
                    {
                      "name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
                      "duration in milliseconds": 53.186,
                      "children": [
                        {
                          "name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
                          "duration in milliseconds": 53.1759,
                          "children": [
                            {
                              "name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
                              "duration in milliseconds": 53.1739,
                              "children": [
                                {
                                  "name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
                                  "duration in milliseconds": 53.1183,
                                  "data": {
                                    "Client Side Request Stats": {
                                      "Id": "AggregatedClientSideRequestStatistics",
                                      "ContactedReplicas": [
                                        {
                                          "Count": 1,
                                          "Uri": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer8/partitions/a4cb4954-38c8-11e6-8106-8cdcd42c33be/replicas/1p/"
                                        }
                                      ],
                                      "RegionsContacted": [],
                                      "FailedReplicas": [],
                                      "AddressResolutionStatistics": [],
                                      "StoreResponseStatistics": [
                                        {
                                          "ResponseTimeUTC": "2024-01-11T22:36:55.5460396Z",
                                          "ResourceType": "Document",
                                          "OperationType": "Read",
                                          "RequestSessionToken": "0:-1#427",
                                          "LocationEndpoint": "https://127.0.0.1:8081/",
                                          "StoreResult": {
                                            "ActivityId": "b076b939-4bdd-4976-bd43-6379316e98c2",
                                            "StatusCode": "Ok",
                                            "SubStatusCode": "Unknown",
                                            "LSN": 427,
                                            "PartitionKeyRangeId": "0",
                                            "GlobalCommittedLSN": -1,
                                            "ItemLSN": 427,
                                            "UsingLocalLSN": true,
                                            "QuorumAckedLSN": 427,
                                            "SessionToken": "-1#427",
                                            "CurrentWriteQuorum": 1,
                                            "CurrentReplicaSetSize": 1,
                                            "NumberOfReadRegions": 0,
                                            "IsValid": true,
                                            "StorePhysicalAddress": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer8/partitions/a4cb4954-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                                            "RequestCharge": 1,
                                            "RetryAfterInMs": null,
                                            "BELatencyInMs": "0.387",
                                            "ReplicaHealthStatuses": [
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)"
                                            ],
                                            "transportRequestTimeline": {
                                              "requestTimeline": [
                                                {
                                                  "event": "Created",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5428408Z",
                                                  "durationInMs": 0.0099
                                                },
                                                {
                                                  "event": "ChannelAcquisitionStarted",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5428507Z",
                                                  "durationInMs": 0.0084
                                                },
                                                {
                                                  "event": "Pipelined",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5428591Z",
                                                  "durationInMs": 0.3004
                                                },
                                                {
                                                  "event": "Transit Time",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5431595Z",
                                                  "durationInMs": 0.7603
                                                },
                                                {
                                                  "event": "Received",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5439198Z",
                                                  "durationInMs": 0.6171
                                                },
                                                {
                                                  "event": "Completed",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5445369Z",
                                                  "durationInMs": 0
                                                }
                                              ],
                                              "serviceEndpointStats": {
                                                "inflightRequests": 1,
                                                "openConnections": 1
                                              },
                                              "connectionStats": {
                                                "waitforConnectionInit": "False",
                                                "callsPendingReceive": 0,
                                                "lastSendAttempt": "2024-01-11T22:36:55.4832006Z",
                                                "lastSend": "2024-01-11T22:36:55.4833937Z",
                                                "lastReceive": "2024-01-11T22:36:55.4992622Z"
                                              },
                                              "requestSizeInBytes": 489,
                                              "responseMetadataSizeInBytes": 507,
                                              "responseBodySizeInBytes": 728
                                            },
                                            "TransportException": null
                                          }
                                        }
                                      ]
                                    }
                                  }
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        {
          "name": "Response Serialization",
          "duration in milliseconds": 0.1103
        }
      ]
    },
    {
      "name": "ReadItemAsync",
      "duration in milliseconds": 7.6076,
      "data": {
        "Additional Request Context": "Hedged Request"
      },
      "children": [
        {
          "name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
          "duration in milliseconds": 7.4155,
          "children": [
            {
              "name": "Get Collection Cache",
              "duration in milliseconds": 0.0004
            },
            {
              "name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
              "duration in milliseconds": 7.3286,
              "data": {
                "System Info": {
                  "systemHistory": [
                    {
                      "dateUtc": "2024-01-11T22:36:53.8545006Z",
                      "cpu": 1.651,
                      "memory": 40170952,
                      "threadInfo": {
                        "isThreadStarving": "no info",
                        "availableThreads": 32764,
                        "minThreads": 12,
                        "maxThreads": 32767
                      },
                      "numberOfOpenTcpConnection": 0
                    }
                  ]
                }
              },
              "children": [
                {
                  "name": "Microsoft.Azure.Cosmos.Handlers.TelemetryHandler",
                  "duration in milliseconds": 7.3116,
                  "children": [
                    {
                      "name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
                      "duration in milliseconds": 7.3029,
                      "children": [
                        {
                          "name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
                          "duration in milliseconds": 7.2826,
                          "children": [
                            {
                              "name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
                              "duration in milliseconds": 7.2768,
                              "children": [
                                {
                                  "name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
                                  "duration in milliseconds": 7.2212,
                                  "data": {
                                    "Client Side Request Stats": {
                                      "Id": "AggregatedClientSideRequestStatistics",
                                      "ContactedReplicas": [
                                        {
                                          "Count": 1,
                                          "Uri": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer17/partitions/a4cb495d-38c8-11e6-8106-8cdcd42c33be/replicas/1p/"
                                        }
                                      ],
                                      "RegionsContacted": [],
                                      "FailedReplicas": [],
                                      "AddressResolutionStatistics": [],
                                      "StoreResponseStatistics": [
                                        {
                                          "ResponseTimeUTC": "2024-01-11T22:36:55.5580289Z",
                                          "ResourceType": "Document",
                                          "OperationType": "Read",
                                          "RequestSessionToken": "1:-1#79",
                                          "LocationEndpoint": "https://127.0.0.1:8081/",
                                          "StoreResult": {
                                            "ActivityId": "6f68be8d-a717-4fa9-96fd-cfb4b1ecdc27",
                                            "StatusCode": "Ok",
                                            "SubStatusCode": "Unknown",
                                            "LSN": 79,
                                            "PartitionKeyRangeId": "1",
                                            "GlobalCommittedLSN": -1,
                                            "ItemLSN": 79,
                                            "UsingLocalLSN": true,
                                            "QuorumAckedLSN": 79,
                                            "SessionToken": "-1#79",
                                            "CurrentWriteQuorum": 1,
                                            "CurrentReplicaSetSize": 1,
                                            "NumberOfReadRegions": 0,
                                            "IsValid": true,
                                            "StorePhysicalAddress": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer17/partitions/a4cb495d-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                                            "RequestCharge": 1,
                                            "RetryAfterInMs": null,
                                            "BELatencyInMs": "0.395",
                                            "ReplicaHealthStatuses": [
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)",
                                              "(port: 10253 | status: Unknown | lkt: 1/11/2024 10:36:55 PM)"
                                            ],
                                            "transportRequestTimeline": {
                                              "requestTimeline": [
                                                {
                                                  "event": "Created",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5565064Z",
                                                  "durationInMs": 0.0044
                                                },
                                                {
                                                  "event": "ChannelAcquisitionStarted",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5565108Z",
                                                  "durationInMs": 0.0049
                                                },
                                                {
                                                  "event": "Pipelined",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5565157Z",
                                                  "durationInMs": 0.1835
                                                },
                                                {
                                                  "event": "Transit Time",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5566992Z",
                                                  "durationInMs": 0.7877
                                                },
                                                {
                                                  "event": "Received",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5574869Z",
                                                  "durationInMs": 0.5322
                                                },
                                                {
                                                  "event": "Completed",
                                                  "startTimeUtc": "2024-01-11T22:36:55.5580191Z",
                                                  "durationInMs": 0
                                                }
                                              ],
                                              "serviceEndpointStats": {
                                                "inflightRequests": 1,
                                                "openConnections": 1
                                              },
                                              "connectionStats": {
                                                "waitforConnectionInit": "False",
                                                "callsPendingReceive": 0,
                                                "lastSendAttempt": "2024-01-11T22:36:55.5429560Z",
                                                "lastSend": "2024-01-11T22:36:55.5431745Z",
                                                "lastReceive": "2024-01-11T22:36:55.5439311Z"
                                              },
                                              "requestSizeInBytes": 493,
                                              "responseMetadataSizeInBytes": 506,
                                              "responseBodySizeInBytes": 728
                                            },
                                            "TransportException": null
                                          }
                                        }
                                      ]
                                    }
                                  }
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        {
          "name": "Response Serialization",
          "duration in milliseconds": 0.1099
        }
      ]
    }
  ]
}

Type of change

Please delete options that are not relevant.

  • [] New feature (non-breaking change which adds functionality)
  • [] This change requires a documentation update

Closing issues

To automatically close an issue: closes #4179

@NaluTripician NaluTripician changed the title Diagnostics: Adds constructor that merges several CosmosTraceDiagnostics Instances Diagnostics: Merge API that combines several CosmosTraceDiagnostics Instances Nov 27, 2023
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@NaluTripician NaluTripician changed the title Diagnostics: Merge API that combines several CosmosTraceDiagnostics Instances Diagnostics: Adds Merge API that combines several CosmosTraceDiagnostics Instances Nov 27, 2023
@sourabh1007
Copy link
Contributor

  1. Just a suggestion, in this PR we are getting multiple diagnostics and then merging them into one. Did you explore, if it is possible to do it during collection itself and keep logic to generate JSON generic?
  2. Do we need operation name in each block?
  3. can you please link a document which talks about the feature.

@NaluTripician
Copy link
Contributor Author

  1. Just a suggestion, in this PR we are getting multiple diagnostics and then merging them into one. Did you explore, if it is possible to do it during collection itself and keep logic to generate JSON generic?
  2. Do we need operation name in each block?
  3. can you please link a document which talks about the feature.
  1. Yes, for parallel request hedging, we need to send out the parallel requests on a layer before the diagnostics are created. This also aligns with what the Java SDK does.
  2. For operaton name "child" diagnostic needs to have a name. Currently the default name is the operation name.
  3. I have linked the work item for this feature in the PR discription (Diagnostics Combination #4179 )

@NaluTripician NaluTripician self-assigned this Jan 9, 2024
this.traceDiagnostics = traceDiagnostics;
this.isMergedDiagnostics = true;

TraceSummary traceSummary = new TraceSummary();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the summary for every merge?

ex:
ReadItem -> Inside RequestInvoker, will it include a summary?
Queries -> Which might go multiple NW interactions, will every interaction have its own summary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summar of offline sync-up: For initial version having the simple version with hint/context to dis-ambiguate if it's a hedging or not is good enough.

Copy link
Member

@kirankumarkolli kirankumarkolli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope: Lets just have the hint about retrycontext

@NaluTripician NaluTripician changed the title Diagnostics: Adds Merge API that combines several CosmosTraceDiagnostics Instances [Internal] Diagnostics: Adds Merge API that combines several CosmosTraceDiagnostics Instances Jan 18, 2024
@kirankumarkolli
Copy link
Member

If the hedging is from ReqeustInvoker, then clientconfig will not be present there right?
Just adding the traces should do to get started with right?

Diagnostics might get repeated but that's something we can follow-up later.
Thoughts?

@kirankumarkolli
Copy link
Member

If the hedging is from ReqeustInvoker, then clientconfig will not be present there right? Just adding the traces should do to get started with right?

Diagnostics might get repeated but that's something we can follow-up later. Thoughts?

Follow-up from offline sync-up: Start with how the hedged diagnostics look like. Mostly only system info might be repeated and for initial preview we can live with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Diagnostics Combination
4 participants