
chore(telemetry): integration exception tracking #11732

Open · wants to merge 10 commits into main from ygree/integration-exception-tracking
Conversation


@ygree ygree commented Dec 13, 2024

This change is part of the cross tracer Integration Exception Tracking initiative:

This project implements, across all tracers, a mechanism to capture errors generated by the tracer itself and transmit them to Datadog. Once received, these errors can be fixed and the health of the tracers improved.

This change captures integration-related errors and reports them to the telemetry backend. It does this by implementing a DDTelemetryLogger that is used by the instrumentation code in the ddtrace.contrib package and only captures log records at level error or higher, or records that have an exception attached.

The motivation for DDTelemetryLogger is that the logger is the natural way to report integration-specific errors. This avoids duplicating the reporting logic and helps ensure that such exceptions are not forgotten when reporting to telemetry. Here are some numbers on the ddtrace.contrib package log statements:

  • ~118 log.debug (some of them in exception catchers, some have exc_info=True)
  • ~49 log.warning (some have exc_info=True)
  • ~7 log.exception

To avoid overloading the telemetry backend, and to minimize the overhead of traversing and formatting the traceback, DDTelemetryLogger introduces a rate limiter that will not report the same error more than once per 60-second heartbeat interval.
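
To make the filtering and rate-limiting behaviour concrete, here is a minimal sketch in the shape of a logging handler (rather than a logger subclass); the class name, the helper _send_to_telemetry, and the dedup key are illustrative assumptions, not the PR's actual DDTelemetryLogger implementation.

```python
# Minimal sketch only: names and the dedup key are illustrative assumptions,
# not the actual DDTelemetryLogger implementation from this PR.
import logging
import time


class TelemetryLogsHandler(logging.Handler):
    """Forward ERROR-or-higher records, or records carrying exc_info, to telemetry,
    reporting each distinct error at most once per heartbeat interval."""

    def __init__(self, heartbeat_interval=60.0):
        super().__init__()
        self._heartbeat_interval = heartbeat_interval
        self._last_sent = {}  # (logger name, line number) -> last report time

    def emit(self, record):
        if record.levelno < logging.ERROR and record.exc_info is None:
            return  # debug/warning records without an exception are not reported
        key = (record.name, record.lineno)
        now = time.monotonic()
        if now - self._last_sent.get(key, float("-inf")) < self._heartbeat_interval:
            return  # already reported within the current heartbeat interval
        self._last_sent[key] = now
        _send_to_telemetry(record)


def _send_to_telemetry(record):
    # Placeholder for handing the formatted message and traceback to the
    # telemetry writer.
    pass
```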

For the traceback, it does some processing (a rough sketch follows the list), including:

  • Replacing absolute paths with relative ones
  • Redacting traceback frames that are not part of the tracer and may belong to the client application
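
A rough sketch of that traceback processing; the function name and the exact redaction rule are assumptions for illustration, not the code added by this PR.

```python
# Illustrative sketch only: the helper name and redaction rule are assumptions,
# not the code added by this PR.
import os
import traceback


def format_redacted_traceback(exc, tracer_package="ddtrace"):
    """Render a traceback with absolute paths made relative and frames that
    do not belong to the tracer redacted."""
    lines = []
    for frame in traceback.extract_tb(exc.__traceback__):
        filename = os.path.relpath(frame.filename)  # absolute path -> relative path
        if tracer_package not in filename:
            lines.append("  <redacted frame>")  # likely client application code
            continue
        lines.append(
            '  File "%s", line %d, in %s\n    %s'
            % (filename, frame.lineno, frame.name, frame.line)
        )
    return "\n".join(lines)
```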

Jira ticket: AIDM-389

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@ygree ygree self-assigned this Dec 13, 2024
Contributor

github-actions bot commented Dec 13, 2024

CODEOWNERS have been resolved as:

ddtrace/internal/logger.py                                              @DataDog/apm-core-python
ddtrace/internal/telemetry/writer.py                                    @DataDog/apm-core-python


pr-commenter bot commented Dec 13, 2024

Benchmarks

Benchmark execution time: 2025-01-16 19:15:54

Comparing candidate commit f613c49 in PR branch ygree/integration-exception-tracking with baseline commit b028cc6 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

@ygree ygree marked this pull request as ready for review December 14, 2024 01:53
@ygree ygree requested a review from a team as a code owner December 14, 2024 01:53
@ygree ygree requested a review from erikayasuda December 14, 2024 01:53
ygree added 4 commits January 6, 2025 15:38
Collect, dedupe, ddtrace.contrib logs, and send to the telemetry.
Report only an error or an exception with a stack trace. Added tags and stack trace (without redaction)
@ygree ygree force-pushed the ygree/integration-exception-tracking branch from b11966f to ec8f7ca Compare January 6, 2025 23:57
Review comment on: class _TelemetryConfig:
Contributor

It looks like we are introducing telemetry-specific logic into a logging source. Can we try to see if there is a different design that allows keeping the two separate, please?

Author

@ygree ygree Jan 8, 2025

Not really "introducing", since some of this was already there to capture errors, and this change just extends it to exception tracking.
Alternatively, we would have to duplicate all the logging calls in the contrib modules just to have exception tracking, which is easy to forget to add and just introduces code duplication in the instrumentation code.

I'll consider adding a separate telemetry logger if you think that's a better solution. It will probably need to be in the same package, because my attempt to put it in a telemetry package ended with

ImportError: cannot import name 'get_logger' from partially initialized module 'ddtrace.internal.logger' (most likely due to circular import)

Author

I have introduced DDTelemetryLogger to separate concerns. Please let me know what you think about it.

Contributor

Great, thanks. I think we really need to move all telemetry-related code to the already existing telemetry sources. For instance, we already parse DD_INSTRUMENTATION_TELEMETRY_ENABLED in

self._telemetry_enabled = _get_config("DD_INSTRUMENTATION_TELEMETRY_ENABLED", True, asbool)
self._telemetry_heartbeat_interval = _get_config("DD_TELEMETRY_HEARTBEAT_INTERVAL", 60, float)
so there is no need to duplicate that logic here. In general we should avoid making tight coupling between components, or making them tighter. If logging and telemetry need to interact with each other, one will have to do it via an abstract interface that knows nothing about the other. Otherwise we will end up with circular reference issues. Perhaps @mabdinur can advise better on how to proceed here.
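
To make the decoupling suggestion concrete, one possible shape (all names hypothetical, none of them exist in ddtrace as-is) is for the logging module to expose a generic hook registry and for the telemetry module to register a callback at startup, so neither module imports the other at import time.

```python
# Hypothetical sketch of the "abstract interface" idea; none of these names
# exist in ddtrace as-is.

# In the logging module: a registry that knows nothing about telemetry.
_error_reporters = []


def register_error_reporter(callback):
    """Register a callable that receives log records describing errors."""
    _error_reporters.append(callback)


def _report_error(record):
    for reporter in _error_reporters:
        reporter(record)


# In the telemetry module: register a callback at startup instead of having
# the logger import telemetry (or vice versa).
def _forward_record_to_telemetry(record):
    pass  # placeholder for handing the record to the telemetry writer


register_error_reporter(_forward_record_to_telemetry)
```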

Author

Thank you for the feedback! While I agree with the general concern about coupling software components, I would appreciate some clarification and guidance on how the proposed improvements can be implemented effectively. My previous attempts to achieve this didn’t succeed, so your input would be invaluable.

Could you elaborate on what you mean by "all telemetry-related code"? Moving DDTelemetryLogger to the telemetry module isn’t straightforward because it is tightly coupled with DDLogger. Its primary functionality revolves around logging - extracting exceptions and passing them to the telemetry module. As a result, its logic and state are more closely tied to the logger than to telemetry itself.

Regarding the configuration, this is indeed a trade-off. Moving it to the telemetry module would result in circular dependency issues during initialization. Any suggestions on how to address these challenges while keeping the codebase clean and decoupled would be greatly appreciated.

Contributor

Hey Yury,

In ddtrace/contrib/, we define 0 error logs, 49 warning logs, and 118 debug logs (GitHub search). This accounts for only a small fraction of the errors that occur.

In most cases, when ddtrace instrumentation fails at runtime, an unhandled exception is raised. These exceptions are not captured by ddtrace loggers.

If an exception escapes a user's application and reaches the Python interpreter, it will be captured by the TelemetryWriter exception hook. Currently, this hook only captures startup errors, but it could be extended to capture exceptions raised during runtime.

Rather than defining a ddtrace logger primarily for debug logs, we could capture critical runtime integration errors directly using the telemetry exception hook. This approach decouples the telemetry writer from the core loggers and ensures that one error per failed process is captured, eliminating the need for rate limiting.

Would this approach capture the errors you're concerned about?

Additionally, I’m a big fan of using telemetry metrics where possible. Metrics are easier to send and ingest, have lower cardinality, and are generally simpler to monitor and analyze. While a metric wouldn’t provide the context of tracebacks, it would be valuable if we could define telemetry metrics to track integration health.
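
For reference, a rough sketch of the exception-hook approach described above; report_integration_error() is a hypothetical placeholder, and the real TelemetryWriter hook currently only captures startup errors.

```python
# Rough sketch only; report_integration_error() is a hypothetical placeholder.
import sys


def report_integration_error(exc_type, exc_value, exc_tb):
    pass  # placeholder for forwarding the error to the telemetry writer


def _telemetry_excepthook(exc_type, exc_value, exc_tb):
    tb = exc_tb
    while tb is not None:
        if "ddtrace" in tb.tb_frame.f_code.co_filename:
            # The failure involves ddtrace code: report one error per process.
            report_integration_error(exc_type, exc_value, exc_tb)
            break
        tb = tb.tb_next
    _original_excepthook(exc_type, exc_value, exc_tb)  # keep default behaviour


_original_excepthook = sys.excepthook
sys.excepthook = _telemetry_excepthook
```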

Author

Thanks for taking a look and sharing your thoughts, Munir!

I appreciate your suggestion, it makes perfect sense and would complement this effort well. Extending the telemetry exception hook to capture runtime errors in addition to startup errors would indeed provide valuable insight and ensure that critical errors are visible to us. It would be interesting to hear how the telemetry exception hook would need to be modified to do this, as I thought it already covered this.

However, I think this is a slightly different goal than the one addressed in this PR. I think that reporting caught exceptions in our instrumentation can still be valuable, even though most caught exceptions in the contrib code are currently logged at the debug level. While this approach ensures that they remain largely invisible to customers (which makes sense), these exceptions can still be very useful to us internally, particularly in identifying and improving potentially broken integration code.

Without this functionality, we remain unaware of the problems behind these caught exceptions, which is what this PR is intended to address. The primary consumer of this data would be our team, not end users. While uncaught exceptions are visible to users, caught exceptions, though less severe, can provide us with actionable insights to improve the product, and that is the idea behind this change. I hope this clarifies the intent and need behind the proposed changes.

@ygree ygree requested a review from P403n1x87 January 9, 2025 06:17
@ygree ygree requested a review from mabdinur January 16, 2025 18:20
@ygree ygree changed the title Integration Exception Tracking chore(telemetry): integration exception tracking Jan 16, 2025
@ygree ygree added the changelog/no-changelog A changelog entry is not required for this PR. label Jan 16, 2025