fix(profiling): fix SystemError when collecting memory profiler events #12075

nsrip-dd · 2025-01-24T16:01:25Z

We added locking to the memory profiler to address crashes. These locks
are mostly "try" locks, meaning we bail out if we can't acquire them
right away. This was done defensively to mitigate the possibility of
deadlock until we fully understood why the locks are needed and could
guarantee their correctness. But as a result of using try locks, the
iter_events function in particular can fail if the memory profiler lock
is contended when it tries to collect profiling events. The function
then returns NULL, leading to SystemError exceptions because we don't
set an error.
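
To illustrate the failure mode, here is a minimal Python sketch of a try lock that bails out under contention. It is not the `_memalloc.c` code: the `TryLockBuffer` class and its methods are hypothetical stand-ins, and returning `None` merely plays the role of the C function returning NULL.

```python
import threading


class TryLockBuffer:
    """Hypothetical stand-in for the profiler's event buffer (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def record(self, event):
        with self._lock:
            self._events.append(event)

    def try_collect(self):
        # Try lock: give up immediately if someone else holds the lock.
        if not self._lock.acquire(blocking=False):
            return None  # plays the role of the C function returning NULL
        try:
            events, self._events = self._events, []
            return events
        finally:
            self._lock.release()


buf = TryLockBuffer()
buf.record("alloc-1")

buf._lock.acquire()  # simulate another thread holding the lock
contended_result = buf.try_collect()  # collection fails while contended
buf._lock.release()

uncontended_result = buf.try_collect()  # succeeds once the lock is free
```

In the sketch, the contended call yields nothing even though an event is waiting in the buffer, which is the situation the paragraph above describes for iter_events.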

Even if we set an error, returning NULL isn't the right thing to do.
It'll basically mean we wait until the next profile iteration, still
accumulating events in the same buffer, and try again to upload the
events. So we're going to get multiple iterations' worth of events. The
right thing to do is take the lock unconditionally in iter_events. This
is safe because the only thing we're guarding is allocating a new sample
buffer, which doesn't call back into the Python object allocator, and
rotating out the pointer to the allocation buffer.
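
As a rough Python analogy of the unconditional approach (again a sketch, not the C implementation; `SampleBuffer` and `contender` are hypothetical names), the collector below blocks until the lock is free and then swaps in a fresh buffer, so each collection sees exactly one iteration's worth of events:

```python
import threading
import time


class SampleBuffer:
    """Hypothetical stand-in for the profiler's sample buffer (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def record(self, event):
        with self._lock:
            self._events.append(event)

    def collect(self):
        # Take the lock unconditionally: wait for it instead of bailing out.
        with self._lock:
            fresh = []  # allocate the replacement buffer
            events, self._events = self._events, fresh  # rotate the buffer out
        return events


buf = SampleBuffer()
buf.record("a")
buf.record("b")

holding = threading.Event()


def contender():
    # Hold the lock briefly to simulate contention from another thread.
    with buf._lock:
        holding.set()
        time.sleep(0.05)


t = threading.Thread(target=contender)
t.start()
holding.wait()  # make sure the lock is actually contended

first = buf.collect()  # blocks until the contender releases the lock
t.join()

buf.record("c")
second = buf.collect()  # only contains events recorded since the last rotation
```

The critical section does nothing but allocate and swap, which is the property the description relies on to argue that a blocking lock is safe here.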

Fixes #11831

TODO - regression test?

Checklist

PR author has checked that all the criteria below are met
The PR description includes an overview of the change
The PR description articulates the motivation for the change
The change includes tests OR the PR description describes a testing strategy
The PR description notes risks associated with the change, if any
Newly-added code is easy to change
The change follows the library release note guidelines
The change includes or references documentation updates if necessary
Backport labels are set (if applicable)

Reviewer Checklist

Reviewer has checked that all the criteria below are met
Title is accurate
All changes are related to the pull request's stated goal
Avoids breaking API changes
Testing strategy adequately addresses listed risks
Newly-added code is easy to change
Release note makes sense to a user of the library
If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
Backport labels are set in a manner that is consistent with the release branch maintenance policy


github-actions · 2025-01-24T16:02:02Z

CODEOWNERS have been resolved as:

releasenotes/notes/profiling-memalloc-iter-events-null-780fd50bbebbf616.yaml  @DataDog/apm-python
ddtrace/profiling/collector/_memalloc.c                                 @DataDog/profiling-python

P403n1x87 · 2025-01-24T16:12:34Z

ddtrace/profiling/collector/_memalloc.c

@@ -394,20 +394,18 @@ iterevents_new(PyTypeObject* type, PyObject* Py_UNUSED(args), PyObject* Py_UNUSE
    }

    IterEventsState* iestate = (IterEventsState*)type->tp_alloc(type, 0);
-    if (!iestate)
+    if (!iestate) {
+        PyErr_SetString(PyExc_RuntimeError, "failed to allocate IterEventsState");


Is the exception handled somewhere?

Yeah, it's handled here:

dd-trace-py/ddtrace/profiling/collector/memalloc.py

Lines 170 to 179 in b90fa38

    def collect(self):
        # TODO: The event timestamp is slightly off since it's going to be the time we copy the data from the
        # _memalloc buffer to our Recorder. This is fine for now, but we might want to store the nanoseconds
        # timestamp in C and then return it via iter_events.
        try:
            events_iter, count, alloc_count = _memalloc.iter_events()
        except RuntimeError:
            # DEV: This can happen if either _memalloc has not been started or has been stopped.
            LOG.debug("Unable to collect memory events from process %d", os.getpid(), exc_info=True)
            return tuple()

dd-trace-py/ddtrace/profiling/collector/memalloc.py

Lines 174 to 179 in 71e3997

    try:
        events_iter, count, alloc_count = _memalloc.iter_events()
    except RuntimeError:
        # DEV: This can happen if either _memalloc has not been started or has been stopped.
        LOG.debug("Unable to collect memory events from process %d", os.getpid(), exc_info=True)
        return tuple()

Yes, we do handle RuntimeError here.

datadog-dd-trace-py-rkomorn · 2025-01-24T16:16:37Z

Datadog Report

Branch report: nick.ripley/fix-iter-events-null
Commit report: 1f067a5
Test service: dd-trace-py

✅ 0 Failed, 130 Passed, 1468 Skipped, 4m 40s Total duration (36m 30.45s time saved)

pr-commenter · 2025-01-24T16:41:13Z

Benchmarks

Benchmark execution time: 2025-01-24 16:41:11

Comparing candidate commit 1f067a5 in PR branch nick.ripley/fix-iter-events-null with baseline commit 9e87349 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

nsrip-dd added backport 2.18 backport 2.19 backport 2.20 labels Jan 24, 2025

nsrip-dd marked this pull request as ready for review January 24, 2025 16:10

nsrip-dd requested review from a team as code owners January 24, 2025 16:10

nsrip-dd requested review from juanjux and Yun-Kim January 24, 2025 16:10

P403n1x87 reviewed Jan 24, 2025
