Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(profiling): fix SystemError when collecting memory profiler events #12075

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nsrip-dd
Copy link
Contributor

We added locking to the memory profiler to address crashes. These locks
are mostly "try" locks, meaning we bail out if we can't acquire them
right away. This was done defensively to mitigate the possibility of
deadlock until we fully understood why the locks are needed and could
guarantee their correctness. But as a result of using try locks, the
iter_events function in particular can fail if the memory profiler lock
is contended when it tries to collect profiling events. The function
then returns NULL, leading to SystemError exceptions because we don't
set an error.

Even if we set an error, returning NULL isn't the right thing to do.
It'll basically mean we wait until the next profile iteration, still
accumulating events in the same buffer, and try again to upload the
events. So we're going to get multiple iteration's worth of events. The
right thing to do is take the lock unconditionally in iter_events. This
is safe because the only thing we're guarding is allocating a new sample
buffer, which doesn't call back into the python object allocator, and
rotating out the pointer to the allocation buffer.

Fixes #11831

TODO - regression test?

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

We added locking to the memory profiler to address crashes. These locks
are mostly "try" locks, meaning we bail out if we can't acquire them
right away. This was done defensively to mitigate the possibility of
deadlock until we fully understood why the locks are needed and could
guarantee their correctness. But as a result of using try locks, the
iter_events function in particular can fail if the memory profiler lock
is contended when it tries to collect profiling events. The function
then returns NULL, leading to SystemError exceptions because we don't
set an error.

Even if we set an error, returning NULL isn't the right thing to do.
It'll basically mean we wait until the next profile iteration, still
accumulating events in the same buffer, and try again to upload the
events. So we're going to get multiple iteration's worth of events. The
right thing to do is take the lock unconditionally in iter_events. This
is safe because the only thing we're guarding is allocating a new sample
buffer, which doesn't call back into the python object allocator, and
rotating out the pointer to the allocation buffer.

Fixes #11831
Copy link
Contributor

CODEOWNERS have been resolved as:

releasenotes/notes/profiling-memalloc-iter-events-null-780fd50bbebbf616.yaml  @DataDog/apm-python
ddtrace/profiling/collector/_memalloc.c                                 @DataDog/profiling-python

@nsrip-dd nsrip-dd marked this pull request as ready for review January 24, 2025 16:10
@nsrip-dd nsrip-dd requested review from a team as code owners January 24, 2025 16:10
@nsrip-dd nsrip-dd requested review from juanjux and Yun-Kim January 24, 2025 16:10
@@ -394,20 +394,18 @@ iterevents_new(PyTypeObject* type, PyObject* Py_UNUSED(args), PyObject* Py_UNUSE
}

IterEventsState* iestate = (IterEventsState*)type->tp_alloc(type, 0);
if (!iestate)
if (!iestate) {
PyErr_SetString(PyExc_RuntimeError, "failed to allocate IterEventsState");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the exception handled somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's handled here:

def collect(self):
# TODO: The event timestamp is slightly off since it's going to be the time we copy the data from the
# _memalloc buffer to our Recorder. This is fine for now, but we might want to store the nanoseconds
# timestamp in C and then return it via iter_events.
try:
events_iter, count, alloc_count = _memalloc.iter_events()
except RuntimeError:
# DEV: This can happen if either _memalloc has not been started or has been stopped.
LOG.debug("Unable to collect memory events from process %d", os.getpid(), exc_info=True)
return tuple()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try:
events_iter, count, alloc_count = _memalloc.iter_events()
except RuntimeError:
# DEV: This can happen if either _memalloc has not been started or has been stopped.
LOG.debug("Unable to collect memory events from process %d", os.getpid(), exc_info=True)
return tuple()

Yes, we do handle RuntimeError here.

@datadog-dd-trace-py-rkomorn
Copy link

Datadog Report

Branch report: nick.ripley/fix-iter-events-null
Commit report: 1f067a5
Test service: dd-trace-py

✅ 0 Failed, 130 Passed, 1468 Skipped, 4m 40s Total duration (36m 30.45s time saved)

@pr-commenter
Copy link

pr-commenter bot commented Jan 24, 2025

Benchmarks

Benchmark execution time: 2025-01-24 16:41:11

Comparing candidate commit 1f067a5 in PR branch nick.ripley/fix-iter-events-null with baseline commit 9e87349 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: SystemError: <class 'iter_events'> returned NULL without setting an exception
3 participants