fix(profiling): fix SystemError when collecting memory profiler events #12075
base: main
We added locking to the memory profiler to address crashes. These locks are mostly "try" locks, meaning we bail out if we can't acquire them right away. This was done defensively to mitigate the possibility of deadlock until we fully understood why the locks are needed and could guarantee their correctness. But as a result of using try locks, the iter_events function in particular can fail if the memory profiler lock is contended when it tries to collect profiling events. The function then returns NULL, leading to SystemError exceptions because we don't set an error.

Even if we set an error, returning NULL isn't the right thing to do. It would mean we wait until the next profile iteration, still accumulating events in the same buffer, and then try again to upload the events, so a single upload would carry multiple iterations' worth of events. The right thing to do is take the lock unconditionally in iter_events. This is safe because the only things we're guarding are allocating a new sample buffer, which doesn't call back into the Python object allocator, and rotating out the pointer to the allocation buffer.

Fixes #11831
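The difference between the two locking strategies can be sketched in Python with a toy model. This is not the profiler's C code: `SampleBuffer`, `record`, `collect_try`, and `collect_blocking` are hypothetical names invented for illustration, and `threading.Lock` stands in for the memory profiler lock.

```python
import threading


class SampleBuffer:
    """Toy stand-in for the profiler's sample buffer (hypothetical, not the real C structure)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def record(self, event):
        # Append one event to the current buffer.
        with self._lock:
            self._events.append(event)

    def _rotate(self):
        # Swap in a fresh buffer and hand back the old one.
        events, self._events = self._events, []
        return events

    def collect_try(self):
        # Try-lock style: give up immediately if the lock is contended.
        if not self._lock.acquire(blocking=False):
            return None  # nothing collected; events stay in the buffer
        try:
            return self._rotate()
        finally:
            self._lock.release()

    def collect_blocking(self):
        # Unconditional style: wait for the lock; the critical section is only the swap.
        with self._lock:
            return self._rotate()
```

A short usage example, under the same assumptions: when the try-lock collection fails, the next successful collection returns events from both iterations.

```python
buf = SampleBuffer()
buf.record("a")
buf._lock.acquire()          # simulate contention from another thread
missed = buf.collect_try()   # returns None, "a" stays in the buffer
buf._lock.release()
buf.record("b")
both = buf.collect_blocking()  # returns ["a", "b"]
```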
@@ -394,20 +394,18 @@ iterevents_new(PyTypeObject* type, PyObject* Py_UNUSED(args), PyObject* Py_UNUSE
     }

     IterEventsState* iestate = (IterEventsState*)type->tp_alloc(type, 0);
-    if (!iestate)
+    if (!iestate) {
+        PyErr_SetString(PyExc_RuntimeError, "failed to allocate IterEventsState");
Is the exception handled somewhere?
Yeah, it's handled here:
dd-trace-py/ddtrace/profiling/collector/memalloc.py
Lines 170 to 179 in b90fa38
def collect(self):
    # TODO: The event timestamp is slightly off since it's going to be the time we copy the data from the
    # _memalloc buffer to our Recorder. This is fine for now, but we might want to store the nanoseconds
    # timestamp in C and then return it via iter_events.
    try:
        events_iter, count, alloc_count = _memalloc.iter_events()
    except RuntimeError:
        # DEV: This can happen if either _memalloc has not been started or has been stopped.
        LOG.debug("Unable to collect memory events from process %d", os.getpid(), exc_info=True)
        return tuple()
dd-trace-py/ddtrace/profiling/collector/memalloc.py
Lines 174 to 179 in 71e3997
    try:
        events_iter, count, alloc_count = _memalloc.iter_events()
    except RuntimeError:
        # DEV: This can happen if either _memalloc has not been started or has been stopped.
        LOG.debug("Unable to collect memory events from process %d", os.getpid(), exc_info=True)
        return tuple()
Yes, we do handle RuntimeError here.
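As a rough illustration of why the RuntimeError set in the diff is caught here while the earlier SystemError was not: in Python, SystemError is not a subclass of RuntimeError, so `except RuntimeError` lets it propagate. The sketch below uses a hypothetical `FakeMemalloc` class in place of the real `_memalloc` extension, and the meaning of the returned tuple fields is an assumption.

```python
import logging
import os

LOG = logging.getLogger(__name__)


class FakeMemalloc:
    """Hypothetical stand-in for the _memalloc extension module."""

    def __init__(self, error=None):
        self.error = error

    def iter_events(self):
        if self.error is not None:
            raise self.error
        # Assumed shape: (events iterator, event count, allocation count).
        return iter(["event"]), 1, 1


def collect(memalloc):
    try:
        events_iter, count, alloc_count = memalloc.iter_events()
    except RuntimeError:
        # Expected failure: log it and report no events for this iteration.
        LOG.debug("Unable to collect memory events from process %d", os.getpid(), exc_info=True)
        return tuple()
    return tuple(events_iter)
```

With this sketch, `collect(FakeMemalloc(RuntimeError("not started")))` returns an empty tuple, whereas `collect(FakeMemalloc(SystemError("NULL without error set")))` raises, which matches the failure described in the PR.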
Datadog Report
Branch report: ✅ 0 Failed, 130 Passed, 1468 Skipped, 4m 40s Total duration (36m 30.45s time saved)
Benchmarks
Benchmark execution time: 2025-01-24 16:41:11
Comparing candidate commit 1f067a5 in PR branch
Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.
TODO - regression test?