fix rmm managed memory resource initialization to resolve some intermittent … #787

Merged
merged 2 commits into NVIDIA:branch-24.12 from eo_24.10_patches on Dec 9, 2024

Conversation

@eordentlich (Collaborator) commented Nov 20, 2024

…memory issues

It looks like rmm.reinitialize destroys the memory resources assigned to all devices other than the current one. Initializing as in this PR, and doing so only once per process, avoids this. I believe the intermittent CUDA errors were due to hit-or-miss use of the destroyed resources in rmm operations on the C++ side, so this should address the issue.

Also, in UMAP, the current device was set before the memory resource was initialized; it needs to be the other way around.

Also, pin numpy < 1 in the README.
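
A minimal sketch of the once-per-process pattern described above (the helper name and module-level flag are illustrative, not the PR's actual code):

import rmm

_mr_initialized = False  # illustrative per-process guard

def _ensure_managed_mr() -> None:
    # Set a ManagedMemoryResource for the current device, at most once per
    # process, without calling rmm.reinitialize (which, per the description
    # above, resets resources configured for other devices).
    global _mr_initialized
    if _mr_initialized:
        return
    if not isinstance(
        rmm.mr.get_current_device_resource(), rmm.mr.ManagedMemoryResource
    ):
        rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
    _mr_initialized = True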

Commits:
…memory issues
pin numpy < 1 in readme

Signed-off-by: Erik Ordentlich <[email protected]>
@eordentlich (Collaborator, Author) commented:

build

@eordentlich eordentlich marked this pull request as ready for review December 6, 2024 23:09
@eordentlich eordentlich changed the title from "avoid reinitializing rmm multiple times to resolve some intermittent …" to "fix rmm managed memory resource initialization to resolve some intermittent …" Dec 9, 2024
@eordentlich (Collaborator, Author) commented:

build

@eordentlich eordentlich requested a review from lijinf2 December 9, 2024 18:58
if not type(rmm.mr.get_current_device_resource()) == type(
    rmm.mr.ManagedMemoryResource()
):
    rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
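
As an aside, an equivalent check via isinstance (a sketch, not the PR's diff; isinstance would also accept subclasses, whereas the exact type comparison above does not):

import rmm

if not isinstance(rmm.mr.get_current_device_resource(), rmm.mr.ManagedMemoryResource):
    rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())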
Collaborator commented:
Looks good.

One question: any idea what the difference is between this line and rmm.reinitialize(managed_memory=True)?

@eordentlich (Collaborator, Author) replied:

Yes. rmm.reinitialize also resets the memory resources of previously configured devices to default mrs, and I believe this was the source of our problems. It destroys the original memory resources, repointing the device-to-mr mapping at the new defaults. However, this happens only on the Python side: on the C++ side, the corresponding map still points to the destroyed mrs (which are the same objects), leaving dangling pointers. That leads to undefined behavior in the C++ rmm invocations whenever the current device changes within the same process, e.g. between fit and transform with Python worker reuse on. Our observations and code inspection were consistent with this.
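
To make the failure mode concrete, a hedged sketch (device ids are illustrative, and the commented behavior follows the explanation above rather than verified output):

import rmm

# The approach taken in this PR: configure only the current device's
# resource, leaving any resources set for other devices untouched.
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
print(type(rmm.mr.get_current_device_resource()).__name__)  # ManagedMemoryResource

# The problematic path: a later call such as
#   rmm.reinitialize(managed_memory=True, devices=1)
# rebuilds the Python-side device-to-resource map and drops the
# ManagedMemoryResource configured above. The C++-side map can still point
# at that destroyed resource, so subsequent C++ rmm operations on the
# original device dereference a dangling pointer -- surfacing as
# intermittent CUDA errors when devices change within one process.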

@eordentlich eordentlich merged commit acc220b into NVIDIA:branch-24.12 Dec 9, 2024
3 checks passed
@eordentlich eordentlich deleted the eo_24.10_patches branch December 9, 2024 22:34