fix rmm managed memory resource initialization to resolve some intermittent … #787

Merged
merged 2 commits into NVIDIA:branch-24.12 from eo_24.10_patches on Dec 9, 2024

Conversation

@eordentlich (Collaborator) commented Nov 20, 2024

…memory issues

It looks like rmm.reinitialize destroys the memory resources assigned to all devices other than the current one. Initializing as in this PR, and doing so only once per process, avoids this. I believe the intermittent CUDA errors were due to hit-or-miss use of the destroyed resources in rmm operations on the C++ side, so this should address the issue.

Also, in UMAP, the current device was set before the memory resource was initialized; it needs to be the other way around.

Also, pin numpy < 1 in the README.
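
A minimal sketch of the once-per-process pattern described above (the helper name and module-level flag are illustrative, not the PR's actual code):

import rmm

_mr_initialized = False  # illustrative per-process guard

def _ensure_managed_mr() -> None:
    # Set a ManagedMemoryResource for the current device, at most once per
    # process, without calling rmm.reinitialize (which, per the description
    # above, resets resources configured for other devices).
    global _mr_initialized
    if _mr_initialized:
        return
    if not isinstance(
        rmm.mr.get_current_device_resource(), rmm.mr.ManagedMemoryResource
    ):
        rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
    _mr_initialized = True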

Commits:
…memory issues
pin numpy < 1 in readme

Signed-off-by: Erik Ordentlich <[email protected]>
@eordentlich (Collaborator, Author) commented:

build

@eordentlich eordentlich marked this pull request as ready for review December 6, 2024 23:09
@eordentlich eordentlich changed the title from "avoid reinitializing rmm multiple times to resolve some intermittent …" to "fix rmm managed memory resource initialization to resolve some intermittent …" Dec 9, 2024
@eordentlich (Collaborator, Author) commented:

build

@eordentlich eordentlich requested a review from lijinf2 December 9, 2024 18:58
if not type(rmm.mr.get_current_device_resource()) == type(
    rmm.mr.ManagedMemoryResource()
):
    rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
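
As an aside, an equivalent check via isinstance (a sketch, not the PR's diff; isinstance would also accept subclasses, whereas the exact type comparison above does not):

import rmm

if not isinstance(rmm.mr.get_current_device_resource(), rmm.mr.ManagedMemoryResource):
    rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())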
Collaborator commented:
Looks good.

One question: any idea what the difference is between this line and rmm.reinitialize(managed_memory=True)?

@eordentlich (Collaborator, Author) replied:

Yes. rmm.reinitialize also resets the memory resources of previously configured devices to default mrs, and I believe this was the source of our problems. It destroys the original memory resources, repointing the device-to-mr mapping at the new defaults. However, this happens only on the Python side: on the C++ side, the corresponding map still points to the destroyed mrs (which are the same objects), leaving dangling pointers. That leads to undefined behavior in the C++ rmm invocations whenever the current device changes within the same process, e.g. between fit and transform with Python worker reuse on. Our observations and code inspection were consistent with this.
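
To make the failure mode concrete, a hedged sketch (device ids are illustrative, and the commented behavior follows the explanation above rather than verified output):

import rmm

# The approach taken in this PR: configure only the current device's
# resource, leaving any resources set for other devices untouched.
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
print(type(rmm.mr.get_current_device_resource()).__name__)  # ManagedMemoryResource

# The problematic path: a later call such as
#   rmm.reinitialize(managed_memory=True, devices=1)
# rebuilds the Python-side device-to-resource map and drops the
# ManagedMemoryResource configured above. The C++-side map can still point
# at that destroyed resource, so subsequent C++ rmm operations on the
# original device dereference a dangling pointer -- surfacing as
# intermittent CUDA errors when devices change within one process.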

@eordentlich eordentlich merged commit acc220b into NVIDIA:branch-24.12 Dec 9, 2024
3 checks passed
@eordentlich eordentlich deleted the eo_24.10_patches branch December 9, 2024 22:34