How can I solve this problem: Test NCCL failure common.cu:1012 'internal error - please report this issue to the NCCL developers / ' .. dell03 pid 543317: Test failure common.cu:891 #275

Open
mstJuly opened this issue Dec 12, 2024 · 2 comments

Comments

@mstJuly

mstJuly commented Dec 12, 2024

No description provided.

@sjeaugey
Member

If you are running with NCCL 2.18, you can ignore that message, and simply run again setting NCCL_DEBUG=WARN to figure out why NCCL failed.

If you still see that with recent NCCL versions, please provide the log with NCCL_DEBUG=INFO.
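One way to follow this advice is sketched below: a small Python wrapper that reruns a command with `NCCL_DEBUG` set and optionally saves the combined output to a file that can be attached to the issue. The helper name `run_with_nccl_debug`, the log file name, and the test binary path in the usage comment are illustrative assumptions, not part of NCCL or nccl-tests; only the `NCCL_DEBUG` environment variable and its `WARN` and `INFO` values come from the comment above.

```python
import os
import subprocess


def run_with_nccl_debug(cmd, level="WARN", log_path=None):
    """Run `cmd` with NCCL_DEBUG set to `level` (hypothetical helper).

    cmd      -- command as a list of strings, e.g. the failing test invocation
    level    -- value for NCCL_DEBUG, such as "WARN" or "INFO"
    log_path -- if given, the combined stdout/stderr is also written here
    """
    # Copy the current environment and add NCCL_DEBUG for the child process only.
    env = dict(os.environ, NCCL_DEBUG=level)

    # Merge stderr into stdout so NCCL messages and test output stay in order.
    result = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    if log_path is not None:
        with open(log_path, "w") as log_file:
            log_file.write(result.stdout)

    return result


# Example usage (the binary path and flags are placeholders for your own run):
#   result = run_with_nccl_debug(
#       ["./build/all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "1"],
#       level="INFO",
#       log_path="nccl_info.log",
#   )
#   print(result.stdout)
```

Setting the variable only in the child's environment keeps the calling shell unchanged, so the same wrapper can be used first with `WARN` and then with `INFO` if a full log is requested.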

@mstJuly
Author

mstJuly commented Dec 16, 2024

If you are running with NCCL 2.18, you can ignore that message, and simply run again setting NCCL_DEBUG=WARN to figure out why NCCL failed.

If you still see that with recent NCCL versions, please provide the log with NCCL_DEBUG=INFO.

Thanks, I solved the problem. I changed the version to 2.19.3 and was able to run the test successfully!
