[Bug]: Colab tutorial failing due to CUDA version conflict #1191
Comments
Hi, thanks for reporting this. I'm able to reproduce it, and it's not as trivial as past errors to resolve, so it's a bit outside of my expertise. You'll need to wait for my colleague @ptheywood to get to it, hopefully next week (it's a four-day weekend for Easter here currently).

My notes:

- Updated the wheelhouse link (needed the latest RC too).
- Package import then fails with …
- Checking … reports … (see the diagnostic sketch below)
- This would imply we either need to symlink …
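For anyone reproducing those checks, a rough diagnostic along these lines (a sketch only; the extension-module glob and the exact SONAME checked are assumptions) can show which libnvrtc the installed extension expects and whether the dynamic loader can currently resolve it:

```python
# Hedged diagnostic sketch: report which libnvrtc the pyflamegpu compiled
# extension(s) require and whether the loader can open the expected SONAME.
# The package/extension layout assumed here is illustrative, not confirmed.
import ctypes
import glob
import subprocess
import sysconfig

site_packages = sysconfig.get_paths()["purelib"]
extensions = glob.glob(f"{site_packages}/pyflamegpu/**/*.so", recursive=True)

for ext in extensions:
    # ldd lists each required shared library and whether it was resolved.
    output = subprocess.run(["ldd", ext], capture_output=True, text=True).stdout
    for line in output.splitlines():
        if "nvrtc" in line:
            print(ext, "->", line.strip())

# Independently check whether the loader can open the SONAME from the traceback.
try:
    ctypes.CDLL("libnvrtc.so.11.2")
    print("libnvrtc.so.11.2 is loadable")
except OSError as err:
    print("libnvrtc.so.11.2 is NOT loadable:", err)
```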
I've looked into this a bit but have no resolution yet; notes below for future reference.

This does disagree with my understanding of the nvrtc shared library versioning scheme from CUDA 11.3+ (https://docs.nvidia.com/cuda/nvrtc/#versioning-scheme, https://developer.nvidia.com/blog/programming-efficiently-with-the-cuda-11-3-compiler-toolchain/), which suggests that just depending on an nvrtc library from the same CUDA major version should be sufficient.

We could switch to statically linking libnvrtc.so, which would increase the size of our wheels and force recompilation for any bugfixes etc., but would resolve the issue. However, this would only be viable for CUDA 11.5+ with CMake >= 3.26, via CMake's FindCUDAToolkit (https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#cuda-toolkit-nvrtc).

NVIDIA do distribute nvrtc's runtime dependencies as a pip package per major CUDA version (nvidia-cuda-nvrtc-cu11, nvidia-cuda-nvrtc-cu12), which we could add a runtime dependency on, or optionally install the corresponding version, e.g.:
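As a minimal sketch of that optional install (the choice of the cu11 variant and the unpinned version are assumptions for illustration):

```python
# Hedged sketch: install NVIDIA's nvrtc redistributable wheel matching the CUDA
# major version the pyflamegpu wheel was built against. The unpinned cu11
# variant shown here is an assumption, not a confirmed requirement.
import subprocess
import sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "nvidia-cuda-nvrtc-cu11",  # or nvidia-cuda-nvrtc-cu12 for CUDA 12 builds
])
```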
However, on Colab this conflicts with the version of that package which PyTorch depends on. It does still install, and the appropriate .so files do get created, but they then fail to link, as they are not visible to the dynamic linker even after installation.
Attempting to add that location via … Attempting this locally after uninstalling CUDA 12.0 (so that the binary's expected RPATH location does not exist) allows the error to be reproduced and investigated via …. In which case installing …
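One detail worth noting when experimenting with this in a notebook: LD_LIBRARY_PATH is read by the dynamic loader when a process starts, so exporting it from an already-running Python/Colab session only affects child processes, not the current interpreter. A small illustration, using the library directory reported later in this issue:

```python
# Hedged illustration: updating LD_LIBRARY_PATH inside a running Python session
# does not let the *current* interpreter resolve libnvrtc, because the dynamic
# loader caches its search path at process start-up; only child processes see it.
import ctypes
import os
import subprocess
import sys

lib_dir = "/usr/local/cuda-12.2/targets/x86_64-linux/lib"  # path from the bug report
os.environ["LD_LIBRARY_PATH"] = lib_dir + os.pathsep + os.environ.get("LD_LIBRARY_PATH", "")

# Still fails here: the search path is unchanged for this process, and the CUDA
# 12.2 toolkit only provides libnvrtc.so.12, not the libnvrtc.so.11.2 required.
try:
    ctypes.CDLL("libnvrtc.so.11.2")
except OSError as err:
    print("current process:", err)

# A freshly started interpreter does inherit the updated environment variable.
subprocess.run([sys.executable, "-c",
                "import os; print('child sees:', os.environ.get('LD_LIBRARY_PATH'))"])
```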
So we could potentially do something in our … I would hope it's possible to find a way to make this work on Colab via ….

Torch adds the appropriate version of … as a dependency. So that local wheels don't need it (to avoid pulling in 80 MB per pyflamegpu build locally) but distributed wheels do depend upon it, we could add a CMake option to enable this behaviour only in our release CI workflows (e.g. a less wordy version of …). However, this will still cause conflicts if a user wants pyflamegpu and other Python packages built with different CUDA versions in the same env, though in that case the solution is a local build. It's still not entirely clear to me how we then ensure that the correct library is found at pyflamegpu import time via Python.

Alternatively, we could distribute …
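One way that import-time lookup could work, sketched here purely as a possibility rather than the project's actual plan, is the approach other CUDA-enabled wheels take: locate the library shipped by the nvidia-cuda-nvrtc-cuXX wheel and preload it with ctypes before the compiled extension module is imported. The nvidia/cuda_nvrtc/lib layout assumed below matches how those redistributable wheels are typically laid out, but treat it as an assumption:

```python
# Hedged sketch (not pyflamegpu's actual __init__.py): preload libnvrtc from the
# nvidia-cuda-nvrtc wheel, if present, before importing the compiled extension,
# so the loader can satisfy the extension's libnvrtc dependency.
import ctypes
import glob
import os
import sysconfig


def _preload_nvrtc() -> None:
    site_packages = sysconfig.get_paths()["purelib"]
    # Assumed layout of the redistributable wheel: nvidia/cuda_nvrtc/lib/libnvrtc.so.*
    pattern = os.path.join(site_packages, "nvidia", "cuda_nvrtc", "lib", "libnvrtc.so.*")
    for candidate in sorted(glob.glob(pattern)):
        try:
            # Once the library is resident in the process, the loader can satisfy
            # the extension's NEEDED entry even though the directory is not on
            # its default search path.
            ctypes.CDLL(candidate, mode=ctypes.RTLD_GLOBAL)
            return
        except OSError:
            continue


_preload_nvrtc()
# ...followed by the package's normal compiled-extension import.
```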
A short-term fix is to replace the contents of the second cell with: …

This appears to work, but results in pip errors/warnings about package conflicts, and the hardcoded path to the shared library is not ideal. Longer term we can probably fold some similar logic into ….

However, as Colab now ships with CUDA 12.2, this encounters the very bad RTC compilation runtime from 12.2+ due to the use of jitify, so the first run of the … cell will be slow.
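For context, a rough reconstruction of the kind of workaround cell being described (not the actual cell contents: the wheelhouse URL, pip flags, version choices, and hardcoded library path below are all assumptions for illustration):

```python
# Hedged reconstruction of the style of workaround cell described above.
# The wheelhouse URL/flags, package versions, and hardcoded path are assumptions;
# consult the FLAME GPU documentation for the real install command.
import ctypes
import subprocess
import sys


def _pip(*args: str) -> None:
    subprocess.check_call([sys.executable, "-m", "pip", "install", *args])


# Install a CUDA 12 pyflamegpu wheel from the project's wheelhouse, plus NVIDIA's
# nvrtc redistributable for CUDA 12.
_pip("--extra-index-url", "https://whl.flamegpu.com/whl/", "pyflamegpu")
_pip("nvidia-cuda-nvrtc-cu12")

# Hardcoded preload of the nvrtc library installed by the wheel above, mirroring
# the "hardcoded path" caveat mentioned in the comment.
ctypes.CDLL(
    f"{sys.prefix}/lib/python3.10/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12",
    mode=ctypes.RTLD_GLOBAL,
)

import pyflamegpu  # should now import without the libnvrtc ImportError
```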
@complexitysoftware I've now pushed an update to the tutorial notebook. We will need to make changes to FLAMEGPU/FLAMEGPU2 itself for a more robust / correct fix (and update the tutorial again), but I can't commit to a timeline for that. Thank you for making us aware of the issue, and I've opened #1193 to track fixing this in a more robust way for future releases.
Yes, I have tried the amended Colab notebook and it does run correctly. The compile is slow, but it does work and the warning sets users' expectations. Thanks for your work on this and for developing and maintaining FlameGPU. I have worked in IT in many areas for many years, and emergent behaviour is still the most fascinating for me. Thanks again.
Bug Description
Tutorial fails with
ImportError: libnvrtc.so.11.2: cannot open shared object file: No such file or directory
It seems the error is caused by a conflict with the CUDA version installed on Colab, which appears to be 12.2.
Searching for libnvrtc with find gives:
/usr/local/cuda-12.2/targets/x86_64-linux/lib/stubs/libnvrtc.so
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libnvrtc.so
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libnvrtc.so.12.2.140
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libnvrtc.so.12
To Reproduce
Clicking the 'Try on Colab' button on the Flame GPU home page opens the Colab tutorial for potential users (https://colab.research.google.com/github/FLAMEGPU/FLAMEGPU2-tutorial-python/blob/google-colab/FLAME_GPU_2_python_tutorial.ipynb).
Running the tutorial fails with
ImportError: libnvrtc.so.11.2: cannot open shared object file: No such file or directory
on the import pyflamegpu line
Expected Behaviour
Tutorial should run
OS
Ubuntu 22.04.3 LTS
CUDA Versions
CUDA 12.2
GPUs
T4
GPU Driver
535.104.05
Additional Information
No response