
🐛 [Bug] libtorchtrt.so: undefined symbol when importing torch_tensorrt in docker #3350

Closed
NetaPanda opened this issue Jan 10, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@NetaPanda

Bug Description

I have tried installing the repo with docker by:

sudo DOCKER_BUILDKIT=1 docker build --build-arg TENSORRT_VERSION=10.7.0 -f docker/Dockerfile -t torch_tensorrt:latest .

On my first attempt, the docker build process showed a warning:
INFO: pip is looking at multiple versions of torch

then pip downloaded many torch versions without installing any of them (dependency backtracking), and the process got stuck indefinitely.

After some investigation I added RUN pip install --upgrade pip to the Dockerfile, right below the line

RUN curl -L https://github.com/a8m/envsubst/releases/download/v1.2.0/envsubst-`uname -s`-`uname -m` -o envsubst &&\
    chmod +x envsubst && mv envsubst /usr/local/bin

and right above the line

RUN pip install -r /opt/torch_tensorrt/py/requirements.txt
...

Now the build process finishes, but once I enter the container with

sudo docker run --rm --runtime=nvidia --gpus all -it --shm-size=8gb --env="DISPLAY" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --name=torch_tensorrt --ipc=host --net=host torch_tensorrt:latest

and try to import torch_tensorrt inside python, it gives the following error:

OSError: /root/.pyenv/versions/3.10.16/lib/python3.10/site-packages/torch_tensorrt/lib/libtorchtrt.so: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKSs
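The mangled name itself hints at the cause: in the Itanium C++ ABI, the abbreviation `Ss` stands for the pre-C++11 `std::string`, while a cxx11-ABI build mangles the string type inside the `__cxx11` inline namespace instead. A minimal, hypothetical helper to eyeball this (the function name is ours, not part of torch_tensorrt):

```python
def uses_pre_cxx11_string(mangled: str) -> bool:
    """Heuristic: does this Itanium-mangled name reference the old-ABI std::string?

    "Ss" is the standard abbreviation for the pre-C++11 std::string;
    a C++11-ABI build instead mangles the type under the "__cxx11"
    inline namespace, so the same function gets a different symbol.
    """
    return "Ss" in mangled and "__cxx11" not in mangled

# The symbol from the error above: the trailing "RKSs" is a
# `const std::string&` in the pre-C++11 ABI, so a libtorch built with
# the C++11 ABI will not export this exact symbol.
symbol = "_ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKSs"
print(uses_pre_cxx11_string(symbol))  # True -> an ABI mismatch is plausible
```

Running the symbol through `c++filt` shows the same thing: the last parameter demangles to `std::string const&`, which points at a mismatch between the ABI the extension was compiled with and the ABI of the installed libtorch.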

I have also tried to build from source, which ended up with exactly the same error as in docker (undefined symbol).

I wonder whether there is an issue with my OS. I am using Ubuntu 22.04 with CUDA 12.6 installed.

To Reproduce

Steps to reproduce the behavior:

  1. Clone the repo: git clone https://github.com/pytorch/TensorRT.git
  2. Modify docker/Dockerfile by adding RUN pip install --upgrade pip as described above
  3. sudo DOCKER_BUILDKIT=1 docker build --build-arg TENSORRT_VERSION=10.7.0 -f docker/Dockerfile -t torch_tensorrt:latest .
  4. sudo docker run --rm --runtime=nvidia --gpus all -it --shm-size=8gb --env="DISPLAY" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --name=torch_tensorrt --ipc=host --net=host torch_tensorrt:latest
  5. python
  6. import torch_tensorrt

Expected behavior

The torch_tensorrt package should import successfully.

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0): 2.6.0a0 (since I directly cloned the main branch)
  • PyTorch Version (e.g. 1.0): N/A (installed within docker)
  • CPU Architecture: X86-64 (Intel I9-13900K)
  • OS (e.g., Linux): Ubuntu 22.04 Desktop
  • How you installed PyTorch (conda, pip, libtorch, source): Managed by Dockerfile
  • Build command you used (if compiling from source): See the above Steps to reproduce
  • Are you using local sources or building from archives: N/A
  • Python version: 3.10
  • CUDA version: local 12.6, inside docker it seems to be 12.4
  • GPU models and configuration: RTX4090
  • Any other relevant information: N/A

Additional context

The INFO: pip is looking at multiple versions of torch issue did not occur during my very first few docker build attempts; I wonder whether this is a cache conflict or caused by something else.

@NetaPanda NetaPanda added the bug Something isn't working label Jan 10, 2025
@zewenli98
Collaborator

To my knowledge, you would get the undefined symbol error in two cases: 1) a mismatch between the torch version and the libtorch version, which you can find in MODULE.bazel; 2) not using --use-cxx11-abi. For CUDA 12.6, if you want to build torch-trt from source, you need to run something like python setup.py develop --use-cxx11-abi
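For case 1), a quick sanity check is to compare the installed torch version against the libtorch version pinned in MODULE.bazel. A minimal sketch, where both the helper and the comparison policy (matching major.minor only) are our assumptions rather than a torch-trt utility:

```python
def versions_match(installed: str, pinned: str) -> bool:
    """Compare the major.minor parts of two version strings,
    ignoring pre-release/local suffixes like 'a0' or '+cu124'."""
    def majmin(v: str) -> list[str]:
        # Strip local ("+cu124") and pre-release ("a0") suffixes,
        # then keep only the first two dotted components.
        return v.split("+")[0].split("a")[0].split(".")[:2]
    return majmin(installed) == majmin(pinned)

# e.g. the version reported by `pip show torch` vs the one in MODULE.bazel
print(versions_match("2.6.0a0", "2.6.0"))  # True
print(versions_match("2.5.1", "2.6.0"))    # False
```

For case 2), `torch.compiled_with_cxx11_abi()` reports which C++ ABI the installed torch build uses, which tells you whether --use-cxx11-abi is needed when compiling torch-trt against it.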

@NetaPanda
Author

Thanks for the tip! After 3-4 painful days of trying, I finally gave up and used the nvidia pytorch docker instead. Sorting out the libraries was a mess for me, especially with multiple torch packages installed on my system (both in conda and the local Python env). Indeed, I did not use --use-cxx11-abi; I'm not sure whether that was the real cause of the issue. Since I have found another solution, I shall close this issue.
