🐛 [Bug] Error when serving Torch-TensorRT JIT model to Nvidia-Triton #3248
Comments
Seems like you are mixing dynamo and torchscript. There are two options.
1. Use dynamo to trace and deploy in torchscript (this is what we recommend):
import torch
import torch_tensorrt
torch.hub._validate_not_a_forked_repo=lambda a,b,c: True
# load model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")
# Compile with Torch TensorRT;
trt_model = torch_tensorrt.compile(model,
#ir="dynamo" implicitly
inputs= [torch_tensorrt.Input((1, 3, 224, 224))],
enabled_precisions={torch.half}  # Run with FP16
)
# Save the model
# inputs takes a sequence of example tensors on the same device as the model
torch_tensorrt.save(trt_model, "model.ts", output_format="torchscript", inputs=[torch.randn((1, 3, 224, 224)).cuda()])
(https://pytorch.org/TensorRT/user_guide/saving_models.html)
Alternatively:
2. Use the torchscript frontend:
import torch
import torch_tensorrt
torch.hub._validate_not_a_forked_repo=lambda a,b,c: True
# load model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")
# Compile with Torch TensorRT;
trt_model = torch_tensorrt.compile(model,
ir="torchscript",
inputs= [torch_tensorrt.Input((1, 3, 224, 224))],
enabled_precisions={torch.half}  # Run with FP16
)
# Save the model
torch.jit.save(trt_model, "model.pt")
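Before deploying either artifact to Triton, it may help to confirm that the saved file loads and runs in plain PyTorch, inside the same container that produced it. The following is a minimal sketch, assuming a CUDA GPU is available and that the module accepts a single FP32 input; the helper name `smoke_test` and its default shape are illustrative and not part of Torch-TensorRT.

```python
def smoke_test(path, shape=(1, 3, 224, 224)):
    """Load a saved TorchScript module and run one forward pass on the GPU."""
    # Imported lazily so the helper can be defined on a machine without the GPU stack.
    import torch
    import torch_tensorrt  # noqa: F401  (importing registers the TensorRT runtime ops)

    module = torch.jit.load(path).eval().to("cuda")
    # Assumption: the compiled graph takes one FP32 tensor of this shape.
    example = torch.randn(shape, device="cuda")
    with torch.no_grad():
        output = module(example)
    return tuple(output.shape)
```

Calling `smoke_test("model.ts")` (or `"model.pt"` for the second script) should return the output shape if the module is healthy. If this step already fails, the problem is in export rather than in Triton.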
@narendasan Thanks for the quick reply. If I run the first script you provided
I get this error at the end when saving the model
With the second script I can get the TS model. However, inference with this model on Triton server fails. Can anyone on your end confirm whether a Torch-TensorRT optimized model can (or cannot) run on Nvidia Triton? (Basically, confirm whether this tutorial works or not: https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html ) This tutorial has been on the Torch-TensorRT page for a while, and over the past year I've tried it multiple times across multiple Triton server and Torch-TensorRT versions. It never worked.
Will poke around, might just be that the tutorial is outdated.
Fixes: #3248 Signed-off-by: Naren Dasan <[email protected]> Signed-off-by: Naren Dasan <[email protected]>
@zmy1116 Updated the triton tutorial, seems like there were some subtle things that could be off but for the most part nothing has changed. I uploaded scripts that I have verified to work for exporting and querying a resnet model. Hopefully that is enough to go on. The TL;DR of that tutorial in #3292 is if you check out that branch and go to
# Could be any recent publish tag (I tested with 24.08), just use the same for all containers so that the TRT versions are the same
# Export model into model repo
docker run --gpus all -it --rm -v ${PWD}:/triton_example nvcr.io/nvidia/pytorch:24.08-py3 python /triton_example/export.py
# Start server
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:24.08-py3 tritonserver --model-repository=/triton_example/model_repository
and in another terminal
# Get a sample image
wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"
# Query server
docker run -it --net=host -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:24.08-py3-sdk bash -c "pip install torchvision && python /triton_example/client.py"
You should get an output like:
[b'12.460938:90' b'11.523438:92' b'9.656250:14' b'8.414062:136'
b'8.210938:11']
You can take a look at config.pbtxt for the Triton config I used. I would recommend using explicit dim sizes when possible.
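To illustrate what explicit dim sizes can look like, here is a hedged sketch that renders a config.pbtxt for a TorchScript model. It is not the config from #3292, which is not reproduced in this thread. The model name `resnet50`, the FP32 data types, and the output shape are assumptions; `input__0` and `output__0` follow the naming convention of Triton's PyTorch backend. With `max_batch_size: 0`, the dims include the batch dimension.

```python
def render_config(name, input_dims, output_dims, max_batch_size=0):
    """Render a Triton config.pbtxt for a TorchScript model with explicit dims."""

    def dims(values):
        return "[ " + ", ".join(str(v) for v in values) + " ]"

    lines = [
        f'name: "{name}"',
        'platform: "pytorch_libtorch"',
        # 0 disables Triton-side batching, so dims below carry the full shape.
        f"max_batch_size: {max_batch_size}",
        "input [",
        "  {",
        '    name: "input__0"',
        "    data_type: TYPE_FP32",  # assumption: the model takes FP32 input
        f"    dims: {dims(input_dims)}",
        "  }",
        "]",
        "output [",
        "  {",
        '    name: "output__0"',
        "    data_type: TYPE_FP32",  # assumption: the model returns FP32 logits
        f"    dims: {dims(output_dims)}",
        "  }",
        "]",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values for a ResNet-50 exported with a fixed batch of 1.
config_text = render_config("resnet50", input_dims=(1, 3, 224, 224), output_dims=(1, 1000))
print(config_text)
```

Keeping every dimension fixed matches the static `(1, 3, 224, 224)` input the model was compiled for, so Triton never forwards a shape the engine was not built to handle.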
Thanks for the fix. I tested and confirmed that at least for
I also did some comparison with the code I was using to create the model, start the engine, and run inference. The differences I see are:
I verified all these differences, and I confirm that none of these changes really cause the problem... I then tried with the 24.09 version (the version I was using when creating the bug report). I found out the issue is the following:
I tested both
Thank you.
Hmm, well it's good that at least 1 GPU works. I think at this point the folks in https://github.com/triton-inference-server/server would be better able to debug what is happening. From our side we mostly focus on model export and the runtime extension, and they handle all of the orchestration stuff.
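One way to check whether a failure is tied to multi-GPU placement is to pin the model to a single device through an `instance_group` entry in config.pbtxt. The sketch below only renders that stanza as text. The helper name is hypothetical, and this is a diagnostic step rather than a fix confirmed anywhere in this thread.

```python
def render_instance_group(gpu_ids, count=1):
    """Render an instance_group stanza that pins a model to specific GPUs."""
    gpus = ", ".join(str(g) for g in gpu_ids)
    return (
        "instance_group [\n"
        "  {\n"
        f"    count: {count}\n"
        "    kind: KIND_GPU\n"
        f"    gpus: [ {gpus} ]\n"
        "  }\n"
        "]\n"
    )


# Pin one instance of the model to GPU 0 only.
stanza = render_instance_group([0])
print(stanza)
```

Appending the stanza to the model's config.pbtxt and restarting the server restricts the model to the listed devices, which makes a single-GPU run reproducible on a multi-GPU host.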
@narendasan It seems like Triton server 24.10 addressed the multi-GPU issue. The problem I seem to have now is that the model compiled with the PyTorch container 24.10 is much slower than with 24.06, for example. Benchmarking with the perf_analyzer tool and loading each model onto a single GPU in the corresponding Triton server version:
Do you have recommendations on how to improve performance with 24.10?
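For a like-for-like comparison between container versions, it helps to run an identical perf_analyzer invocation against each server. The sketch below builds such a command line. The model name, URL, and concurrency range are assumptions, and the exact command used for the numbers in this thread is not shown.

```python
import shlex


def perf_analyzer_command(model, url="localhost:8000", concurrency="1:4"):
    """Build a perf_analyzer argument list to reuse across server versions."""
    return [
        "perf_analyzer",
        "-m", model,                          # model name in the Triton repository
        "-u", url,                            # server endpoint (HTTP by default)
        "--concurrency-range", concurrency,   # sweep of concurrent requests, start:end
    ]


# "resnet50" is a placeholder model name.
command = perf_analyzer_command("resnet50")
print(shlex.join(command))
```

Running the same argument list against the 24.06 and 24.10 servers, each with the model on a single GPU, keeps the load pattern constant so that any latency difference can be attributed to the containers.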
Bug Description
I'm trying to serve a Torch-TensorRT optimized model on Nvidia Triton server, based on the provided tutorial
https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html
First, the provided script to generate the optimized model does not work. I tweaked it a bit and got it to work. Then, when I tried to perform inference using Triton server, I got the error
ERROR: [Torch-TensorRT] - IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
To Reproduce
The PyTorch page provides the following script to save the optimized JIT model
When I run this script, I get the error
AttributeError: 'GraphModule' object has no attribute 'save'
To resolve this I tried the following 2 ways:
1. Save the model with torch_tensorrt.save
torch.jit.save(trt_model._run_on_acc_0, "/home/ubuntu/model.pt")
2. Compile a traced JIT model directly
I confirm both methods create the JIT model correctly.
I then put the model in a folder with the same structure the tutorial provides and launch the Triton server. The Triton server launches successfully.
However, when I perform inference, I get the error
ERROR: [Torch-TensorRT] - IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
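For reference, the folder structure mentioned in the steps above can be sketched as follows. This is an illustration under assumptions, not the tutorial's own script: the helper `stage_model`, the model name `resnet50`, and the placeholder files are hypothetical. Triton's PyTorch backend looks for a file named `model.pt` inside a numbered version directory by default, with config.pbtxt one level above it.

```python
import shutil
import tempfile
from pathlib import Path


def stage_model(repo_root, model_name, model_file, config_text, version=1):
    """Copy an exported model into a Triton model repository layout."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # The PyTorch backend expects the file to be called model.pt by default.
    target = version_dir / "model.pt"
    shutil.copyfile(model_file, target)
    (model_dir / "config.pbtxt").write_text(config_text)
    return target


# Demonstration with a placeholder file instead of a real exported model.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "exported.ts"
    source.write_bytes(b"placeholder")
    repo = Path(tmp) / "model_repository"
    stage_model(repo, "resnet50", source, 'name: "resnet50"\n')
    layout = sorted(p.relative_to(tmp).as_posix() for p in repo.rglob("*"))

print(layout)
```

The printed list shows the model directory, its version directory `1`, the `model.pt` file inside it, and `config.pbtxt` next to the version directory. The repository root is what gets passed to `tritonserver --model-repository`.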
Expected behavior
I expect the inference to succeed. I want to serve Torch-TensorRT optimized models on Nvidia-Triton. Our team observed that, on models like SAM2, Torch-TensorRT is significantly faster than a (Torch -> ONNX -> TensorRT) converted model. Our entire inference stack is on Nvidia-Triton, and we would like to take advantage of this new tool.
Environment
We use the Nvidia NGC docker images directly.
PyTorch for model optimization: nvcr.io/nvidia/pytorch:24.09-py3
Triton for hosting: nvcr.io/nvidia/tritonserver:24.09-py3
Additional context
Actually, our current stack is on tritonserver:24.03, and we tested that it does not work with nvcr.io/nvidia/tritonserver:24.03-py3 and nvcr.io/nvidia/pytorch:24.03-py3
Please let us know if you need additional information.