[Performance] Model inference in onnxruntime is much slower than PyTorch #23282
Labels
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
performance (issues related to performance regressions)
Describe the issue
I converted the bge-reranker-v2-m3 model to ONNX and ran it on GPU, but I find ONNX inference is very slow.
Running this model in torch takes about 4 minutes for 10,000 sentence pairs.
Running ONNX on the same data and the same server takes almost 1 hour.
Here is the device info while running the ONNX model:
CPU: (utilization screenshot)
GPU: (utilization screenshot)
My device is a GPU (NVIDIA GeForce RTX 4090).
The package versions are listed in the environment details below.
Why is the ONNX model so slow?
To reproduce
Here is my inference code:
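Roughly, it looks like the following (a minimal sketch rather than the exact script; the checkpoint name, ONNX file path, and the input names input_ids / attention_mask are assumptions that match the export sketch below):

```python
# Minimal sketch: score sentence pairs with the exported ONNX model on the CUDA EP.
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")

session = ort.InferenceSession(
    "bge-reranker-v2-m3.onnx",  # path to the exported model (assumed)
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

pairs = [
    ["what is panda?", "The giant panda is a bear species endemic to China."],
    ["what is panda?", "Paris is the capital of France."],
]

# Tokenize the query/passage pairs; the feed names assume the export below.
enc = tokenizer(pairs, padding=True, truncation=True, max_length=512, return_tensors="np")
scores = session.run(
    None,
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
)[0]
print(scores)
```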
Here is my conversion code (torch to ONNX):
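Again a minimal sketch under the same assumptions (HF checkpoint BAAI/bge-reranker-v2-m3, opset 17, dynamic batch and sequence axes):

```python
# Minimal sketch: export the reranker to ONNX with torch.onnx.export.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
model.eval()
model.config.return_dict = False  # return plain tuples so tracing/export is straightforward
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")

# A dummy query/passage pair just to trace the graph shapes.
dummy = tokenizer([["query", "passage"]], return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "bge-reranker-v2-m3.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=17,
)
```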
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.3
Model File
No response
Is this a quantized model?
No