feat: Add TensorRT support for GNNs #4016
base: main
Conversation
⚠️ Rate limit exceeded

@benjaminhuth has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 8 minutes and 38 seconds before requesting another review.

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered using the @coderabbitai review command. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.
Walkthrough

A new TensorRT-powered edge classification capability, added to the Acts library it is, yes. Multiple files, the changes span: a TensorRTEdgeClassifier plugin class and its implementation, Python bindings, CMake build support, and a dedicated CI job, introduced they are.
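For orientation, a minimal usage sketch follows; inferred from the Python binding the constructor signature is (config plus logger), and assumptions the Config fields beyond modelPath are, confirmed by this PR they are not:

#include <Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp>
#include <Acts/Utilities/Logger.hpp>

int main() {
  // Hedged sketch, not a confirmed API: only modelPath is visible in
  // this review; construction mirrors the Python binding.
  Acts::TensorRTEdgeClassifier::Config cfg;
  cfg.modelPath = "gnn.engine";  // serialized TensorRT engine file

  Acts::TensorRTEdgeClassifier classifier(
      cfg, Acts::getDefaultLogger("TensorRTEdgeClassifier",
                                  Acts::Logging::INFO));
}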
Quality Gate passed

📊 Physics performance monitoring for 09ce4b2
Actionable comments posted: 7
🧹 Nitpick comments (4)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1)
98-100: Use ACTS logging instead of std::cout, prefer you should.

For consistency within the codebase, replace std::cout with ACTS logging macros. Apply this diff to use the logging framework:

 ~TimePrinter() {
-  std::cout << name << ": " << milliseconds(t0, t1) << std::endl;
+  ACTS_INFO(name << ": " << milliseconds(t0, t1) << " ms");
 }

Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (2)
38-41: Destructor to be marked override, consider you should.

Since the base class has a virtual destructor, marking the derived-class destructor with override good practice it is. Apply this diff for clarity:

-  ~TensorRTEdgeClassifier();
+  ~TensorRTEdgeClassifier() override;
49-58: Member variables' initialization order, ensure you must.

Initialize member variables in the order they are declared, to avoid compiler reorder warnings. Ensure that m_cfg is initialized before m_trtLogger, as declared.
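A minimal sketch of the rule, with simplified stand-in types it uses (the real members being m_cfg and m_trtLogger):

#include <memory>
#include <string>
#include <utility>

// Sketch: members are initialized in declaration order regardless of
// the mem-initializer list order, so the list should match the
// declarations to avoid -Wreorder warnings and reads of
// not-yet-initialized members.
class Example {
  std::string m_cfg;                      // declared first, initialized first
  std::unique_ptr<std::string> m_logger;  // declared second, may use m_cfg

 public:
  explicit Example(std::string cfg)
      : m_cfg(std::move(cfg)),                             // matches declaration order
        m_logger(std::make_unique<std::string>(m_cfg)) {}  // safe: m_cfg is ready
};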
Examples/Python/src/ExaTrkXTrackFinding.cpp (1)
110-128: Logger name, more specific make you should.

For clarity and consistency, use a distinct logger name for TensorRTEdgeClassifier. Apply this diff to specify the logger name:

 return std::make_shared<Alg>(
-    c, getDefaultLogger("EdgeClassifier", lvl));
+    c, getDefaultLogger("TensorRTEdgeClassifier", lvl));
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .gitlab-ci.yml (1 hunks)
- Examples/Python/src/ExaTrkXTrackFinding.cpp (2 hunks)
- Plugins/ExaTrkX/CMakeLists.txt (1 hunks)
- Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (1 hunks)
- Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: CI Bridge / build_gnn_tensorrt
- GitHub Check: linux_physmon
- GitHub Check: linux_examples_test
- GitHub Check: missing_includes
- GitHub Check: linux_ubuntu_extra (ubuntu2204_clang, 20)
- GitHub Check: build_debug
🔇 Additional comments (3)
Examples/Python/src/ExaTrkXTrackFinding.cpp (1)
126-126: Missing configuration member useEdgeFeatures, verify you should.

Inconsistent the configuration is with other classifiers. Include useEdgeFeatures if required. Ensure that all necessary configuration options are included.
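How the binding might gain the member, a hedged sketch it is; the ACTS_PYTHON_STRUCT macros and the surrounding code assumed from the sibling classifier bindings, not shown in this excerpt:

// Hypothetical excerpt, mirroring the Torch classifier binding:
// expose useEdgeFeatures next to modelPath in the Config binding.
ACTS_PYTHON_STRUCT_BEGIN(c, Config);
ACTS_PYTHON_MEMBER(modelPath);
ACTS_PYTHON_MEMBER(useEdgeFeatures);  // parity with the other backends
ACTS_PYTHON_STRUCT_END();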
Plugins/ExaTrkX/CMakeLists.txt (2)
Line range hint 1-38: Well-structured, this CMake configuration is!

Follow consistent patterns for different backends, it does. Proper organization and clarity, I sense.

26-38: Version constraints for TensorRT, specify we must!

Hmmmm, missing version constraints for the TensorRT package, I see. Dangerous this can be, yes. Compatibility issues, it may cause. Apply this change, you should:

-find_package(TensorRT REQUIRED)
+find_package(TensorRT 8.6 REQUIRED)
void *outputMem{nullptr};
std::size_t outputSize = edgeIndex.size(1) * sizeof(float);
cudaMalloc(&outputMem, outputSize);
Check return value of cudaMalloc, you must.
Ensure that memory allocation on the GPU is successful before proceeding.
Apply this diff to check cudaMalloc:
-cudaMalloc(&outputMem, outputSize);
+cudaError_t err = cudaMalloc(&outputMem, outputSize);
+if (err != cudaSuccess) {
+ ACTS_ERROR("cudaMalloc failed: " << cudaGetErrorString(err));
+ // Handle the error appropriately.
+}
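Beyond the inline check, a small reusable helper the repetition can remove; a sketch under an assumed name it is (an existing Acts macro, ACTS_CUDA_CHECK is not):

#include <cuda_runtime_api.h>

#include <stdexcept>
#include <string>

// Hypothetical helper: wraps any CUDA runtime call and throws with the
// decoded error string on failure.
#define ACTS_CUDA_CHECK(expr)                                      \
  do {                                                             \
    cudaError_t acts_cuda_err = (expr);                            \
    if (acts_cuda_err != cudaSuccess) {                            \
      throw std::runtime_error(std::string("CUDA error: ") +       \
                               cudaGetErrorString(acts_cuda_err)); \
    }                                                              \
  } while (0)

// Usage at the call site from this diff:
// ACTS_CUDA_CHECK(cudaMalloc(&outputMem, outputSize));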
auto status = initLibNvInferPlugins(m_trtLogger.get(), "");
assert(status);
Proper error handling for plugin initialization, implement you must.
Relying on assert(status); insufficient it is, especially in release builds where NDEBUG compiles the assertion away entirely. Check the return status and handle errors gracefully, you should.
insufficient it is, especially in release builds. Check the return status and handle errors gracefully, you should.
Apply this diff to handle the error:
auto status = initLibNvInferPlugins(m_trtLogger.get(), "");
-assert(status);
+if (!status) {
+ ACTS_ERROR("Failed to initialize TensorRT plugins.");
+ // Handle the error appropriately, perhaps throw an exception or return an error code.
+}
Committable suggestion skipped: line range outside the PR's diff.
m_engine.reset(m_runtime->deserializeCudaEngine(engineData.data(), fsize));

m_context.reset(m_engine->createExecutionContext());
Deserialization errors, handle you should.
Check if m_engine is successfully created after deserialization. Handle errors to avoid null pointer dereferences.
Apply this diff to verify the engine:
m_engine.reset(m_runtime->deserializeCudaEngine(engineData.data(), fsize));
+if (!m_engine) {
+ ACTS_ERROR("Failed to deserialize CUDA engine.");
+ // Handle the error appropriately.
+}
Committable suggestion skipped: line range outside the PR's diff.
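Extended to the whole bring-up sequence, the same pattern can be; a sketch only this is, using documented TensorRT C++ calls (createInferRuntime, deserializeCudaEngine, createExecutionContext), with member names from this review and the assumption that m_runtime is created in the same constructor:

// Sketch: check each step of the TensorRT bring-up; engineData and
// fsize come from the surrounding file-loading code in this PR.
m_runtime.reset(nvinfer1::createInferRuntime(*m_trtLogger));
if (!m_runtime) {
  throw std::runtime_error("Failed to create TensorRT runtime");
}
m_engine.reset(m_runtime->deserializeCudaEngine(engineData.data(), fsize));
if (!m_engine) {
  throw std::runtime_error("Failed to deserialize CUDA engine");
}
m_context.reset(m_engine->createExecutionContext());
if (!m_context) {
  throw std::runtime_error("Failed to create execution context");
}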
auto status = m_context->enqueueV3(stream);
cudaStreamSynchronize(stream);
Inference execution status, verify you should.
Check the return status of enqueueV3 to confirm that the inference executed successfully.
Apply this diff to handle inference errors:
auto status = m_context->enqueueV3(stream);
+if (!status) {
+ ACTS_ERROR("Inference execution failed.");
+ // Handle the error appropriately.
+}
cudaStreamSynchronize(stream);
Committable suggestion skipped: line range outside the PR's diff.
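Also worth checking, cudaStreamSynchronize is; asynchronous kernel failures surface only at synchronization. A combined sketch, with member names taken from this diff and the surrounding class assumed:

// Sketch: verify both the enqueue and the synchronization; errors from
// work already launched on the stream are reported by the sync call.
// Requires <stdexcept> and <string> in the enclosing file.
if (!m_context->enqueueV3(stream)) {
  throw std::runtime_error("TensorRT inference enqueue failed");
}
if (cudaError_t err = cudaStreamSynchronize(stream); err != cudaSuccess) {
  throw std::runtime_error(std::string("cudaStreamSynchronize failed: ") +
                           cudaGetErrorString(err));
}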
std::ifstream engineFile(m_cfg.modelPath);
engineFile.read(engineData.data(), fsize);
File opening and reading errors, check you must.
Ensure the model file opens successfully before reading. Handle any file I/O errors to prevent unexpected behavior.
Apply this diff to add error handling:
std::ifstream engineFile(m_cfg.modelPath);
+if (!engineFile.is_open()) {
+ ACTS_ERROR("Failed to open model file: " << m_cfg.modelPath);
+ // Handle the error appropriately.
+}
engineFile.read(engineData.data(), fsize);
+if (!engineFile) {
+ ACTS_ERROR("Failed to read model data from: " << m_cfg.modelPath);
+ // Handle the error appropriately.
+}
Committable suggestion skipped: line range outside the PR's diff.
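Folding both checks together, one way the loading might read; a sketch only, the helper readEngineFile hypothetical it is:

#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper consolidating the checks above: opens the engine
// file in binary mode, determines its size, reads it fully, and throws
// on any I/O failure.
std::vector<char> readEngineFile(const std::string &path) {
  std::ifstream engineFile(path, std::ios::binary | std::ios::ate);
  if (!engineFile) {
    throw std::runtime_error("Failed to open model file: " + path);
  }
  const std::streamsize fsize = engineFile.tellg();
  engineFile.seekg(0, std::ios::beg);

  std::vector<char> engineData(static_cast<std::size_t>(fsize));
  if (!engineFile.read(engineData.data(), fsize)) {
    throw std::runtime_error("Failed to read model data from: " + path);
  }
  return engineData;
}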
const ExecutionContext &execContext = {}) override;

Config config() const { return m_cfg; }
torch::Device device() const override { return torch::kCUDA; };
Return a proper torch::Device object, you must.

Current implementation returns a device type, not a torch::Device instance. Correct this, you should.
Apply this diff to return the correct device:
-  torch::Device device() const override { return torch::kCUDA; };
+  torch::Device device() const override { return torch::Device(torch::kCUDA); };
Committable suggestion skipped: line range outside the PR's diff.
.gitlab-ci.yml
Outdated
build_gnn_tensorrt:
  stage: build
  image: nvcr.io/nvidia/tensorrt:24.12-py3
  variables:
    DEPENDENCY_URL: https://acts.web.cern.ch/ACTS/ci/ubuntu-24.04/deps.$DEPENDENCY_TAG.tar.zst

  cache:
    key: ccache-${CI_JOB_NAME}-${CI_COMMIT_REF_SLUG}-${CCACHE_KEY_SUFFIX}
    fallback_keys:
      - ccache-${CI_JOB_NAME}-${CI_DEFAULT_BRANCH}-${CCACHE_KEY_SUFFIX}
    when: always
    paths:
      - ${CCACHE_DIR}

  tags:
    - docker-gpu-nvidia

  script:
    - apt-get update -y
    - git clone $CLONE_URL src
    - cd src
    - git checkout $HEAD_SHA
    - source CI/dependencies.sh
    - cd ..
    - mkdir build
    - >
      cmake -B build -S src
      -DACTS_BUILD_PLUGIN_EXATRKX=ON
      -DACTS_EXATRKX_ENABLE_TENSORRT=ON
      -DPython_EXECUTABLE=$(which python3)
      -DCMAKE_CUDA_ARCHITECTURES="75;86"
Complete, this CI job configuration is not! Missing crucial elements, I sense.
Several improvements needed, there are:
- Build command after cmake configuration, missing it is
- Testing stage for TensorRT functionality, define we must
- Artifacts for downstream jobs, configure we should
- CUDA architectures with other ExaTrkX jobs, align we must
Apply these changes, you should:
 build_gnn_tensorrt:
   stage: build
   image: nvcr.io/nvidia/tensorrt:24.12-py3
   variables:
     DEPENDENCY_URL: https://acts.web.cern.ch/ACTS/ci/ubuntu-24.04/deps.$DEPENDENCY_TAG.tar.zst
+    TORCH_CUDA_ARCH_LIST: "8.0 8.6 8.9 9.0"
   cache:
     key: ccache-${CI_JOB_NAME}-${CI_COMMIT_REF_SLUG}-${CCACHE_KEY_SUFFIX}
     fallback_keys:
       - ccache-${CI_JOB_NAME}-${CI_DEFAULT_BRANCH}-${CCACHE_KEY_SUFFIX}
     when: always
     paths:
       - ${CCACHE_DIR}
+  artifacts:
+    paths:
+      - build/
+    exclude:
+      - build/**/*.o
+    expire_in: 6 hours
   tags:
     - docker-gpu-nvidia
   script:
     - apt-get update -y
     - git clone $CLONE_URL src
     - cd src
     - git checkout $HEAD_SHA
     - source CI/dependencies.sh
     - cd ..
     - mkdir build
     - >
       cmake -B build -S src
       -DACTS_BUILD_PLUGIN_EXATRKX=ON
       -DACTS_EXATRKX_ENABLE_TENSORRT=ON
       -DPython_EXECUTABLE=$(which python3)
       -DCMAKE_CUDA_ARCHITECTURES="75;86"
+    - ccache -z
+    - cmake --build build -- -j6
+    - ccache -s
+
+test_gnn_tensorrt:
+  stage: test
+  needs:
+    - build_gnn_tensorrt
+  image: nvcr.io/nvidia/tensorrt:24.12-py3
+  tags:
+    - docker-gpu-nvidia
+  script:
+    - apt-get update -y
+    - git clone $CLONE_URL src
+    - cd src
+    - git checkout $HEAD_SHA
+    - source CI/dependencies.sh
+    - cd ..
+    - ctest --test-dir build -R TensorRT
Currently, this cannot be compiled in the CI.
--- END COMMIT MESSAGE ---
Summary by CodeRabbit
Release Notes

- New Features
- Infrastructure
- Technical Improvements