-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring Fusion Executor, pulling out compiled kernel #3468
base: main
Are you sure you want to change the base?
Conversation
…d in executor::compile.
…need to run compileRTC and runRTC calls in tests with CompiledKernel instances directly. Also fix profiling calls in KernelExecutor.
…owering but is now checked after lowering.
!test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a few TODO's in this PR, I might not take on all of them in this PR.
buffer << cuda_src.rdbuf(); | ||
return buffer.str(); | ||
} | ||
} // namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything above this is only code motion.
@@ -58,20 +47,11 @@ struct CompiledKernel : public NonCopyable { | |||
int register_spills = -1; | |||
}; | |||
|
|||
// Returns executable function and the ptxas log from compilation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
@@ -253,12 +233,5 @@ void validateCircularBuffering( | |||
kir::Kernel* kernel, | |||
ExpressionEvaluator& expr_eval); | |||
|
|||
//! Query the target GPU version number NVRTC compiles CUDA kernels for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually these are likely just removed as they should now be contained in runtime/compiled_kernel.cpp
@@ -32,117 +30,6 @@ | |||
#include <cstdlib> | |||
|
|||
namespace nvfuser { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
@@ -194,14 +81,6 @@ bool detectComputeSanitizer() { | |||
|
|||
namespace nvfuser { | |||
|
|||
namespace executor_utils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
TODO list:
I'm leaving these to consider in the future.
|
!test |
47 successful checks |
…untime_id, group_id, and device constant in CompiledKernel.
!test |
!test |
All checks have passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just took a pass. Left several comments/questions.
As far as I can see, there's no change in the underlying logic, but it's just code movement. Am I missing anything?
!test |
!test |
Getting a bunch of thunder failures: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/133353224 I was able to reproduce one of them on main, so uncertain what's going on. Clang build was the only other test to fail. Follow up: Thunder failures reproduced on main at this point: #3698 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The Thunder failures are unlikely to have anything to do with this PR.
!test |
!test |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
return lowered_->kernel(); | ||
} | ||
|
||
Fusion* fusion() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this isn't any more useful than kernel() so I removed this in #3725.
|
||
// function to query whether a `CompiledKernel` has a compiled kernel to | ||
// execute | ||
bool hasCompiledKernel() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/NVIDIA/Fuser/pull/3725/files#diff-31a5ef26405804394f573b42c15512e1fb87f930fe7e5bfd95ad4034d867c30fR147 has merged isCompiled and hasCompiledKernel. Having both used to be important for a FusionExecutor which may or may not be a kernel executor -- no longer after your ExecutorAbstract work.
} | ||
// Lowered is needed to compute launch parameters as it uses the CA map. We | ||
// could modify that, but simply generating that part first. | ||
compiled_kernel_ = std::make_unique<CompiledKernel>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand the supposed responsibility between KernelExecutor::compile and CompiledKernel::compile. With this PR, KernelExecutor::compile appears to become a thin wrapper of CompiledKernel::compile that merely does profiling and overrides some compilation/launch parameters. Do you plan to get rid of KernelExecutor::compile so FusionKernelRuntime doesn't need to create a KernelExecutor to compile?
Pull out kernel compilation from the
KernelExecutor
, trying to separate out the two concepts as we will move towards a world where the execution of a kernel is done throughHostIr
.CompiledKernel
class to hold compilation information incompiled_kernel.h/cpp
runtime/executor.h
toruntime/compiled_kernel.h
runtime/executor.cpp
toruntime/compiled_kernel.cpp
runtime/executor_utils.cpp
toruntime/compiled_kernel.cpp
(these are functions only used incompiled_kernel
)sys/utils.cpp
toruntime/compiled_kernel.cpp
(these are functions only used incompiled_kernel
)executor::compileRTC
andexecutor::runRTC
into its own class (RtcKernel
). It shares compilation logic withCompiledKernel
and is incompiled_kernel.h/cpp
Some improvements left for another time:
executor_utils::CudaExecutable
andCompiledKernel
. I'm not sure if this is the right thing to do, partially just because of RTCKernel and CompiledKernel both own aexecutor_utils::CudaExecutable