
Add CUDA build support and some code refinements #581

Merged — 8 commits from cuda into main on Oct 31, 2023
Conversation

wenbingl (Member)

No description provided.

@wenbingl wenbingl requested a review from a team as a code owner October 23, 2023 19:18
@wenbingl wenbingl marked this pull request as draft October 23, 2023 19:18
@wenbingl wenbingl changed the title from "A math op with the first cuda kernel" to "[WIP] A math op with the first cuda kernel" on Oct 23, 2023
wenbingl (Member, Author):

@souptc, @RandySheriffH, it looks like we need at least 4 files to write even the simplest CUDA kernel due to nvcc limitations. Any good ideas for this?

Contributor:

Yes, let's keep it this way to bypass the NVCC incompatibilities for now.
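For readers outside the thread: the split arises because nvcc can't compile the full custom-op headers, so the kernel lives in a `.cu` translation unit behind a plain C++ launcher declaration. A minimal sketch of that kind of four-file layout (file and function names here are hypothetical, not the ones in this PR):

```cpp
// --- neg_pos.h: launcher declaration; parseable by both nvcc and the host compiler.
#pragma once
#include <cuda_runtime.h>
void launch_neg_pos(cudaStream_t stream, const float* x, float* pos, float* neg, int n);

// --- neg_pos.cu: kernel + launcher; the only file handed to nvcc.
#include "neg_pos.h"
__global__ void neg_pos_kernel(const float* x, float* pos, float* neg, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    pos[i] = x[i] > 0.0f ? x[i] : 0.0f;  // positive part
    neg[i] = x[i] < 0.0f ? x[i] : 0.0f;  // negative part
  }
}
void launch_neg_pos(cudaStream_t stream, const float* x, float* pos, float* neg, int n) {
  const int block = 256;
  neg_pos_kernel<<<(n + block - 1) / block, block, 0, stream>>>(x, pos, neg, n);
}

// --- neg_pos_op.h / neg_pos_op.cc: the custom-op wrapper, built by the host
// compiler only, so it may freely include the ORT-extensions headers that
// nvcc rejects; it calls launch_neg_pos() through the declaration above.
```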

Review threads on includes/onnxruntime_customop.hpp (outdated, resolved)
wejoncy (Contributor) commented Oct 24, 2023:

Hi @wenbingl, is it possible to support a general CUDA op? This op would only be responsible for converting tensors to DLPack and collecting the output tensors.
The benefit is that we wouldn't have to write a specific kernel for each specific op, and users would gain the flexibility to support any kind of customized operation, just like what torch extensions do.

wenbingl (Member, Author) commented Oct 24, 2023:

> Is it possible to support a general CUDA op? This op would only be responsible for converting tensors to DLPack and collecting the output tensors. […]

The current goal is to implement kernels just as we do in onnxruntime. Are you suggesting support for CUDA custom ops in Python, which hasn't been implemented yet?

wejoncy (Contributor) commented Oct 25, 2023:

> The current goal is to implement kernels just as we do in onnxruntime. Are you suggesting support for CUDA custom ops in Python, which hasn't been implemented yet?

Yes, it seems quite useful to have a mechanism that supports user-customized CUDA kernels without touching Python inside the graph, as ORT training did with its Python op.
With this, users could easily add custom kernel support by registering a kernel function pointer with ORT, and ORT would just call that function pointer (signature: `(DLTensor* tensors, int tensor_lengths, const char* function_name)`). This would let us support all torch extension ops, TVM ops, or any user-implemented ops without needing to merge those kernels into ORT/ORT-ext.
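For concreteness, a sketch of what that registration could look like. Only the function-pointer signature comes from the comment above; `DLPackKernelFn`, `RegisterDLPackKernel`, and `MyFusedAttention` are hypothetical names, not an existing ORT/ORT-ext API:

```cpp
#include <dlpack/dlpack.h>  // DLTensor

// The signature proposed above: ORT hands the kernel a flat array of DLPack
// tensors (inputs followed by outputs) plus the name of the function to run.
typedef void (*DLPackKernelFn)(DLTensor* tensors, int tensor_lengths,
                               const char* function_name);

// Hypothetical registration hook: ORT/ORT-ext would store the pointer and
// invoke it whenever the named custom op executes, after wrapping the op's
// inputs/outputs as DLTensor views (no copies, same device memory).
void RegisterDLPackKernel(const char* op_name, DLPackKernelFn fn);

// A user-side kernel that could dispatch to a torch-extension, TVM, or
// TensorRT-LLM function looked up by function_name.
extern "C" void MyFusedAttention(DLTensor* tensors, int tensor_lengths,
                                 const char* function_name) {
  // tensors[0..k-1] are inputs, tensors[k..tensor_lengths-1] are outputs;
  // the data pointers already live on the CUDA device, so the callee can
  // launch its kernels directly on that memory.
}

// At library load time:
//   RegisterDLPackKernel("MyFusedAttention", &MyFusedAttention);
```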

wenbingl (Member, Author):
> With this, users could easily add custom kernel support by registering a kernel function pointer with ORT, and ORT would just call that function pointer. […]

It's a good idea, and it would be another 'extension' to the current custom op mechanism. Do you have any use cases, or good torch extension/TVM ops that could be integrated?

wejoncy (Contributor) commented Oct 26, 2023:

> It's a good idea, and it would be another 'extension' to the current custom op mechanism. Do you have any use cases, or good torch extension/TVM ops that could be integrated?

Yes: PagedAttention, FlashAttention, memory_efficient_attention, and the highly optimized w4a16/w8a8 MatMul kernels in TVM-MLC/TensorRT-LLM.
I know it's possible to port that CUDA code into ORT, but it's really hard to maintain.

Notably, all of these kernels can communicate with ORT through DLPack tensors.
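To make that last point concrete: DLPack interop is zero-copy, so ORT only has to describe the device buffer it already owns. A minimal sketch of wrapping a CUDA buffer as a `DLTensor` (struct fields are from dlpack.h; the helper itself and the float32 element type are assumptions):

```cpp
#include <dlpack/dlpack.h>
#include <cstdint>

// Wrap a device buffer that ORT already owns as a DLTensor view; the external
// kernel (torch extension, TVM, ...) reads and writes the very same memory.
DLTensor WrapCudaBuffer(void* device_ptr, int64_t* shape, int32_t ndim,
                        int32_t device_id) {
  DLTensor t{};
  t.data = device_ptr;
  t.device = {kDLCUDA, device_id};  // CUDA device memory, not host
  t.ndim = ndim;
  t.dtype = {kDLFloat, 32, 1};      // float32, 1 lane (assumed element type)
  t.shape = shape;                  // caller keeps the shape array alive
  t.strides = nullptr;              // nullptr means compact row-major layout
  t.byte_offset = 0;
  return t;
}
```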

@wenbingl wenbingl changed the title from "[WIP] A math op with the first cuda kernel" to "Add CUDA build support and some code refinements" on Oct 30, 2023
@wenbingl wenbingl marked this pull request as ready for review October 30, 2023 17:47
@wenbingl wenbingl merged commit a0c2625 into main Oct 31, 2023
40 of 41 checks passed
@wenbingl wenbingl deleted the cuda branch October 31, 2023 04:06