
Add CUDA build support and some code refinements #581

Merged — 8 commits from cuda into main on Oct 31, 2023
Conversation

wenbingl (Member)

No description provided.

@wenbingl wenbingl requested a review from a team as a code owner October 23, 2023 19:18
@wenbingl wenbingl marked this pull request as draft October 23, 2023 19:18
@wenbingl wenbingl changed the title from "A math op with the first cuda kernel" to "[WIP] A math op with the first cuda kernel" on Oct 23, 2023
wenbingl (Member, Author):

@souptc, @RandySheriffH, it looks like we need at least 4 files to write even the simplest CUDA kernel due to nvcc limitations. Any good ideas for this?

Contributor:

Yes, let's keep it this way to bypass the NVCC incompatibilities for now.
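For readers outside the thread: the split arises because nvcc can't compile the full custom-op headers, so the kernel lives in a `.cu` translation unit behind a plain C++ launcher declaration. A minimal sketch of that kind of four-file layout (file and function names here are hypothetical, not the ones in this PR):

```cpp
// --- neg_pos.h: launcher declaration; parseable by both nvcc and the host compiler.
#pragma once
#include <cuda_runtime.h>
void launch_neg_pos(cudaStream_t stream, const float* x, float* pos, float* neg, int n);

// --- neg_pos.cu: kernel + launcher; the only file handed to nvcc.
#include "neg_pos.h"
__global__ void neg_pos_kernel(const float* x, float* pos, float* neg, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    pos[i] = x[i] > 0.0f ? x[i] : 0.0f;  // positive part
    neg[i] = x[i] < 0.0f ? x[i] : 0.0f;  // negative part
  }
}
void launch_neg_pos(cudaStream_t stream, const float* x, float* pos, float* neg, int n) {
  const int block = 256;
  neg_pos_kernel<<<(n + block - 1) / block, block, 0, stream>>>(x, pos, neg, n);
}

// --- neg_pos_op.h / neg_pos_op.cc: the custom-op wrapper, built by the host
// compiler only, so it may freely include the ORT-extensions headers that
// nvcc rejects; it calls launch_neg_pos() through the declaration above.
```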

Review threads on includes/onnxruntime_customop.hpp (outdated, resolved)
wejoncy (Contributor) commented Oct 24, 2023:

Hi @wenbingl, is it possible to support a general CUDA op? This op would only be responsible for converting tensors to DLPack and collecting the output tensors.
The benefit is that we wouldn't have to write a specific kernel for each specific op, and users would gain the flexibility to support any kind of customized operation, just like what torch extensions do.

wenbingl (Member, Author) commented Oct 24, 2023:

> Is it possible to support a general CUDA op? This op would only be responsible for converting tensors to DLPack and collecting the output tensors. […]

The current goal is to implement kernels just as we do in onnxruntime. Are you suggesting support for CUDA custom ops in Python, which hasn't been implemented yet?

wejoncy (Contributor) commented Oct 25, 2023:

> The current goal is to implement kernels just as we do in onnxruntime. Are you suggesting support for CUDA custom ops in Python, which hasn't been implemented yet?

Yes, it seems quite useful to have a mechanism that supports user-customized CUDA kernels without touching Python inside the graph, as ORT training did with its Python op.
With this, users could easily add custom kernel support by registering a kernel function pointer with ORT, and ORT would just call that function pointer (signature: `(DLTensor* tensors, int tensor_lengths, const char* function_name)`). This would let us support all torch extension ops, TVM ops, or any user-implemented ops without needing to merge those kernels into ORT/ORT-ext.
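For concreteness, a sketch of what that registration could look like. Only the function-pointer signature comes from the comment above; `DLPackKernelFn`, `RegisterDLPackKernel`, and `MyFusedAttention` are hypothetical names, not an existing ORT/ORT-ext API:

```cpp
#include <dlpack/dlpack.h>  // DLTensor

// The signature proposed above: ORT hands the kernel a flat array of DLPack
// tensors (inputs followed by outputs) plus the name of the function to run.
typedef void (*DLPackKernelFn)(DLTensor* tensors, int tensor_lengths,
                               const char* function_name);

// Hypothetical registration hook: ORT/ORT-ext would store the pointer and
// invoke it whenever the named custom op executes, after wrapping the op's
// inputs/outputs as DLTensor views (no copies, same device memory).
void RegisterDLPackKernel(const char* op_name, DLPackKernelFn fn);

// A user-side kernel that could dispatch to a torch-extension, TVM, or
// TensorRT-LLM function looked up by function_name.
extern "C" void MyFusedAttention(DLTensor* tensors, int tensor_lengths,
                                 const char* function_name) {
  // tensors[0..k-1] are inputs, tensors[k..tensor_lengths-1] are outputs;
  // the data pointers already live on the CUDA device, so the callee can
  // launch its kernels directly on that memory.
}

// At library load time:
//   RegisterDLPackKernel("MyFusedAttention", &MyFusedAttention);
```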

wenbingl (Member, Author):
> With this, users could easily add custom kernel support by registering a kernel function pointer with ORT, and ORT would just call that function pointer. […]

It's a good idea, and it would be another 'extension' to the current custom op mechanism. Do you have any use cases, or good torch extension/TVM ops that could be integrated?

wejoncy (Contributor) commented Oct 26, 2023:

> It's a good idea, and it would be another 'extension' to the current custom op mechanism. Do you have any use cases, or good torch extension/TVM ops that could be integrated?

Yes: PagedAttention, FlashAttention, memory_efficient_attention, and the highly optimized w4a16/w8a8 MatMul kernels in TVM-MLC/TensorRT-LLM.
I know it's possible to port that CUDA code into ORT, but it's really hard to maintain.

Notably, all of these kernels can communicate with ORT through DLPack tensors.
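To make that last point concrete: DLPack interop is zero-copy, so ORT only has to describe the device buffer it already owns. A minimal sketch of wrapping a CUDA buffer as a `DLTensor` (struct fields are from dlpack.h; the helper itself and the float32 element type are assumptions):

```cpp
#include <dlpack/dlpack.h>
#include <cstdint>

// Wrap a device buffer that ORT already owns as a DLTensor view; the external
// kernel (torch extension, TVM, ...) reads and writes the very same memory.
DLTensor WrapCudaBuffer(void* device_ptr, int64_t* shape, int32_t ndim,
                        int32_t device_id) {
  DLTensor t{};
  t.data = device_ptr;
  t.device = {kDLCUDA, device_id};  // CUDA device memory, not host
  t.ndim = ndim;
  t.dtype = {kDLFloat, 32, 1};      // float32, 1 lane (assumed element type)
  t.shape = shape;                  // caller keeps the shape array alive
  t.strides = nullptr;              // nullptr means compact row-major layout
  t.byte_offset = 0;
  return t;
}
```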

@wenbingl wenbingl changed the title from "[WIP] A math op with the first cuda kernel" to "Add CUDA build support and some code refinements" on Oct 30, 2023
@wenbingl wenbingl marked this pull request as ready for review October 30, 2023 17:47
@wenbingl wenbingl merged commit a0c2625 into main Oct 31, 2023
40 of 41 checks passed
@wenbingl wenbingl deleted the cuda branch October 31, 2023 04:06