[FEA]: Cache cuda.parallel builds #2590
Comments
@leofang might have some input on how we should approach caching (on-disk storage etc.)
Yeah, in
@leofang we could consider caching on the C++ end. I was thinking about caching on the Python side of
Yeah then
Just a note here that #3001 implements caching on the Python side (as an immediate improvement). Caching on the C++ side and file-based caches would still be extremely valuable.
In addition, I think we will want to capture
In [11]: def f(x):
    ...:     def g(y):
    ...:         return x + y
    ...:     return g
    ...:
In [12]: g = f(2)
In [13]: g.__closure__
Out[13]: (<cell at 0x7e13984df220: int object at 0x7e139b700110>,)
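To illustrate why the closure shown in the session above could matter for a cache key, here is a minimal sketch. It assumes the key is derived from `op.__code__.co_code`, as floated in the issue body; `closure_values` and `op_key` are hypothetical helper names, not part of `cuda.parallel`. Two closures created by the same factory share bytecode but capture different values, so a key built from bytecode alone would treat them as the same operator.

```python
def f(x):
    def g(y):
        return x + y
    return g


def closure_values(fn):
    # Hypothetical helper: collect the values captured by fn's closure cells.
    # Functions without a closure have __closure__ set to None.
    cells = fn.__closure__ or ()
    return tuple(cell.cell_contents for cell in cells)


def op_key(fn):
    # Hypothetical cache-key component: bytecode plus captured values.
    # Assumption: captured values are hashable and comparable.
    return (fn.__code__.co_code, closure_values(fn))


g2 = f(2)
g3 = f(3)

# Both closures share the same code object, so the bytecode is identical.
same_bytecode = g2.__code__.co_code == g3.__code__.co_code

# Including the captured values distinguishes the two operators.
keys_differ = op_key(g2) != op_key(g3)

# Two closures over the same value still map to the same key.
keys_match = op_key(f(2)) == op_key(g2)
```

This only handles captured values that can be hashed and compared directly; captured arrays or nested functions would need separate treatment.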
I moved #3215 into a separate issue instead of being a sub-issue of this one. We decided it's not required for an MVP.
Is this a duplicate?
Area
cuda.parallel (Python)
Is your feature request related to a problem? Please describe.
The cuda.parallel module contains a time-consuming JIT compilation (build) step. For reduction, this step invokes the following C API:
cccl/c/parallel/include/cccl/c/reduce.h
Line 33 in 70a2872
On the Python end, this step returns the following structure:
cccl/python/cuda_parallel/cuda/parallel/experimental/__init__.py
Lines 177 to 184 in 70a2872
This structure contains a cubin (like an object file) along with pointers to the kernels in this cubin. User code usually contains more than one invocation of an algorithm with matching parameter types.
Describe the solution you'd like
We should amortize the JIT-ting cost by caching _CCCLDeviceReduceBuildResult based on the parameters affecting C++ codegen. For instance, for
reduce_into = cudax.reduce_into(d_output, d_output, op, h_init)
the following parameters affect codegen (examples in parentheses are meant to indicate different cache entries):
cc field of _CCCLDeviceReduceBuildResult)
op.__code__.co_code?)
Let's say we build two reductions as follows:
Since the parameters described above match between invocations of cudax.reduce_into, the second call should not lead to an invocation of
extern "C" CCCL_C_API CUresult cccl_device_reduce_build
Describe alternatives you've considered
No response
Additional context
No response