[FEA]: Cache cuda.parallel builds #2590

Closed
1 task done
gevtushenko opened this issue Oct 17, 2024 · 7 comments
Labels
2.8.0 target for 2.8.0 release feature request New feature or request.

Comments

@gevtushenko
Collaborator

gevtushenko commented Oct 17, 2024

Is this a duplicate?

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

The cuda.parallel module contains a time-consuming JIT compilation (build) step. For reduction, this step invokes the following C API:

extern "C" CCCL_C_API CUresult cccl_device_reduce_build(

On the Python end, this step returns the following structure:

import ctypes


class _CCCLDeviceReduceBuildResult(ctypes.Structure):
    _fields_ = [("cc", ctypes.c_int),
                ("cubin", ctypes.c_void_p),
                ("cubin_size", ctypes.c_size_t),
                ("library", ctypes.c_void_p),
                ("single_tile_kernel", ctypes.c_void_p),
                ("single_tile_second_kernel", ctypes.c_void_p),
                ("reduction_kernel", ctypes.c_void_p)]

that contains cubin (like an object file) along with pointers to kernels in this cubin. User code usually contains more than one invocation of an algorithm with matching parameter types.

Describe the solution you'd like

We should amortize the JIT-ting cost by caching _CCCLDeviceReduceBuildResult based on the parameters that affect C++ codegen. For instance, for reduce_into = cudax.reduce_into(d_output, d_output, op, h_init), the following parameters affect codegen (examples in parentheses are meant to indicate different cache entries):

  • compute capability of current GPU (cc field of _CCCLDeviceReduceBuildResult)
  • type of input sequence (container, counting iterator, zip iterator, etc)
  • dtype of input sequence (int32, uint64, etc)
  • type of output sequence (container, counting iterator, zip iterator, etc)
  • operator source code (different function bodies, maybe could be op.__code__.co_code?)
  • dtype of initial value
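
As a rough illustration of the list above, the codegen-relevant parameters could be gathered into one hashable tuple. This is only a sketch: make_cache_key is a hypothetical helper, not part of cuda.parallel, and the SimpleNamespace objects are stand-ins for device arrays so the example runs without a GPU. Including co_consts next to co_code is an extra assumption, since literal constants may be stored there rather than in the bytecode itself.

```python
from types import SimpleNamespace


def make_cache_key(cc, d_in, d_out, op, h_init):
    """Hypothetical helper: build a hashable key from the parameters listed above."""
    code = op.__code__
    return (
        cc,                    # compute capability of the current GPU
        type(d_in).__name__,   # kind of input sequence (container, iterator, ...)
        str(d_in.dtype),       # dtype of the input sequence
        type(d_out).__name__,  # kind of output sequence
        code.co_code,          # operator bytecode
        code.co_consts,        # assumption: also track literal constants
        str(h_init.dtype),     # dtype of the initial value
    )


def min_op(a, b):
    return a if a < b else b


def max_op(a, b):
    return a if a > b else b


# Stand-in for a device array; only the .dtype attribute is used here.
arr = SimpleNamespace(dtype="int32")

key_min = make_cache_key(80, arr, arr, min_op, arr)
key_max = make_cache_key(80, arr, arr, max_op, arr)
```

With this key, min_op and max_op map to different cache entries, while repeating the call with the same arguments reproduces key_min.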

Let's say we build two reductions as follows:

import numpy
from numba import cuda

import cuda.parallel.experimental as cudax


def op(a, b):
    return a if a < b else b


dtype = numpy.int32
h_init = numpy.array([42], dtype)
h_input = numpy.array([8, 6, 7, 5, 3, 0, 9], dtype)
d_output = cuda.device_array(1, dtype)
d_input = cuda.to_device(h_input)

red1 = cudax.reduce_into(d_output, d_output, op, h_init)
red2 = cudax.reduce_into(d_output, d_output, op, h_init)

Since the parameters described above match between invocations of cudax.reduce_into, the second call should not lead to an invocation of extern "C" CCCL_C_API CUresult cccl_device_reduce_build.

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Oct 17, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 17, 2024
@gevtushenko
Collaborator Author

@leofang might have some input on how we should approach caching (on-disk storage etc.)

@leofang
Member

leofang commented Oct 17, 2024

Yeah, in cuda.core we'll offer various caches at the Python level, see NVIDIA/cuda-python#176 (can't link the internal design doc here, but we know where to find it 🙂). Not sure if the CCCL C library can easily hook up with a Python-based cache, though.

@gevtushenko
Collaborator Author

Not sure if the CCCL C library can easily hook up with a Python-based cache, though.

@leofang we could consider caching on the C++ end. For now, though, I was thinking about caching on the Python side of cuda.parallel instead, if that makes this any easier.

@leofang
Member

leofang commented Oct 17, 2024

Yeah, then cuda.core should meet your needs. You just need to figure out a way to generate a stable hash key for the items you listed.
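
One possible way to get a key that stays stable across processes (which matters for on-disk storage, because Python's built-in hash() of str and bytes is randomized per process by default) is to feed the items into a cryptographic digest. This is a minimal sketch under stated assumptions: stable_key is a hypothetical helper, not a cuda.core API, and it assumes that repr() of each non-bytes item is deterministic.

```python
import hashlib


def stable_key(*parts):
    """Hypothetical helper: derive a process-independent hex digest from key parts."""
    h = hashlib.sha256()
    for part in parts:
        if isinstance(part, bytes):
            tag, data = b"b", part
        else:
            # Assumption: repr() is deterministic for the values used as key parts.
            tag, data = b"r", repr(part).encode("utf-8")
        h.update(tag)                              # distinguish raw bytes from repr text
        h.update(len(data).to_bytes(8, "little"))  # length prefix keeps parts separated
        h.update(data)
    return h.hexdigest()


def op(a, b):
    return a if a < b else b


key = stable_key(80, "container", "int32", "container", op.__code__.co_code, "int32")
```

The resulting hex string could serve as a file name or dictionary key. Note that bytecode differs between Python versions, so a persistent cache would probably also need the interpreter version among the key parts.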

@jollylili jollylili added the 2.8.0 target for 2.8.0 release label Nov 15, 2024
@shwina
Contributor

shwina commented Dec 2, 2024

Just a note here that #3001 implements caching on the Python side (as an immediate improvement). Caching on the C++ side, and file-based caches would still be extremely valuable.

@shwina
Contributor

shwina commented Dec 12, 2024

operator source code (different function bodies, maybe could be op.__code__.co_code?)

In addition, I think we will want to capture op.__closure__:

In [11]: def f(x):
    ...:     def g(y):
    ...:         return x + y
    ...:     return g
    ...:

In [12]: g = f(2)

In [13]: g.__closure__
Out[13]: (<cell at 0x7e13984df220: int object at 0x7e139b700110>,)
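
To make the point concrete, here is a small sketch of a key that includes the captured values. The helper name op_key is hypothetical, and the sketch assumes the closed-over values are hashable (true for the int used here, not for arbitrary objects).

```python
def op_key(fn):
    """Hypothetical helper: combine the bytecode with the closed-over values."""
    cells = fn.__closure__ or ()  # __closure__ is None for functions without free variables
    return (fn.__code__.co_code, tuple(cell.cell_contents for cell in cells))


def f(x):
    def g(y):
        return x + y
    return g


g2 = f(2)
g3 = f(3)

same_bytecode = g2.__code__.co_code == g3.__code__.co_code
different_keys = op_key(g2) != op_key(g3)
```

Both closures share identical bytecode, so a key built from co_code alone would treat them as the same operator; adding the cell contents keeps them apart.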

@shwina
Contributor

shwina commented Jan 14, 2025

I moved #3215 into a separate issue instead of being a sub-issue of this one. We decided it's not required for an MVP.

@shwina shwina closed this as completed Jan 14, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in CCCL Jan 14, 2025