[FEA]: Cache cuda.parallel builds #2590

gevtushenko · 2024-10-17T00:14:11Z

Is this a duplicate?

I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

The cuda.parallel module contains a time-consuming JIT compilation (build) step. For reduction, this step invokes the following C API:

cccl/c/parallel/include/cccl/c/reduce.h

Line 33 in 70a2872

extern "C" CCCL_C_API CUresult cccl_device_reduce_build(

On the Python end, this step returns the following structure:

cccl/python/cuda_parallel/cuda/parallel/experimental/__init__.py

Lines 177 to 184 in 70a2872

class _CCCLDeviceReduceBuildResult(ctypes.Structure):
    _fields_ = [("cc", ctypes.c_int),
                ("cubin", ctypes.c_void_p),
                ("cubin_size", ctypes.c_size_t),
                ("library", ctypes.c_void_p),
                ("single_tile_kernel", ctypes.c_void_p),
                ("single_tile_second_kernel", ctypes.c_void_p),
                ("reduction_kernel", ctypes.c_void_p)]

This structure contains a cubin (similar to an object file) along with pointers to the kernels in that cubin. User code usually invokes an algorithm more than once with matching parameter types, so the same build is repeated for each invocation.

Describe the solution you'd like

We should amortize the JIT cost by caching _CCCLDeviceReduceBuildResult based on the parameters that affect C++ codegen. For instance, for reduce_into = cudax.reduce_into(d_input, d_output, op, h_init), the following parameters affect codegen (the examples in parentheses indicate values that would produce different cache entries):

compute capability of current GPU (cc field of _CCCLDeviceReduceBuildResult)
type of input sequence (container, counting iterator, zip iterator, etc)
dtype of input sequence (int32, uint64, etc)
type of output sequence (container, counting iterator, zip iterator, etc)
operator source code (different function bodies, maybe could be op.__code__.co_code?)
dtype of initial value

Let's say we build two reductions as follows:

import numpy
from numba import cuda

import cuda.parallel.experimental as cudax


def op(a, b):
    return a if a < b else b


dtype = numpy.int32
h_init = numpy.array([42], dtype)
h_input = numpy.array([8, 6, 7, 5, 3, 0, 9], dtype)
d_output = cuda.device_array(1, dtype)
d_input = cuda.to_device(h_input)

red1 = cudax.reduce_into(d_input, d_output, op, h_init)
red2 = cudax.reduce_into(d_input, d_output, op, h_init)

Since the parameters described above match between the two invocations of cudax.reduce_into, the second call should not invoke cccl_device_reduce_build again.
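To make the expected behavior concrete, here is a minimal in-memory sketch. It is not the cuda.parallel implementation: `fake_build`, `cached_build`, `make_cache_key` and `build_log` are hypothetical names, and `fake_build` merely stands in for the expensive call into cccl_device_reduce_build. The key has one component per codegen-affecting parameter listed above, with the operator represented by op.__code__.co_code as floated in the list.

```python
# Hypothetical sketch: memoize a build step on the codegen-affecting parameters.
_build_cache = {}
build_log = []  # records every real (non-cached) build, for illustration only


def make_cache_key(cc, in_kind, in_dtype, out_kind, op, init_dtype):
    """One tuple entry per parameter that affects C++ codegen."""
    return (cc, in_kind, in_dtype, out_kind, op.__code__.co_code, init_dtype)


def fake_build(key):
    """Stand-in for the expensive JIT build; returns an opaque result object."""
    build_log.append(key)
    return object()


def cached_build(cc, in_kind, in_dtype, out_kind, op, init_dtype):
    """Return a cached build result, building only on the first request."""
    key = make_cache_key(cc, in_kind, in_dtype, out_kind, op, init_dtype)
    if key not in _build_cache:
        _build_cache[key] = fake_build(key)
    return _build_cache[key]
```

With this shape, two calls that agree on every key component share one build result, while changing any single component (for example the input dtype) triggers a new build.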

Describe alternatives you've considered

No response

Additional context

No response


gevtushenko · 2024-10-17T00:17:56Z

@leofang might have some input on how we should approach caching (on-disk storage etc.)

leofang · 2024-10-17T00:47:47Z

Yeah, in cuda.core we'll offer various caches in the Python level, see NVIDIA/cuda-python#176 (can't link the internal design doc here, but we know where to find it 🙂). Not sure if the CCCL C library can easily hook up with a Python-based cache, though.

gevtushenko · 2024-10-17T00:54:44Z

Not sure if the CCCL C library can easily hook up with a Python-based cache, though.

@leofang we could consider caching on the C++ end. For now, though, I was thinking about caching on the Python side of cuda.parallel, if that makes things any easier.

leofang · 2024-10-17T00:57:29Z

Yeah, then cuda.core should meet your needs. You just need to figure out a way to generate a stable hash key for the items you listed.
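As a rough illustration of what "stable" could mean here: Python's built-in hash() is randomized per process for str and bytes unless PYTHONHASHSEED is fixed, so it is not suitable for an on-disk cache. A digest from hashlib is the same in every process. The sketch below assumes the listed items are available as plain strings and that the operator is a Python function; `stable_cache_key` is a hypothetical helper, not a cuda.core or cuda.parallel API.

```python
import hashlib
import sys


def stable_cache_key(cc, in_kind, in_dtype, out_kind, op, init_dtype):
    """Hypothetical helper: a hex digest that is identical across processes."""
    parts = [
        "py%d.%d" % sys.version_info[:2],  # bytecode is Python-version specific
        str(cc),
        in_kind,
        in_dtype,
        out_kind,
        init_dtype,
    ]
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(op.__code__.co_code)  # operator body, as suggested in the issue
    return digest.hexdigest()
```

The resulting string could serve as a file name or database key for a persistent cache. Whether co_code alone is a sufficient description of the operator is a separate question.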

shwina · 2024-12-02T22:43:12Z

Just a note here that #3001 implements caching on the Python side (as an immediate improvement). Caching on the C++ side, and file-based caches would still be extremely valuable.

shwina · 2024-12-12T14:55:48Z

operator source code (different function bodies, maybe could be op.__code__.co_code?)

In addition, I think we will want to capture op.__closure__:

In [11]: def f(x):
    ...:     def g(y):
    ...:         return x + y
    ...:     return g
    ...:

In [12]: g = f(2)

In [13]: g.__closure__
Out[13]: (<cell at 0x7e13984df220: int object at 0x7e139b700110>,)
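To show why the closure matters, here is a small sketch using the same f as above. Closures created by f(2) and f(3) share identical bytecode, so a key built only from co_code would treat them as the same operator. `op_key` is a hypothetical helper; it assumes the captured values are comparable (and hashable, if the key is used in a dict), and it also includes co_consts because two bodies that differ only in a literal constant have the same co_code.

```python
def f(x):
    def g(y):
        return x + y
    return g


def op_key(op):
    """Hypothetical operator key: bytecode, constants, and captured values."""
    closure = op.__closure__ or ()  # __closure__ is None when nothing is captured
    captured = tuple(cell.cell_contents for cell in closure)
    return (op.__code__.co_code, op.__code__.co_consts, captured)


g2 = f(2)
g3 = f(3)

# Same bytecode, different captured value: only the closure tells them apart.
same_bytecode = g2.__code__.co_code == g3.__code__.co_code
different_keys = op_key(g2) != op_key(g3)
```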

shwina · 2025-01-14T21:30:04Z

I moved #3215 into a separate issue instead of being a sub-issue of this one. We decided it's not required for an MVP.

gevtushenko added the feature request label Oct 17, 2024

github-project-automation bot added this to CCCL Oct 17, 2024

github-project-automation bot moved this to Todo in CCCL Oct 17, 2024

gevtushenko assigned rwgk Oct 17, 2024

jollylili added the 2.8.0 label Nov 15, 2024

gevtushenko assigned shwina and unassigned rwgk Dec 20, 2024

rwgk mentioned this issue Dec 20, 2024

[cuda-python]: Implement compiler caches #3193

Closed

shwina mentioned this issue Dec 21, 2024

cuda.parallel: In-memory caching of cuda.parallel build objects #3216

Merged

2 tasks

shwina closed this as completed Jan 14, 2025

github-project-automation bot moved this from Todo to Done in CCCL Jan 14, 2025
