
cuda.parallel: In-memory caching of cuda.parallel build objects #3216

Merged (12 commits, Jan 1, 2025)
Empty file.
16 changes: 16 additions & 0 deletions python/cuda_parallel/cuda/parallel/experimental/_utils/cai.py
@@ -0,0 +1,16 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. ALL RIGHTS RESERVED.
#
#
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

"""
Utilities for extracting information from `__cuda_array_interface__`.
"""

import numpy as np

from ..typing import DeviceArrayLike


def get_dtype(arr: DeviceArrayLike) -> np.dtype:
Comment (Member): FYI this can be replaced by StridedMemoryView once cuda.core becomes a dependency.

    return np.dtype(arr.__cuda_array_interface__["typestr"])
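For illustration, a minimal sketch of how get_dtype behaves (not from the PR; it assumes CuPy as one producer of __cuda_array_interface__):

    import cupy as cp
    import numpy as np

    d_arr = cp.empty(8, dtype=np.float32)
    print(d_arr.__cuda_array_interface__["typestr"])  # '<f4' (little-endian float32)
    print(get_dtype(d_arr))                           # float32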
@@ -13,6 +13,9 @@
 from .. import _cccl as cccl
 from .._bindings import get_paths, get_bindings
 from .._caching import cache_with_key
+from ..typing import DeviceArrayLike
+from ..iterators._iterators import IteratorBase
+from .._utils import cai as cai
Comment (Contributor): Oversight? (delete "as cai")
Reply (Contributor Author): Fixed in 56f2c61.



 class _Op:
@@ -42,12 +45,18 @@ def _dtype_validation(dt1, dt2):

 class _Reduce:
     # TODO: constructor shouldn't require concrete `d_in`, `d_out`:
-    def __init__(self, d_in, d_out, op: Callable, h_init: np.ndarray):
+    def __init__(
+        self,
+        d_in: DeviceArrayLike | IteratorBase,
Comment (Contributor Author): Note to self: Python 3.7 isn't going to like this.
Reply (Contributor): Are you reminding yourself to use Union[DeviceArrayLike, IteratorBase]?
Reply (Contributor Author): Instead, I used from __future__ import annotations, which will make migrating easy.
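For context, a minimal illustration (not part of the PR) of why the __future__ import sidesteps the Python 3.7 concern: annotations are stored as strings and never evaluated at runtime, so the X | Y syntax never has to execute on older interpreters.

    from __future__ import annotations  # all annotations become lazily-stored strings

    def f(x: int | str) -> None:  # parses fine; the union is never evaluated
        ...

    print(f.__annotations__["x"])  # 'int | str'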

+        d_out: DeviceArrayLike,
Comment (Collaborator): Remark: later on we'll have to support output iterators, like tabulate output iterator and transform output iterator.
+        op: Callable,
+        h_init: np.ndarray,
+    ):
         d_in_cccl = cccl.to_cccl_iter(d_in)
         self._ctor_d_in_cccl_type_enum_name = cccl.type_enum_as_name(
             d_in_cccl.value_type.type.value
         )
-        self._ctor_d_out_dtype = d_out.dtype
+        self._ctor_d_out_dtype = cai.get_dtype(d_out)
         self._ctor_init_dtype = h_init.dtype
         cc_major, cc_minor = cuda.get_current_device().compute_capability
         cub_path, thrust_path, libcudacxx_path, cuda_include_path = get_paths()
@@ -120,9 +129,14 @@ def __del__(self):
         bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))


-def make_cache_key(d_in, d_out, op, h_init):
-    d_in_key = d_in.dtype if hasattr(d_in, "__cuda_array_interface__") else d_in
-    d_out_key = d_out.dtype if hasattr(d_out, "__cuda_array_interface__") else d_out
+def make_cache_key(
+    d_in: DeviceArrayLike | IteratorBase,
+    d_out: DeviceArrayLike,
+    op: Callable,
+    h_init: np.ndarray,
+):
+    d_in_key = d_in if isinstance(d_in, IteratorBase) else cai.get_dtype(d_in)
+    d_out_key = d_out if isinstance(d_out, IteratorBase) else cai.get_dtype(d_out)
     op_key = (op.__code__.co_code, op.__code__.co_consts, op.__closure__)
     h_init_key = h_init.dtype
     return (d_in_key, d_out_key, op_key, h_init_key)
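To make the caching behavior concrete, a hedged sketch (assuming CuPy arrays) of when two calls hash to the same key: the key depends on the input/output dtypes, the op's bytecode, constants, and closure, and the init dtype, but not on array identity or length.

    import cupy as cp
    import numpy as np

    def add(x, y):
        return x + y

    out = cp.empty(1, dtype=np.int32)
    h_init = np.array([0], dtype=np.int32)

    key_a = make_cache_key(cp.array([1, 2, 3], dtype=np.int32), out, add, h_init)
    key_b = make_cache_key(cp.array([4, 5], dtype=np.int32), out, add, h_init)
    assert key_a == key_b  # same dtypes, op, and init dtype -> same cached build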
@@ -131,7 +145,12 @@ def make_cache_key(d_in, d_out, op, h_init):
 # TODO Figure out `sum` without operator and initial value
 # TODO Accept stream
 @cache_with_key(make_cache_key)
Comment (Member): Maybe irrelevant to this PR, just for my curiosity: what if d_in/d_out are non-contiguous 1D arrays? Do we handle the stride somewhere?
Reply (Contributor Author): No, it's a good point. I believe we need these to be contiguous. Opened #3223 and will address this in a follow-up.
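A hypothetical sketch (not part of this PR or of #3223) of what a CAI-based 1-D contiguity check could look like; per the CAI specification, a strides entry of None means the data is C-contiguous.

    import numpy as np

    def is_contiguous_1d(arr) -> bool:  # hypothetical helper, for illustration only
        cai = arr.__cuda_array_interface__
        strides = cai.get("strides")
        if strides is None:
            return True  # CAI convention: None means C-contiguous
        itemsize = np.dtype(cai["typestr"]).itemsize
        return len(strides) == 1 and strides[0] == itemsize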

-def reduce_into(d_in, d_out, op: Callable, h_init: np.ndarray):
+def reduce_into(
+    d_in: DeviceArrayLike | IteratorBase,
+    d_out: DeviceArrayLike,
+    op: Callable,
+    h_init: np.ndarray,
+):
"""Computes a device-wide reduction using the specified binary ``op`` functor and initial value ``init``.

Example:
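The docstring's example is collapsed above. As a hedged sketch of what this PR enables (assuming CuPy for device allocation and the algorithms entry point), two calls with matching dtypes and the same op now return the same cached build object instead of triggering a recompilation:

    import cupy as cp
    import numpy as np
    from cuda.parallel.experimental import algorithms

    def add(x, y):
        return x + y

    h_init = np.array([0], dtype=np.int32)
    d_out = cp.empty(1, dtype=np.int32)

    r1 = algorithms.reduce_into(cp.zeros(10, np.int32), d_out, add, h_init)
    r2 = algorithms.reduce_into(cp.zeros(99, np.int32), d_out, add, h_init)
    assert r1 is r2  # in-memory cache hit: same dtypes, op, and init dtype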
10 changes: 10 additions & 0 deletions python/cuda_parallel/cuda/parallel/experimental/typing.py
@@ -0,0 +1,10 @@
from typing import Protocol


class DeviceArrayLike(Protocol):
Comment (Contributor): TypeWithCUDAArrayInterface would be much more expressive.

Reply (Contributor Author): I used DeviceArrayLike to match NumPy's ArrayLike protocol. Not tying it to CAI would also enable us to extend this to a union type to include objects supporting other protocols (like DLPack).

Reply (Member): FYI, StridedMemoryView from cuda.core encapsulates both CAI and DLPack.

Reply (Contributor Author): I think StridedMemoryView and DeviceArrayLike are significantly different things, though they can be used in conjunction with one another. The former is a library-neutral implementation of the CAI/DLPack "protocols", while the latter is a true Protocol (same word, two subtly different meanings).

DeviceArrayLike is useful for saying "this function accepts any device-array-like object". StridedMemoryView is useful for converting library-specific host or device arrays into a library-agnostic type. Thus I can imagine a function accepting a DeviceArrayLike object and, within that function, converting it into a StridedMemoryView.
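As a sketch of that last point (assuming the cuda.core experimental API; exact names unverified against any particular release):

    from cuda.core.experimental.utils import StridedMemoryView

    def describe_device_array(arr: DeviceArrayLike) -> None:
        # Accept anything device-array-like, then normalize it internally.
        view = StridedMemoryView(arr, stream_ptr=-1)  # -1: no stream ordering requested
        print(view.shape, view.dtype, view.is_device_accessible)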

"""
Objects representing a device array, having a `.__cuda_array_interface__`
attribute.
"""

__cuda_array_interface__: dict
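Since Protocol typing is structural, any object exposing the attribute conforms. A minimal illustration with a hypothetical stand-in class (useful in tests or for type-checking):

    class FakeDeviceArray:  # hypothetical: satisfies DeviceArrayLike structurally
        def __init__(self, typestr: str):
            self.__cuda_array_interface__ = {
                "shape": (0,),
                "typestr": typestr,
                "data": (0, False),
                "version": 3,
            }

    def typestr_of(arr: DeviceArrayLike) -> str:
        return arr.__cuda_array_interface__["typestr"]

    print(typestr_of(FakeDeviceArray("<f4")))  # <f4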