cuda.parallel: In-memory caching of `cuda.parallel` build objects #3216

shwina · 2024-12-21T13:22:10Z

Description

Closes #3214.

High level changes

This PR introduces in-memory caching of build objects. To achieve this, it adds a helper decorator, @cache_with_key, which caches the results of a function in memory and accepts a custom function for constructing the cache key from the arguments.
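For readers who want a concrete picture, a decorator of this shape could look roughly like the sketch below. This is a simplified illustration, not the code in `_caching.py`: the names `deco`, `build`, and `calls` are made up for the example, and the real helper also folds device information into the key.

```python
import functools


def cache_with_key(key):
    """Cache results of the decorated function, keyed by key(*args, **kwargs)."""

    def deco(func):
        cache = {}  # lives for the lifetime of the decorated function

        @functools.wraps(func)
        def inner(*args, **kwargs):
            cache_key = key(*args, **kwargs)
            if cache_key not in cache:
                cache[cache_key] = func(*args, **kwargs)
            return cache[cache_key]

        return inner

    return deco


calls = []  # records every real (uncached) build, for illustration only


# Hypothetical build function: the key ignores the text itself and
# only looks at its type and the repeat count.
@cache_with_key(lambda text, repeat: (type(text), repeat))
def build(text, repeat):
    calls.append((text, repeat))
    return object()


first = build("abc", 2)
second = build("xyz", 2)  # same key (str, 2): the cached object is reused
third = build("abc", 3)   # different key: a new object is built
```

The key function decides what counts as "the same build", which is why two calls with different arguments can still share one cached result.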

The cache key is constructed from the considerations listed in #2590:

for device array inputs (and outputs), the .dtype is used to compute the corresponding component of the cache key.
for iterator inputs, the .kind is used to compute the corresponding component of the cache key.
to compute the cache key component corresponding to the reduction function, the function is wrapped in a CachableFunction type, which defines __eq__ and __hash__ such that instances whose bytecode and closure are equal compare equal (ignoring attributes such as function names and docstrings).
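A minimal sketch of such a wrapper, assuming identity is defined by the bytecode, the constants, and the values captured in the closure (the three components that appear in the review snippet of `make_cache_key`). The attribute name `_identity` and the helper functions below are illustrative only, and closure contents are assumed to be hashable.

```python
class CachableFunction:
    """Wrap a function so that equality is based on what it computes."""

    def __init__(self, func):
        self._func = func
        closure = func.__closure__ or ()
        # Compare captured values, not the cell objects themselves.
        contents = tuple(cell.cell_contents for cell in closure)
        self._identity = (
            func.__code__.co_code,
            func.__code__.co_consts,
            contents,
        )

    def __eq__(self, other):
        if not isinstance(other, CachableFunction):
            return NotImplemented
        return self._identity == other._identity

    def __hash__(self):
        return hash(self._identity)


def add(a, b):
    return a + b


def plus(x, y):  # same body as `add`, different names
    return x + y


def mul(a, b):
    return a * b


def make_scaler(factor):
    def scale(x):
        return x * factor

    return scale
```

With this definition `CachableFunction(add)` and `CachableFunction(plus)` hash and compare equal, while `mul` does not. Note that the sketch does not inspect `co_names`, so two functions that differ only in which global they call would also compare equal; a production version would likely need to account for that.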

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-12-21T13:22:15Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

shwina · 2024-12-21T13:22:31Z

/ok to test

github-actions · 2024-12-21T13:47:28Z

🟩 CI finished in 23m 36s: Pass: 100%/1 | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s

🟩 python: Pass: 100%/1 | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 23m 36s | Avg: 23m 36s | Max: 23m 36s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

🏃‍ Runner counts (total jobs: 1)

#	Runner
1	`linux-amd64-gpu-v100-latest-1`

shwina · 2024-12-22T12:14:47Z

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

-    def __init__(self, d_in, d_out, op: Callable, h_init: np.ndarray):
+    def __init__(
+        self,
+        d_in: DeviceArrayLike | IteratorBase,


Note to self: Python 3.7 isn't going to like this.

Are you reminding yourself to use Union[DeviceArrayLike, IteratorBase]?

Instead, I used from __future__ import annotations which will make migrating easy.
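As a rough illustration of why the future import helps: it stores annotations as strings instead of evaluating them, so `A | B` is never executed when the function is defined. The sketch below compiles a small source string with the corresponding compiler flag, which has the same effect as putting `from __future__ import annotations` at the top of a module. The class bodies are empty placeholders, not the real types.

```python
import __future__

# Source for a tiny module that uses the `A | B` annotation syntax.
SOURCE = '''
class DeviceArrayLike: ...
class IteratorBase: ...

def build(d_in: DeviceArrayLike | IteratorBase):
    return d_in
'''

# Compiling with the `annotations` feature flag is equivalent to the module
# starting with `from __future__ import annotations`.
code = compile(SOURCE, "<demo>", "exec", flags=__future__.annotations.compiler_flag)
namespace = {}
exec(code, namespace)

# The annotation was kept as text and never evaluated.
annotation = namespace["build"].__annotations__["d_in"]
print(repr(annotation))
```

Because the annotation stays a string, the `|` operator between two classes is only a concern for tools that later evaluate the annotation, not for importing the module.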

github-actions · 2024-12-22T12:38:07Z

🟩 CI finished in 23m 41s: Pass: 100%/1 | Total: 23m 41s | Avg: 23m 41s | Max: 23m 41s

rwgk · 2024-12-23T10:32:51Z

python/cuda_parallel/cuda/parallel/experimental/_caching.py

+import functools
+
+
+def cache_with_key(key):


key_factory?

I went with key as it's the name for the similar argument in e.g., sorted and cachetools.

rwgk · 2024-12-23T10:41:10Z

python/cuda_parallel/cuda/parallel/experimental/_caching.py

+                result = func(*args, **kwargs)
+                cache[cache_key] = result
+            # `cache_key` *must* be in `cache`, use `.get()`
+            # as it is faster:


I was surprised to read that, and ChatGPT does not agree.

My prompt was: "If obj is a Python dict, is obj.get(key) faster than obj[key]?"

I recommend keeping this code straightforward and intuitive: return cache[cache_key]

You're right. I think my memory served me poorly here, and the quick benchmark I did turns out to give inconsistent results. I changed it to just [].

rwgk · 2024-12-23T10:43:49Z

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

-    def __init__(self, d_in, d_out, op: Callable, h_init: np.ndarray):
+    def __init__(
+        self,
+        d_in: DeviceArrayLike | IteratorBase,


Are you reminding yourself to use Union[DeviceArrayLike, IteratorBase]?

rwgk · 2024-12-23T10:45:24Z

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

+from .._caching import cache_with_key
+from ..typing import DeviceArrayLike
+from ..iterators._iterators import IteratorBase
+from .._utils import cai as cai


Oversight? (delete as cai)

Fixed in 56f2c61.

rwgk · 2024-12-23T11:22:30Z

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

+    d_in_key = d_in if isinstance(d_in, IteratorBase) else cai.get_dtype(d_in)
+    d_out_key = d_out if isinstance(d_out, IteratorBase) else cai.get_dtype(d_out)
+    op_key = (op.__code__.co_code, op.__code__.co_consts, op.__closure__)


The d_out type hint is DeviceArrayLike only, but the rhs of the d_out_key expression tests for IteratorBase.

It'll not be great if this code is copy-pasted as we add more algorithms.

One idea would be to introduce helper functions, but that would only be slightly better. I wonder if we could do much better, right in your decorator. You could loop over args and kwargs, and use isinstance() and potentially typing.get_type_hints() to check for supported argument types; we only have a few. This could be fast and compact. The entire make_cache_key() function wouldn't be needed.

The d_out type hint is DeviceArrayLike only, but the rhs of the d_out_key expression tests for IteratorBase.

Ah, nice observation. I changed it to just cai.get_dtype(d_out). 56f2c61.

I wonder if we could do much better, right in your decorator.

We could, but I'd prefer separating the concerns here, at the risk of a tiny bit of logic repetition across usages of cache_with_key. It also helps keep cache_with_key smaller, more generic, and more explainable if we didn't specialize it to deal with DeviceArrayLike arguments in a specific way.
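To make the separation concrete, here is a sketch of an algorithm-specific key function that stays outside the generic decorator. It only handles array inputs; `FakeDeviceArray` is a hypothetical stand-in for a device array, and `get_dtype` is a simplified guess at what `cai.get_dtype` does (reading `typestr` from `__cuda_array_interface__`).

```python
import numpy as np


class FakeDeviceArray:
    """Hypothetical stand-in exposing only __cuda_array_interface__."""

    def __init__(self, dtype, size):
        dtype = np.dtype(dtype)
        self.__cuda_array_interface__ = {
            "shape": (size,),
            "typestr": dtype.str,
            "data": (0, False),  # dummy pointer, never dereferenced here
            "version": 3,
        }


def get_dtype(arr):
    # Assumed behaviour: derive the dtype from the interface's typestr.
    return np.dtype(arr.__cuda_array_interface__["typestr"])


def make_cache_key(d_in, d_out, op, h_init):
    # Only types enter the key; sizes, pointers and element values do not.
    d_in_key = get_dtype(d_in)
    d_out_key = get_dtype(d_out)
    op_key = (op.__code__.co_code, op.__code__.co_consts)
    h_init_key = h_init.dtype
    return (d_in_key, d_out_key, op_key, h_init_key)


def add(a, b):
    return a + b


key_small = make_cache_key(
    FakeDeviceArray("int32", 8), FakeDeviceArray("int32", 1), add, np.zeros(1, "int32")
)
key_large = make_cache_key(
    FakeDeviceArray("int32", 1 << 20), FakeDeviceArray("int32", 1), add, np.ones(1, "int32")
)
key_float = make_cache_key(
    FakeDeviceArray("float64", 8), FakeDeviceArray("float64", 1), add, np.zeros(1, "float64")
)
```

Each new algorithm would supply its own small key function like this one, while the decorator itself knows nothing about device arrays.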

rwgk · 2024-12-23T11:39:46Z

python/cuda_parallel/cuda/parallel/experimental/typing.py

+from typing import Protocol
+
+
+class DeviceArrayLike(Protocol):


TypeWithCUDAArrayInterface

would be much more expressive.

I used DeviceArrayLike to match NumPy's ArrayLike protocol. Not tying it to CAI would also enable us to extend this to a union type to include objects supporting other protocols (like dlpack).

FYI, StridedMemoryView from cuda.core encapsulates both CAI and DLPack.

I think StridedMemoryView and DeviceArrayLike are significantly different things. I do think they can be used in conjunction with one another.

The former is a library-neutral implementation of the CAI/dlpack "protocols", while the latter is a true Protocol (same word, two subtly different meanings).

DeviceArrayLike is useful to say "this function accepts any device-array like object".
StridedMemoryView is useful for converting library-specific, host or device arrays, into a library-agnostic type. Thus I can imagine a function accepting a DeviceArrayLike object, and within that function converting that object into a StridedMemoryView.
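The distinction can be sketched with a structural type. The body of the real `DeviceArrayLike` is not shown in this thread, so the single attribute below, the `runtime_checkable` decorator, and the helper names are assumptions made for the example (requires Python 3.8+ for `typing.Protocol`).

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class DeviceArrayLike(Protocol):
    """Any object that exposes the CUDA Array Interface attribute."""

    __cuda_array_interface__: Dict[str, Any]


class FakeDeviceArray:
    """Hypothetical array type; it never inherits from DeviceArrayLike."""

    def __init__(self):
        self.__cuda_array_interface__ = {
            "shape": (4,),
            "typestr": "<i4",
            "data": (0, False),
            "version": 3,
        }


def describe(arr: DeviceArrayLike) -> str:
    # Accepts anything structurally compatible, regardless of library.
    return arr.__cuda_array_interface__["typestr"]


typestr = describe(FakeDeviceArray())
```

A function annotated this way states what it accepts; converting the argument into a library-neutral view is a separate step that can happen inside the function.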

rwgk · 2024-12-23T11:42:04Z

python/cuda_parallel/cuda/parallel/experimental/iterators/_iterators.py

@@ -125,10 +162,11 @@ def impl(ptr, offset):
 class RawPointer(IteratorBase):
    def __init__(self, ptr: int, ntype: types.Type):


Maybe rename ntype to value_type everywhere while you're at it? And then inline the assignments right in the super().__init__() calls.

Done in 56f2c61.

rwgk · 2024-12-23T12:05:33Z

python/cuda_parallel/tests/test_iterators.py

+    assert TransformIterator(ary1, op1) == TransformIterator(ary1, op1)
+    assert TransformIterator(ary1, op1) != TransformIterator(ary1, op2)
+    assert TransformIterator(ary1, op1) == TransformIterator(ary1, op3)
+    assert TransformIterator(ary1, op1) != TransformIterator(ary2, op1)


Hm ... if ary1 and ary2 have the same dtype, don't we want this to be equal?

No, transforming ary1 and ary2 with op1 would yield different results so the objects themselves should not compare equal.

Now, in the context of build_reduce, I'm not exactly sure. Can a build_reduce object built for TransformIterator(ary1, op1) be reused for TransformIterator(ary2, op1)? (perhaps cc: @gevtushenko as well). In that case, we would have to specialize the reduce.make_cache_key function to handle that.

I suppose along similar lines, can a build_reduce object built for CountingIterator(np.int32(0)) be reused for CountingIterator(np.int32(1))? I suppose yes...

I like the other discussion we had on another PR since it leaves little room for doubts here. We should just follow what happens in C++.

Runtime comparison of two iterators of the same type should yield false if the state of these iterators is different:

#include <thrust/iterator/transform_iterator.h>

#include <iostream>

struct op_t {
  __host__ __device__ int operator()(int x) const { return x * 2; }
};

int main() {
  int data[] = {1, 2};
  // Same iterator type, different state (base pointer differs by one element).
  auto t1 = thrust::make_transform_iterator(data, op_t{});
  auto t2 = thrust::make_transform_iterator(data + 1, op_t{});
  if (t1 != t2) {
    std::cout << "not equal" << std::endl;
  }
}

Regarding the build step, it makes sense to match the logic of C++ as well. When you have cub::DeviceReduce::Reduce(..., t1, ...) and cub::DeviceReduce::Reduce(..., t2, ...) the implicit instantiation leads to only one instance of reduction, because types match. So what should go into the build cache is types of iterators rather than iterators themselves.

OK, thanks. We need some similar notion of "iterator type" in Python then, because type(it) doesn't encode the same information as it does in C++. Thinking of a solution...

I introduced an IteratorKind type which holds the same information that the type would in C++. ~~Iterators with different state would not compare equal~~*, but their .kind would compare equal. We cache reduce_build objects on the .kind. This change is introduced in c822da7.

Further, @gevtushenko noted that the .kind encapsulates the same kind of information as .abi_name. In fact, the .kind of an iterator holds even more information - for TransformIterator, it holds information about the bytecode (and closures) of the unary op. In light of this, I've replaced the use of .abi_name with .kind. For each iterator kind, a unique suffix is generated and used to determine the ABI names of the advance, deref, and other functions (such as the unary operation for TransformIterator). This solves the previous problem we had where two different unary operations with the same name would have the same ABI name. This change is in 58b6f69.

*based on the discussion in #3216 (comment), I removed the functionality for comparing iterators (instead, this PR focuses on just comparing iterator kinds).
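A simplified sketch of the idea follows. `IteratorKind` and `.kind` are the names used in this thread, but the constructor arguments, the counter-based suffix scheme, and the toy `CountingIterator` here are assumptions for illustration rather than the code from c822da7 or 58b6f69.

```python
import itertools
from functools import lru_cache


class IteratorKind:
    """Holds what a C++ iterator *type* would encode, but no runtime state."""

    def __init__(self, iterator_type, value_type):
        self._key = (iterator_type, value_type)

    def __eq__(self, other):
        if not isinstance(other, IteratorKind):
            return NotImplemented
        return self._key == other._key

    def __hash__(self):
        return hash(self._key)


_suffix_counter = itertools.count()


@lru_cache(maxsize=None)
def get_abi_suffix(kind):
    # Equal kinds hit the same cache entry, so they share one suffix;
    # a new kind draws the next number from the counter.
    return str(next(_suffix_counter))


class CountingIterator:
    """Toy iterator: `start` is state, `value_type` is part of the kind."""

    def __init__(self, start, value_type):
        self.start = start
        self.kind = IteratorKind(type(self), value_type)


it_a = CountingIterator(0, "int32")
it_b = CountingIterator(1, "int32")  # different state, same kind
it_c = CountingIterator(0, "int64")  # different value type, different kind
```

Here `it_a.kind == it_b.kind`, so a build cached for one can be reused for the other, while `it_c` gets its own kind and its own ABI suffix.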

github-actions · 2024-12-23T15:13:42Z

🟩 CI finished in 23m 11s: Pass: 100%/1 | Total: 23m 11s | Avg: 23m 11s | Max: 23m 11s

gevtushenko · 2024-12-23T15:19:52Z

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

+def make_cache_key(
+    d_in: DeviceArrayLike | IteratorBase,
+    d_out: DeviceArrayLike,
+    op: Callable,
+    h_init: np.ndarray,
+):
+    d_in_key = d_in if isinstance(d_in, IteratorBase) else cai.get_dtype(d_in)
+    d_out_key = d_out if isinstance(d_out, IteratorBase) else cai.get_dtype(d_out)
+    op_key = (op.__code__.co_code, op.__code__.co_consts, op.__closure__)
+    h_init_key = h_init.dtype
+    return (d_in_key, d_out_key, op_key, h_init_key)


important: there's an implicit state affecting the key. As written, I'd get the same reducer object for devices of different architectures:

cudaSetDevice(0)
reducer_1 = reduce_into(d_in, d_out, ...)
cudaSetDevice(1)
reducer_2 = reduce_into(d_in, d_out, ...)

Let's incorporate cc_major, cc_minor = cuda.get_current_device().compute_capability somewhere in the key to address that.

Good call. I did that in 56f2c61.
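The effect of adding the compute capability to the key can be sketched without a GPU by injecting the lookup. In the PR the value comes from `cuda.get_current_device().compute_capability`; the `get_compute_capability` parameter, the `current_device` dictionary, and `build_reducer` below are hypothetical stand-ins so that the example runs anywhere.

```python
import functools


def cache_with_key(key, get_compute_capability):
    """Like a plain keyed cache, but the current device is part of the key."""

    def deco(func):
        cache = {}

        @functools.wraps(func)
        def inner(*args, **kwargs):
            cc = get_compute_capability()  # implicit state made explicit
            cache_key = (key(*args, **kwargs), cc)
            if cache_key not in cache:
                cache[cache_key] = func(*args, **kwargs)
            return cache[cache_key]

        return inner

    return deco


# Stand-in for "the currently selected device".
current_device = {"cc": (7, 0)}


@cache_with_key(lambda dtype: dtype, lambda: current_device["cc"])
def build_reducer(dtype):
    return object()


on_first = build_reducer("int32")
again_on_first = build_reducer("int32")  # same device: cache hit
current_device["cc"] = (9, 0)            # simulate switching devices
on_second = build_reducer("int32")       # different architecture: rebuilt
```

Without the `cc` component, the third call would have returned the object built for the first device.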

github-actions · 2024-12-23T16:02:44Z

🟩 CI finished in 24m 15s: Pass: 100%/1 | Total: 24m 15s | Avg: 24m 15s | Max: 24m 15s

github-actions · 2024-12-23T18:27:34Z

🟩 CI finished in 23m 34s: Pass: 100%/1 | Total: 23m 34s | Avg: 23m 34s | Max: 23m 34s

github-actions · 2024-12-24T15:21:47Z

🟩 CI finished in 23m 51s: Pass: 100%/1 | Total: 23m 51s | Avg: 23m 51s | Max: 23m 51s

rwgk

Looks great, that's very clean code!

My only remaining concern is: can we remove the code for iterator equivalence (because it's not actually needed; we only need it.kind equivalence)?

rwgk · 2024-12-30T03:23:26Z

python/cuda_parallel/cuda/parallel/experimental/iterators/_iterators.py


 _DEVICE_POINTER_SIZE = 8
 _DEVICE_POINTER_BITWIDTH = _DEVICE_POINTER_SIZE * 8


+@lru_cache(maxsize=None)
+def _get_abi_suffix(kind: "IteratorKind"):


I'd move this after class IteratorKind: reads better, doesn't need the type hint in quotes.

rwgk · 2024-12-30T03:37:35Z

python/cuda_parallel/cuda/parallel/experimental/iterators/_iterators.py

+    def __hash__(self):
+        return hash((self.kind, self.cvalue.value, self.numba_type, self.value_type))
+
+    def __eq__(self, other):
+        if not isinstance(other, self.__class__):
+            return NotImplemented
+        return (
+            self.kind == other.kind
+            and self.cvalue.value == other.cvalue.value
+            and self.numba_type == other.numba_type
+            and self.value_type == other.value_type
+        )
+


Do we still need this? I ran your existing tests, and an extra test I wrote: it seems the caching functionality doesn't need this code. I believe it would be better to remove this code, unless we can carefully explain why we actually need two types of equivalence.

My extra test: rwgk@021a544

Done - I agree, we can worry about implementing equality for iterator types if/when the need arises. I removed this code and the associated tests in d017572

rwgk · 2024-12-30T03:42:54Z

python/cuda_parallel/tests/test_iterators.py

+    assert it1.kind == it2.kind == it4.kind
+
+    ary1 = cp.asarray([0, 1, 2])
+    ary2 = cp.asarray([3, 4, 5])


ary2 = cp.asarray([0, 1, 2])

We want to exercise corner cases: here we want to be sure we don't compare array elements (arrays may be very large).

We don't want to give a wrong impression to human readers.

Could you please clarify what change you would like to see here? Perhaps an example will help me understand.

To clarify, when we previously implemented equality for iterators, TransformIterator(ary1, op1) would indeed compare equal to TransformIterator(ary2, op2) when ary1.equals(ary2) == True. But recall that under the hood, ary1 and ary2 are converted into RawPointer objects, and that is what is held by the TransformIterator. No elementwise comparison of arrays was done to check for equality.

TL;DR: Just keep what you have.

Sorry I forgot to quote the suggested code before:

ary2 = cp.asarray([0, 1, 2])

This was so that it's immediately obvious that the array elements are not used when determining inequality here:

assert it5 != it9

But that code is gone now, and you only have == it8.kind left. With that it's more obvious that we have equality even if the array elements differ.

Ah, yes I understand now - thanks!

rwgk · 2024-12-30T03:43:47Z

python/cuda_parallel/tests/test_iterators.py

+
+
+def test_different_iterator_types_equality():
+    assert CountingIterator(np.int32(0)) != ConstantIterator(np.int64(0))


assert CountingIterator(np.int32(0)) != ConstantIterator(np.int32(0))

Similar to the above: corner case, wrong impression.

github-actions · 2024-12-30T12:57:59Z

🟩 CI finished in 23m 38s: Pass: 100%/1 | Total: 23m 38s | Avg: 23m 38s | Max: 23m 38s

rwgk

Looks great to me, thanks!

leofang · 2024-12-31T04:33:26Z

python/cuda_parallel/cuda/parallel/experimental/_utils/cai.py

+from ..typing import DeviceArrayLike
+
+
+def get_dtype(arr: DeviceArrayLike) -> np.dtype:


FYI this can be replaced by StridedMemoryView once cuda.core becomes a dependency.

leofang · 2024-12-31T04:37:28Z

python/cuda_parallel/cuda/parallel/experimental/_caching.py

+        @functools.wraps(func)
+        def inner(*args, **kwargs):
+            cc = cuda.get_current_device().compute_capability
+            cache_key = (key(*args, **kwargs), *cc)


Q: Do we have to unpack cc?

No, I suppose I don't really have a good reason for doing that. I will tack on this change in my next PR.
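For reference, the two spellings differ only in the shape of the resulting tuple; both are hashable and both distinguish devices. A small sketch with made-up values:

```python
inner_key = ("int32", "int32")  # whatever the key function returned
cc = (7, 0)                     # (major, minor) compute capability

unpacked = (inner_key, *cc)  # flat:   (("int32", "int32"), 7, 0)
nested = (inner_key, cc)     # nested: (("int32", "int32"), (7, 0))

# Either form works as a dictionary key and separates devices.
cache = {unpacked: "built for 7.0", (inner_key, 9, 0): "built for 9.0"}
```

The nested form keeps the compute capability as a single component, which is arguably easier to read when inspecting keys.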

leofang · 2024-12-31T04:40:34Z

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

 # TODO Figure out `sum` without operator and initial value
 # TODO Accept stream
-def reduce_into(d_in, d_out, op: Callable, h_init: np.ndarray):
+@cache_with_key(make_cache_key)


Maybe unrelated to this PR, just out of curiosity: what if d_in/d_out are non-contiguous 1D arrays? Do we handle the stride somewhere?

No - it's a good point. I believe we need these to be contiguous. Opened #3223 and will address this in a follow-up.
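One possible shape for such a check, sketched against the CUDA Array Interface, where missing or `None` strides mean a C-contiguous layout. This is not the fix tracked in #3223: the function name and `FakeDeviceArray` are hypothetical, and the item size is parsed from `typestr` under the assumption of a numeric dtype such as `<i4` or `<f8`.

```python
class FakeDeviceArray:
    """Hypothetical 1D array stand-in exposing __cuda_array_interface__."""

    def __init__(self, size, typestr, strides=None):
        self.__cuda_array_interface__ = {
            "shape": (size,),
            "typestr": typestr,
            "data": (0, False),
            "version": 3,
            "strides": strides,
        }


def is_contiguous_1d(arr):
    cai = arr.__cuda_array_interface__
    strides = cai.get("strides")
    if strides is None:
        return True  # per the interface: no strides means C-contiguous
    (length,) = cai["shape"]
    itemsize = int(cai["typestr"][2:])  # e.g. "<i4" -> 4 bytes (numeric types)
    # With zero or one element the stride is irrelevant.
    return length <= 1 or strides[0] == itemsize


dense = FakeDeviceArray(8, "<i4")
dense_explicit = FakeDeviceArray(8, "<i4", strides=(4,))
every_other = FakeDeviceArray(4, "<i4", strides=(8,))  # like arr[::2]
```

A caller could raise an error for inputs such as `every_other` until strided inputs are supported.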

leofang · 2024-12-31T04:42:42Z

python/cuda_parallel/cuda/parallel/experimental/iterators/_iterators.py


-def pointer(container, ntype: types.Type) -> RawPointer:
-    return RawPointer(container.__cuda_array_interface__["data"][0], ntype)
+def pointer(container, value_type: types.Type) -> RawPointer:


Note: This machinery can also be replaced by StridedMemoryView once cuda.core becomes a dependency.

github-actions · 2024-12-31T19:23:41Z

🟩 CI finished in 25m 02s: Pass: 100%/1 | Total: 25m 02s | Avg: 25m 02s | Max: 25m 02s

gevtushenko · 2024-12-31T22:51:25Z

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

+    def __init__(
+        self,
+        d_in: DeviceArrayLike | IteratorBase,
+        d_out: DeviceArrayLike,


remark: later on we'll have to support output iterators, like tabulate output iterator and transform output iterator

* Define __eq__ and __hash__ for Iterators
* Define cache_with_key utility and use it to cache Reduce objects
* Add tests for caching Reduce objects
* Tighten up types
* Updates to support 3.7
* Address review feedback
* Introduce IteratorKind to hold iterator type information
* Use the .kind to generate an abi_name
* Remove __eq__ and __hash__ methods from IteratorBase
* Move helper function
* Formatting
* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

implement `add_sat` split `signed`/`unsigned` implementation, improve implementation for MSVC improve device `add_sat` implementation add `add_sat` test improve generic `add_sat` implementation for signed types implement `sub_sat` allow more msvc intrinsics on x86 add op tests partially implement `mul_sat` implement `div_sat` and `saturate_cast` add `saturate_cast` test simplify `div_sat` test Deprectate C++11 and C++14 for libcu++ (#3173) * Deprectate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use the our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates supression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. 
Fix parameter space for TUNE_LOAD in scan benchmark (#3176) fix various old compiler checks (#3178) implement C++26 `std::projected` (#3175) Fix pre-commit config for codespell and remaining typos (#3182) Massive cleanup of our config (#3155) Fix UB in atomics with automatic storage (#2586) * Adds specialized local cuda atomics and injects them into most atomics paths. Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> * Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478 * Remove extraneous double brackets in unformatted code. * Merge unsafe atomic logic into `__cuda_is_local`. * Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument 
addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. 
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. 
(Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fix thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix
tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abstract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253) * Add
escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature.
Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 That is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add escape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER.
As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop CUB's util_compiler.cuh (#3302) All contained macros were deprecated Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement 
builtins for huge val, nan and nans * change `INFINITY` and `NAN` implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file.
Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment 
offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation updates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCCL_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable
on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered by min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. 
* Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class 
CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 move to c++17, finalize device optimization fix msvc compilation, update tests Deprecate C++11 and C++14 for libcu++ (#3173) * Deprecate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from
floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates suppression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. Fix parameter space for TUNE_LOAD in scan benchmark (#3176) fix various old compiler checks (#3178) implement C++26 `std::projected` (#3175) Fix pre-commit config for codespell and remaining typos (#3182) Massive cleanup of our config (#3155) Fix UB in atomics with automatic storage (#2586) * Adds specialized local cuda atomics and injects them into most atomics paths. Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> * Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478 * Remove extraneous double brackets in unformatted code. * Merge unsafe atomic logic into `__cuda_is_local`.
* Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. 
The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. 
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. 
(Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fx thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix 
tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abtract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253) * Add 
escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature. 
Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 Those is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add excape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. 
As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement builtins for huge val, nan and nans * change `INFINITY` and `NAN` 
implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. 
Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment 
offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable 
on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. 
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. 
* Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class 
CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. 
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 Fix CI issues (#3443) update docs fix review restrict allowed types replace constexpr implementations with generic optimize `__is_arithmetic_integral`

shwina added 3 commits December 21, 2024 08:21

Define __eq__ and __hash__ for Iterators

cb0eccc

Define cache_with_key utility and use it to cache Reduce objects

da652e1
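For readers skimming the timeline, here is a minimal sketch of what a `cache_with_key`-style decorator could look like. It is based only on the PR description (an in-memory cache whose key is computed by a user-supplied function) and is not the code merged in this PR. `dtype_key`, `build_reducer`, and `build_count` are hypothetical names used for illustration.

```python
import functools


def cache_with_key(key):
    """Cache a function's results in memory, keyed by key(*args, **kwargs)."""

    def decorator(func):
        cache = {}  # lives as long as the decorated function

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            cache_key = key(*args, **kwargs)
            if cache_key not in cache:
                cache[cache_key] = func(*args, **kwargs)
            return cache[cache_key]

        return wrapper

    return decorator


build_count = 0  # hypothetical counter, only to show when a build really happens


def dtype_key(dtype_name, data):
    # Hypothetical key function: only the dtype name matters, not the data.
    return dtype_name


@cache_with_key(dtype_key)
def build_reducer(dtype_name, data):
    # Stand-in for an expensive build step.
    global build_count
    build_count += 1
    return {"dtype": dtype_name}


first = build_reducer("int32", [1, 2, 3])
second = build_reducer("int32", [4, 5, 6])  # same key, so the cached object is reused
third = build_reducer("float64", [1.0])     # new key, so a new build happens
```

Keying on a derived value rather than on the raw arguments is what lets two calls with different arrays of the same dtype share one build object.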

Add tests for caching Reduce objects

221af5c
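The PR description says the reduction function is wrapped in a `CachableFunction` type whose `__eq__` and `__hash__` depend on bytecode and closure rather than on the function's name. The sketch below illustrates that idea under stated assumptions: the choice to compare `co_code`, `co_consts`, and closure cell contents is this sketch's own, closure contents are assumed to be hashable, and on some CPython versions a docstring is stored in `co_consts`, so this version does not guarantee docstring-insensitivity. `ToyCachableFunction` and `make_adder` are hypothetical names.

```python
class ToyCachableFunction:
    """Wrap a function so that equality depends on its code, not its name."""

    def __init__(self, func):
        self._func = func
        closure = func.__closure__ or ()
        # Assumption of this sketch: these three pieces identify the function.
        self._identity = (
            func.__code__.co_code,
            func.__code__.co_consts,
            tuple(cell.cell_contents for cell in closure),
        )

    def __eq__(self, other):
        if not isinstance(other, ToyCachableFunction):
            return NotImplemented
        return self._identity == other._identity

    def __hash__(self):
        return hash(self._identity)


def add(a, b):
    return a + b


def plus(x, y):  # same body as add, different function and argument names
    return x + y


def mul(a, b):
    return a * b


def make_adder(n):
    # The returned lambda captures n in its closure.
    return lambda x: x + n


same_body = ToyCachableFunction(add) == ToyCachableFunction(plus)
different_body = ToyCachableFunction(add) == ToyCachableFunction(mul)
same_closure = ToyCachableFunction(make_adder(1)) == ToyCachableFunction(make_adder(1))
different_closure = ToyCachableFunction(make_adder(1)) == ToyCachableFunction(make_adder(2))
```

Because equal wrappers also hash equally, a wrapper like this can serve as one component of a dictionary-based cache key.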

shwina marked this pull request as ready for review December 22, 2024 11:45

shwina requested a review from a team as a code owner December 22, 2024 11:45

shwina requested a review from rwgk December 22, 2024 11:45

Tighten up types

0eea142

shwina commented Dec 22, 2024

rwgk reviewed Dec 23, 2024

Updates to support 3.7

f198200

gevtushenko reviewed Dec 23, 2024

Address review feedback

56f2c61

Introduce IteratorKind to hold iterator type information

c822da7
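As a rough illustration of why a separate "kind" object is convenient for caching, the sketch below uses a small frozen dataclass that records only type information, so two iterators over the same value type share a kind regardless of their runtime state. This is an assumption-laden toy, not the `IteratorKind` introduced in this PR: `ToyIteratorKind`, `ToyCountingIterator`, and `ToyConstantIterator` are hypothetical, and value types are plain strings here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyIteratorKind:
    """Hashable description of an iterator's type, independent of its state."""

    iterator_name: str
    value_type: str


class ToyCountingIterator:
    def __init__(self, start, value_type):
        self.start = start  # runtime state, deliberately not part of the kind
        self.kind = ToyIteratorKind("counting", value_type)


class ToyConstantIterator:
    def __init__(self, value, value_type):
        self.value = value
        self.kind = ToyIteratorKind("constant", value_type)


# Same iterator class and value type: kinds compare equal even if state differs.
same_kind = ToyCountingIterator(0, "int32").kind == ToyCountingIterator(7, "int32").kind

# Different iterator class or different value type: kinds differ.
different_class = ToyCountingIterator(0, "int32").kind == ToyConstantIterator(0, "int32").kind
different_type = ToyCountingIterator(0, "int32").kind == ToyCountingIterator(0, "int64").kind

# Frozen dataclasses are hashable, so a kind can be used directly as a cache key.
cache = {ToyCountingIterator(0, "int32").kind: "build for counting/int32"}
hit = cache.get(ToyCountingIterator(100, "int32").kind)
```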

shwina marked this pull request as draft December 23, 2024 18:15

shwina marked this pull request as ready for review December 24, 2024 14:45

shwina force-pushed the cache-reducer-in-memory branch from 8950564 to 86c58fb Compare December 24, 2024 14:50

Use the .kind to generate an abi_name

58b6f69

shwina force-pushed the cache-reducer-in-memory branch from 86c58fb to 58b6f69 Compare December 24, 2024 14:56

shwina requested review from gevtushenko and rwgk December 24, 2024 15:00

rwgk reviewed Dec 30, 2024

Remove __eq__ and __hash__ methods from IteratorBase

d017572

Move helper function

68235ee

Formatting

f892265

rwgk approved these changes Dec 31, 2024

leofang reviewed Dec 31, 2024

shwina mentioned this pull request Dec 31, 2024

[FEA]: cuda.parallel - forbid (or handle) non-contiguous arrays as inputs to algorithms #3223

Closed

1 task

Don't unpack tuple in cache key

5e3d95a

leofang approved these changes Dec 31, 2024

gevtushenko approved these changes Dec 31, 2024

shwina merged commit 0e37b11 into NVIDIA:main Jan 1, 2025
20 checks passed

		@@ -125,10 +162,11 @@ def impl(ptr, offset):
		class RawPointer(IteratorBase):
		    def __init__(self, ptr: int, ntype: types.Type):



		def test_different_iterator_types_equality():
		    assert CountingIterator(np.int32(0)) != ConstantIterator(np.int64(0))

		from ..typing import DeviceArrayLike


		def get_dtype(arr: DeviceArrayLike) -> np.dtype:

github-actions bot commented Dec 22, 2024

🟩 python: Pass: 100%/1 | Total: 23m 41s | Avg: 23m 41s | Max: 23m 41s

github-actions bot commented Dec 23, 2024

🟩 python: Pass: 100%/1 | Total: 23m 11s | Avg: 23m 11s | Max: 23m 11s

github-actions bot commented Dec 23, 2024

🟩 python: Pass: 100%/1 | Total: 24m 15s | Avg: 24m 15s | Max: 24m 15s

github-actions bot commented Dec 23, 2024

🟩 python: Pass: 100%/1 | Total: 23m 34s | Avg: 23m 34s | Max: 23m 34s

github-actions bot commented Dec 24, 2024

🟩 python: Pass: 100%/1 | Total: 23m 51s | Avg: 23m 51s | Max: 23m 51s
