cuda.parallel: Add optional stream argument to reduce_into() #3348
Conversation
🟩 CI finished in 24m 14s: Pass: 100%/1 | Total: 24m 14s | Avg: 24m 14s | Max: 24m 14s

Modified projects: +/- python (CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental, CCCL C Parallel Library, and Catch2Helper unchanged)

Runner counts (total jobs: 1): 1 × linux-amd64-gpu-v100-latest-1
Looking good! A few minor suggestions
```python
if not hasattr(stream, "__cuda_stream__"):
    raise TypeError(
        f"stream argument {stream} does not implement the '__cuda_stream__' protocol"
    )

stream_property = stream.__cuda_stream__
```
In general, EAFP is more idiomatic and, as discussed in NVIDIA/cuda-python#348, much faster. So instead of:

```python
if not hasattr(obj, "attr"):
    raise TypeError(...)
attr = obj.attr
```

Prefer:

```python
try:
    attr = obj.attr
except AttributeError:
    raise TypeError(...)
```

As a rule of thumb, we should avoid `hasattr` in particular, as it can be very slow.

(side note: IIRC there's at least one other place in the codebase where we're using `hasattr`, and likely that needs to be changed as well)
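Concretely, the check above might become something like this (a sketch reusing the names from the snippet):

```python
try:
    stream_property = stream.__cuda_stream__
except AttributeError:
    raise TypeError(
        f"stream argument {stream} does not implement the '__cuda_stream__' protocol"
    )
```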
```python
if (
    isinstance(stream_property, tuple)
    and len(stream_property) == 2
    and all(isinstance(i, int) for i in stream_property)
):
    version, handle = stream_property
    return handle

raise TypeError(
    f"__cuda_stream__ property of '{stream}' must return a 'Tuple[int, int]'; got {stream_property} instead"
)
```
nit: Personally, I'd be a bit more lax here and really only ensure that `handle` is an int. If someone returns a `list` rather than a `tuple`, that's probably not relevant to us:

```python
_, handle = stream_property  # (version, handle)
if not isinstance(handle, int):
    raise TypeError(...)
return handle
```
We should also check that `version` is 0, because we could change it to, say, a 3-tuple in a future version of the protocol.
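A minimal sketch of that check, assuming the current 2-tuple form corresponds to version 0:

```python
version = stream_property[0]
if version != 0:
    # a future protocol version might return, say, a 3-tuple
    raise TypeError(f"unsupported __cuda_stream__ version: {version}")
_, handle = stream_property
return handle
```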
```diff
@@ -46,6 +46,30 @@ def _dtype_validation(dt1, dt2):
         raise TypeError(f"dtype mismatch: __init__={dt1}, __call__={dt2}")


+def _validate_and_get_stream(stream) -> Optional[int]:
```
In terms of where this function should live, here are a couple of suggestions:
- Move it to `_utils/stream.py`
- Rename `_utils/cai.py` to `_utils/protocols.py` and move it there (this module could be general utilities for working with protocol objects like `__cuda_array_interface__` and `__cuda_stream__`)
```python
reduce_into(d_temp_storage, d_in, d_out, d_in.size, h_init, stream=stream_wrapper)
np.testing.assert_allclose(d_in.sum().get(), d_out.get())
```
We should call `stream.synchronize()` after the call to `reduce_into`. Perhaps our wrapper `Stream` type should have a `.synchronize()` method that calls `self.cupy_stream.synchronize()`.
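A minimal sketch of such a wrapper, assuming it holds a CuPy stream in a `cupy_stream` attribute as described above:

```python
class Stream:
    """Hypothetical test wrapper exposing __cuda_stream__ around a CuPy stream."""

    def __init__(self, cupy_stream):
        self.cupy_stream = cupy_stream

    @property
    def __cuda_stream__(self):
        # (protocol version, raw stream handle)
        return (0, self.cupy_stream.ptr)

    def synchronize(self):
        self.cupy_stream.synchronize()
```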
Wearing my CuPy hat: just call

```python
cupy.asnumpy(..., stream=stream, blocking=True)
```

or

```python
with stream:
    cupy.asnumpy(..., blocking=True)
```

to perform a stream-ordered, blocking copy to host. No need to add `.synchronize()` this way.
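Applied to the test above, that might look like this (a sketch; `stream_wrapper.cupy_stream` is an assumed attribute of the wrapper discussed earlier, and the `blocking` keyword requires a recent CuPy):

```python
expected = cupy.asnumpy(d_in.sum(), stream=stream_wrapper.cupy_stream, blocking=True)
actual = cupy.asnumpy(d_out, stream=stream_wrapper.cupy_stream, blocking=True)
np.testing.assert_allclose(expected, actual)
```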
Even better: call `cp.testing.assert_allclose(...)`; then there's no need to copy, since it does the copy internally for us. We just need to stream-order it:

```python
with stream:
    cp.testing.assert_allclose(...)
```
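For the test above, that could reduce to (a sketch under the same `stream_wrapper` assumption):

```python
with stream_wrapper.cupy_stream:
    cp.testing.assert_allclose(d_in.sum(), d_out)
```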
> it does the copy internally for us, just need to stream-order it

In general, is it recommended that we rely on this?
Yes, it's public API.
```python
# null stream is allowed
if stream is None:
    return None
```
So I think this would be a common source of bugs. The first naive question: would `bindings.cccl_device_reduce()` take `None` for the stream argument? But more importantly, I expect that by the time of integration (with CuPy & co.) we must take an explicit stream, e.g.

```python
reduce_into(..., stream=cp.cuda.get_current_stream())
```

in order to preserve the respective library's stream ordering. If that's the case, we should probably explicitly forbid `stream=None` (as we do in `cuda.core`, btw) so as to avoid mistakes where someone integrating the code forgets to fetch the library stream and pass it to cuda.parallel.
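For illustration, forbidding the default could look like this (a hypothetical sketch, the opposite of what this PR currently does):

```python
def _validate_and_get_stream(stream) -> int:
    if stream is None:
        raise TypeError(
            "stream must be passed explicitly; fetch it from the producing "
            "library, e.g. cp.cuda.get_current_stream()"
        )
    ...
```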
FYI, in nvmath-python `stream=None` is used to mean "look up the input arrays' library and take the library's current stream (or whatever default it uses)," which has the same semantics as the snippet above, but is achieved in a much more complex and laborious way that we hope to simplify with `__cuda_stream__`.
> but is achieved in a much more complex and laborious way that we hope to simplify with `__cuda_stream__`

Could you expand on this a bit? How would `__cuda_stream__` help here?
I guess I see what you're asking. For interpreting a provided stream, the protocol will help. For understanding the stream semantics of each array library (e.g., does it have the notion of a current stream?), the protocol would not help.
> The first naive question: would `bindings.cccl_device_reduce()` take `None` for the stream argument?

Yes - this would match the API of the corresponding C++ function, where the `stream` argument is explicitly optional and defaults to `0`.

> we should probably explicitly forbid `stream=None` (as we do in `cuda.core`, btw) so as to avoid mistakes where someone integrating the code forgets to fetch the library stream and pass it to cuda.parallel.

Please correct me if I'm wrong, but my understanding is that your concern is about the use of APIs like CuPy's `Stream` or PyTorch's `stream`, which are both context managers and set a library-specific "current stream":

```python
with torch.cuda.stream() as s:
    torch.some_function(...)  # no need to pass a stream explicitly, uses `s` implicitly
```

The above works great as long as I'm only using PyTorch, but not if I want to combine PyTorch with, e.g., CuPy:

```python
with torch.cuda.stream() as s:
    torch.some_function(...)  # no need to pass a stream explicitly, uses `s` implicitly
    # need to pass the stream explicitly, as CuPy doesn't know about
    # PyTorch's "current stream":
    cupy.some_other_function(..., s)
```

I certainly agree that what you're describing is a concern, but I don't feel that the API decisions of downstream libraries like CuPy or PyTorch should influence the APIs of upstream libraries like `cuda.core` or `cuda.parallel`.

The default stream is a reasonable default and 'just works' across the ecosystem for the majority of users who don't necessarily want to use CUDA streams. I would prefer we keep things easy for them.

If someone opts in to using streams, then I think it's fine to require that they take additional care to pass streams appropriately to the various functions they use across libraries (which is something they need to do already).
In both `cuda.core` and `cudax` in this repo, every place that could take a stream (e.g., `launch()`) requires an explicit stream. If a user wants to use the default stream (legacy or PTDS), they can do so by passing it explicitly (e.g., with `cuda.core` it'd be `stream=Device().default_stream`). All of our "modern" designs have been gearing toward explicitness, and the default stream in "old CUDA" is nothing but a named stream, no more special (apart from all of the usual caveats) than other user-created streams.

One simple example where a default-stream choice could fail miserably:

```python
with stream as s:
    arr = cp.random.random(...)
    reduce_into(..., d_in=arr, ...)  # forgot to pass s here for whatever reason
```

It is obvious that the input array creation and the reduction are not ordered properly. I don't want us to honor any library's stream context like nvmath-python did (perhaps other than `cuda.core`'s, once we decide to do it there), but I want the above snippet to raise an error so that users know they need to do this:

```python
with stream as s:
    arr = cp.random.random(...)
    reduce_into(..., d_in=arr, ..., stream=s)
```

That is, I was suggesting that `cuda.parallel` always require a stream explicitly.

But before we spend more time discussing this choice, it'd be nice to know whether the current APIs are final (i.e. end-user facing). I was under the impression that they are not, so perhaps this discussion is not that relevant/urgent as far as this PR is concerned?
> But before we spend more time discussing this choice, it'd be nice to know whether the current APIs are final

The current API is meant to be low-level. We've aligned on making this low-level API as close to the underlying Thrust/CUB APIs as possible.

> so perhaps this discussion is not that relevant/urgent as far as this PR is concerned?

Yeah, I agree - let's revisit when we're designing the higher-level "user-facing" API. For this PR, I'd prefer that we be more conservative and not commit to requiring streams (just like the underlying C++ API).
```diff
@@ -85,7 +109,9 @@ def __init__(
         if error != enums.CUDA_SUCCESS:
             raise ValueError("Error building reduce")

-    def __call__(self, temp_storage, d_in, d_out, num_items: int, h_init: np.ndarray):
+    def __call__(
+        self, temp_storage, d_in, d_out, num_items: int, h_init: np.ndarray, stream=None
```
I don't have a strong opinion on this, just FYI: in some places in `cuda.core` we have the following signature:

```python
def func(..., *, stream):
```

so that we enforce `stream` to be passed as a keyword argument, but with no default value. The consideration back then was to ensure users pass a stream explicitly.
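A short sketch of that pattern (the function name is hypothetical):

```python
def launch_reduce(temp_storage, d_in, d_out, num_items, h_init, *, stream):
    ...

# launch_reduce(ts, a, b, n, init, s)           # TypeError: stream is keyword-only
# launch_reduce(ts, a, b, n, init)              # TypeError: stream is required
# launch_reduce(ts, a, b, n, init, stream=s)    # OK
```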
Description

Closes #3080

This PR adds an optional `stream` argument to `cuda.parallel`'s `reduce_into()`. The default value of this argument is `None`, which replicates the original behavior prior to this PR, so existing code is unaffected. The stream object must either be `None` (i.e. the default stream) or implement the `__cuda_stream__` protocol. Passing in an object that either doesn't implement the protocol or implements it incorrectly results in an error.

Also included are two tests: one with a `cupy` stream and one for invalid streams. I have not added documentation yet because I am unsure where to add it.

Checklist