
Add asynchronous concurrent execution #3687

Open · wants to merge 11 commits into docs/develop from async-doc

Conversation

matyas-streamhpc

No description provided.

@matyas-streamhpc matyas-streamhpc self-assigned this Nov 25, 2024
@neon60 neon60 force-pushed the async-doc branch 2 times, most recently from 1484d67 to f81588d Compare December 2, 2024 08:46
@neon60 neon60 marked this pull request as ready for review December 2, 2024 08:53
@neon60 neon60 force-pushed the async-doc branch 4 times, most recently from fd5af51 to 6a139c6 Compare December 6, 2024 18:18

@randyh62 randyh62 left a comment


Left comments. Looks good overall.

docs/how-to/hip_runtime_api/asynchronous.rst: ten resolved review threads (nine marked outdated)

Concurrent execution between the host (CPU) and device (GPU) allows the CPU to
perform other tasks while the GPU is executing kernels. Kernels can be launched
asynchronously using ``hipLaunchKernelDefault`` with a stream, enabling the CPU
Contributor

What is hipLaunchKernelDefault? Where is it defined?

Contributor

Fixed this sentence.

and shared memory for the kernels. To enable concurrent kernel executions, the
developer may have to reduce the block size of the kernels. The kernel runtimes
can be misleading for concurrent kernel runs, that is why during optimization
it is a good practice to check the trace files, to see if one kernel is blocking another
Contributor

Can we clarify what we mean by tracing here?
Users might confuse it with ltrace, etc. Also, can we point the user to the document that helps them trace, so that they do not have to search for it?

Contributor

I am linking in the rocprof documentation.

utilization and improved performance.

Asynchronous execution is particularly advantageous in iterative processes. For
instance, if an iteration calculation is initiated, it can be efficient to
Contributor

Not sure what you mean by

iteration calculation is initiated

Can we provide an example here?

Contributor

Added example link, plus rephrased.

<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
Contributor

Maybe add an exit or std::abort here if the check fails.

Contributor

I would avoid using exit. We mention this on the error handling page:
https://rocm.docs.amd.com/projects/HIP/en/docs-develop/how-to/hip_runtime_api/error_handling.html#hip-check-macros

constexpr int numOfBlocks = 256;
constexpr int threadsPerBlock = 4096;
constexpr int numberOfIterations = 50;
size_t arraySize = 1U << 20;
Contributor

might as well make this constexpr too

Contributor

Made the requested changes.

}

// Wait for all operations to complete
HIP_CHECK(hipDeviceSynchronize());
Contributor

Can we do some sort of validation here? All we did was check that the kernels executed and that the user did not get an error.

This should help the user gain confidence that the output is the same regardless of sync or async.

Contributor

Made the requested changes.

<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
Contributor

Same comments as sync variant

Contributor

Made the requested changes.

<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
Contributor

Same comments as sync variant

Contributor

Made the requested changes.

@neon60 neon60 requested a review from cjatin January 10, 2025 11:44

@AidanBeltonS AidanBeltonS left a comment


Overall LGTM, some minor comments

Comment on lines +25 to +26
those streams might be fed from multiple concurrent host-side threads. Execution
on multiple streams might be concurrent but isn't required to be.


A bit vague for quite an important detail

Suggested change
those streams might be fed from multiple concurrent host-side threads. Execution
on multiple streams might be concurrent but isn't required to be.
those streams might be fed from multiple concurrent host-side threads. Multiple streams
tied to the same device are not guaranteed to execute their commands in order.

Comment on lines +31 to +32
Streams enable the overlap of computation and data transfer, ensuring
continuous GPU activity.


[NIT] Seems vague and out of place.

docs/how-to/hip_runtime_api/asynchronous.rst: two more resolved review threads (outdated)
Comment on lines +127 to +128
Asynchronous memory operations allow data to be transferred between the host
and device while kernels are being executed on the GPU. Using operations like

@AidanBeltonS AidanBeltonS Jan 10, 2025


I think this sentence is a bit misleading. A reader could miss the fact that these operations must be on different streams to get this behavior. Suggested wording:

Asynchronous memory operations on multiple streams allow data to be transferred between the host and device while kernels are executed (and do not block the host while copying this data).

Comment on lines +102 to +104
One of the primary benefits of asynchronous operations is the ability to
overlap data transfer with kernel execution, leading to better resource
utilization and improved performance.


[NIT] You could clarify that multiple streams are needed to copy while executing a kernel in parallel.

another. This technique is especially useful in applications with large data
sets that need to be processed quickly.

Concurrent data transfers


How does this differ from the Asynchronous memory operations section?
It feels repetitive, and it is not clear how you wish to distinguish between concurrent and asynchronous in this context.

5 participants