Add asynchronous concurrent execution #3687
base: docs/develop
Conversation
Force-pushed from 1484d67 to f81588d
Force-pushed from fd5af51 to 6a139c6
left comments. Looks good overall.
Force-pushed from 495e166 to 9835194
Force-pushed from ed7e05f to 7cb3237
Concurrent execution between the host (CPU) and device (GPU) allows the CPU to
perform other tasks while the GPU is executing kernels. Kernels can be launched
asynchronously using ``hipLaunchKernelDefault`` with a stream, enabling the CPU
What is hipLaunchKernelDefault? Where is it defined?
Fixed this sentence.
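For context on the reviewer's question: ``hipLaunchKernelDefault`` does not appear in the HIP runtime API; the documented ways to launch a kernel asynchronously on a stream are the triple-chevron syntax and ``hipLaunchKernelGGL``. A minimal sketch (error handling elided; the ``fill`` kernel and the sizes are placeholders, not from the PR):

```cpp
#include <hip/hip_runtime.h>

__global__ void fill(float* out, float value, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main() {
    constexpr size_t n = 1 << 20;
    float* d_out = nullptr;
    hipMalloc(&d_out, n * sizeof(float));

    hipStream_t stream;
    hipStreamCreate(&stream);

    // The launch returns immediately; the CPU is free to do other work
    // while the kernel executes on the given stream.
    hipLaunchKernelGGL(fill, dim3((n + 255) / 256), dim3(256), 0, stream,
                       d_out, 1.0f, n);

    // ... other host-side work here ...

    hipStreamSynchronize(stream);  // block only once the result is needed
    hipStreamDestroy(stream);
    hipFree(d_out);
}
```

This requires a ROCm-capable device to run.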
and shared memory for the kernels. To enable concurrent kernel executions, the
developer may have to reduce the block size of the kernels. The kernel runtimes
can be misleading for concurrent kernel runs, that is why during optimization
it is a good practice to check the trace files, to see if one kernel is blocking another
Can we clarify what we mean by tracing here.
User might confuse it with ltrace etc. Also can we point user to the document which helps them trace so that they do not have to search for it.
I am linking to the rocprof documentation.
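As an aside, a typical rocprof invocation for this kind of check (assuming the application binary is ``./app``; see the rocprof documentation for the full option list) looks like:

```shell
# Collect a HIP API/kernel trace plus per-kernel statistics.
# The generated trace can be opened in a timeline viewer to see
# whether kernels actually overlap or block one another.
rocprof --hip-trace --stats ./app
```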
utilization and improved performance.

Asynchronous execution is particularly advantageous in iterative processes. For
instance, if an iteration calculation is initiated, it can be efficient to
Not sure what you mean by
iteration calculation is initiated
Can we provide an example here.
Added example link, plus rephrased.
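One hypothetical shape for such an iterative overlap (the ``step`` kernel and names are illustrative, not the PR's example): launch iteration ``it`` asynchronously, and let the host do useful work while it runs.

```cpp
#include <hip/hip_runtime.h>

__global__ void step(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // placeholder per-iteration update
}

void runIterations(float* d_data, size_t n, int iterations,
                   hipStream_t stream) {
    for (int it = 0; it < iterations; ++it) {
        // Asynchronous launch: returns immediately to the host.
        hipLaunchKernelGGL(step, dim3((n + 255) / 256), dim3(256), 0, stream,
                           d_data, n);
        // While iteration `it` runs on the GPU, the host can log progress,
        // compute parameters, or stage data for iteration it + 1.
    }
    hipStreamSynchronize(stream);  // wait once, after the last iteration
}
```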
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
maybe add an exit or std::abort here if check fails.
I would avoid the exit usage. We mentioned this at the error handling page:
https://rocm.docs.amd.com/projects/HIP/en/docs-develop/how-to/hip_runtime_api/error_handling.html#hip-check-macros
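For context, the quoted fragment fits a macro along these lines (a sketch consistent with the linked error handling page, which reports the error without terminating the process):

```cpp
#include <hip/hip_runtime.h>
#include <iostream>

// Report HIP errors without calling exit()/std::abort(), matching the
// guidance on the HIP error handling page.
#define HIP_CHECK(expression)                        \
    {                                                \
        const hipError_t status = expression;        \
        if (status != hipSuccess) {                  \
            std::cerr << "HIP error "                \
                      << status << ": "              \
                      << hipGetErrorString(status)   \
                      << " at " << __FILE__ << ":"   \
                      << __LINE__ << std::endl;      \
        }                                            \
    }
```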
constexpr int numOfBlocks = 256;
constexpr int threadsPerBlock = 4096;
constexpr int numberOfIterations = 50;
size_t arraySize = 1U << 20;
might as well make this constexpr too
Made the requested changes.
}

// Wait for all operations to complete
HIP_CHECK(hipDeviceSynchronize());
Can we do some sort of validation here. All we did was see if kernels got executed and user did not get any error.
This should help user get confidence that the output is same regardless of sync or async.
Made the requested changes.
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
Same comments as sync variant
Made the requested changes.
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
Same comments as sync variant
Made the requested changes.
Overall LGTM, some minor comments
those streams might be fed from multiple concurrent host-side threads. Execution
on multiple streams might be concurrent but isn't required to be.
A bit vague for quite an important detail
Suggested change:
- those streams might be fed from multiple concurrent host-side threads. Execution
- on multiple streams might be concurrent but isn't required to be.
+ those streams might be fed from multiple concurrent host-side threads. Multiple streams
+ tied to the same device are not guaranteed to execute their commands in order.
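To make the "not guaranteed" point concrete, here is a hedged sketch: when ordering between two streams is required, it must be expressed explicitly, for example with an event (the ``producer``/``consumer`` kernels are placeholders):

```cpp
#include <hip/hip_runtime.h>

__global__ void producer(float* buf, size_t n) { /* ... fill buf ... */ }
__global__ void consumer(float* buf, size_t n) { /* ... read buf ... */ }

void orderedAcrossStreams(float* d_buf, size_t n) {
    hipStream_t s1, s2;
    hipEvent_t done;
    hipStreamCreate(&s1);
    hipStreamCreate(&s2);
    hipEventCreate(&done);

    hipLaunchKernelGGL(producer, dim3(256), dim3(256), 0, s1, d_buf, n);
    hipEventRecord(done, s1);          // mark completion of the producer
    hipStreamWaitEvent(s2, done, 0);   // without this, s2 has no ordering
    hipLaunchKernelGGL(consumer, dim3(256), dim3(256), 0, s2, d_buf, n);

    hipStreamSynchronize(s2);
    hipEventDestroy(done);
    hipStreamDestroy(s1);
    hipStreamDestroy(s2);
}
```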
Streams enable the overlap of computation and data transfer, ensuring
continuous GPU activity.
[NIT] Seems vague and out of place in.
Asynchronous memory operations allow data to be transferred between the host
and device while kernels are being executed on the GPU. Using operations like
I think this sentence is a bit misleading. A reader could miss the fact that this operation must be on a different stream to get this behavior. Asynchronous memory operations do not block the host while copying this data. Suggestion: "Asynchronous memory operations on multiple streams allow data to be transferred between the host and device while kernels are executed (and do not block the host while copying this data)."
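A hedged sketch of that multi-stream overlap (pinned host memory is required for the copy to actually be asynchronous; the ``compute`` kernel and names are illustrative):

```cpp
#include <hip/hip_runtime.h>

__global__ void compute(float* data, size_t n) { /* ... kernel work ... */ }

void overlapCopyAndCompute(size_t n) {
    float *h_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    hipHostMalloc(&h_in, n * sizeof(float));  // pinned host memory
    hipMalloc(&d_a, n * sizeof(float));
    hipMalloc(&d_b, n * sizeof(float));

    hipStream_t copyStream, computeStream;
    hipStreamCreate(&copyStream);
    hipStreamCreate(&computeStream);

    // The copy on copyStream can overlap the kernel on computeStream;
    // on a single stream, the two would simply run back to back.
    hipMemcpyAsync(d_a, h_in, n * sizeof(float), hipMemcpyHostToDevice,
                   copyStream);
    hipLaunchKernelGGL(compute, dim3(256), dim3(256), 0, computeStream,
                       d_b, n);

    hipStreamSynchronize(copyStream);
    hipStreamSynchronize(computeStream);

    hipStreamDestroy(copyStream);
    hipStreamDestroy(computeStream);
    hipFree(d_a); hipFree(d_b); hipHostFree(h_in);
}
```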
One of the primary benefits of asynchronous operations is the ability to
overlap data transfer with kernel execution, leading to better resource
utilization and improved performance.
[NIT] you could clarify that multiple streams are needed to copy while executing a kernel in parallel.
another. This technique is especially useful in applications with large data
sets that need to be processed quickly.

Concurrent data transfers
How does this differ from the Asynchronous memory operations section? This feels repetitive, and it is not clear how you wish to distinguish between concurrent and asynchronous within this context.
Co-authored-by: AidanBeltonS <[email protected]>
No description provided.