RFC for histogram CPU implementation #1930
Signed-off-by: Dan Hoeflinger <[email protected]>
more formatting fixes Signed-off-by: Dan Hoeflinger <[email protected]>
Overall, this all sounds good enough for the "proposed" stage, where it's expected that some details are unknown and need to be determined. I am happy to approve it but will wait for a few days in case @danhoeflinger wants to update the document with some follow-up thoughts on the discussion.
One question I have for the group... From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and
I agree that we should honor the user's request for a specific policy rather than use the serial implementation up to some empirically determined cutoff point. I also imagine that the exact cutoff where the parallel version performs better can vary widely depending on a user's hardware setup. Giving users the freedom to choose when to switch from the serial to the parallel version may result in better performance than any generic decision we could make.
I've taken another pass through the document. I have a single question regarding how technical we want to get when explaining the algorithm.
The RFC looks ready to me.
I believe yes, we can. Generally, a serial implementation is correct for parallel execution policies, so it's more of a QoI question whether to parallelize or not. While the policies do not mean "provide the fastest version", they do not mean "always use multiple threads" either. It's a permission to use multiple threads, but not an obligation.
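To make the "permission, not obligation" point concrete, here is a minimal sketch of a size-based serial fallback. Everything in it is hypothetical: the names, the cutoff value `kSerialCutoff`, and the chunked path, which stands in for a parallel backend and runs its chunks in a plain loop so that the sketch needs no threading library. It is not the oneDPL implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cutoff; a real value would have to be measured per platform.
constexpr std::size_t kSerialCutoff = 1024;

// Serial histogram over precomputed bin indices in [first, last).
inline void histogram_serial(const std::vector<std::size_t>& bin_ids,
                             std::size_t first, std::size_t last,
                             std::vector<std::uint64_t>& bins) {
    for (std::size_t i = first; i < last; ++i) {
        assert(bin_ids[i] < bins.size());
        ++bins[bin_ids[i]];
    }
}

// Stand-in for the parallel path: every chunk fills a private histogram,
// then the private histograms are summed. A real backend would give each
// chunk to a different thread; here the chunks simply run one after another.
inline void histogram_chunked(const std::vector<std::size_t>& bin_ids,
                              std::size_t num_chunks,
                              std::vector<std::uint64_t>& bins) {
    assert(num_chunks > 0);
    const std::size_t n = bin_ids.size();
    const std::size_t chunk = (n + num_chunks - 1) / num_chunks;
    std::vector<std::vector<std::uint64_t>> local(
        num_chunks, std::vector<std::uint64_t>(bins.size(), 0));
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t first = std::min(c * chunk, n);
        const std::size_t last = std::min(first + chunk, n);
        histogram_serial(bin_ids, first, last, local[c]);
    }
    for (const auto& partial : local)
        for (std::size_t b = 0; b < bins.size(); ++b)
            bins[b] += partial[b];
}

// Uses the serial path for small inputs even if more chunks were requested.
// Returns true when the serial path was taken.
inline bool histogram_with_fallback(const std::vector<std::size_t>& bin_ids,
                                    std::size_t num_chunks,
                                    std::vector<std::uint64_t>& bins) {
    if (bin_ids.size() < kSerialCutoff || num_chunks <= 1) {
        histogram_serial(bin_ids, 0, bin_ids.size(), bins);
        return true;
    }
    histogram_chunked(bin_ids, num_chunks, bins);
    return false;
}
```

Both paths produce the same counts, which is why the choice between them is a quality-of-implementation matter rather than a correctness one.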
### SIMD/openMP SIMD Implementation
Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across
loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is
The second sentence may be omitted.
Based on the first sentence we can conclude that "OneDPL does not directly use any intrinsics..."
applied.
increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as
each input has the same mathematical operations applied to each. However, for the custom range API, each input element
uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a
different length and control flow based upon each input element and will be very difficult to vectorize.
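The contrast between the two bin-selection schemes can be sketched as follows. The function names, the use of `double`, the half-open `[lo, hi)` bin convention, and the `-1` return value for out-of-range inputs are all assumptions made for illustration; they are not taken from the RFC.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Even bins: every element goes through the same arithmetic, with no
// data-dependent loop, which is what leaves room for vectorization.
// Returns -1 when value lies outside [lo, hi).
inline std::ptrdiff_t even_bin(double value, double lo, double hi,
                               std::size_t num_bins) {
    assert(num_bins > 0 && lo < hi);
    if (value < lo || value >= hi) return -1;
    const auto idx =
        static_cast<std::size_t>((value - lo) / (hi - lo) * num_bins);
    // Guard against floating-point rounding pushing idx to num_bins.
    return static_cast<std::ptrdiff_t>(std::min(idx, num_bins - 1));
}

// Custom ranges: a binary search over sorted boundaries. The path taken
// through the search depends on the value, so control flow differs from
// element to element.
// Returns -1 when value lies outside [front, back).
inline std::ptrdiff_t custom_bin(double value,
                                 const std::vector<double>& boundaries) {
    assert(boundaries.size() >= 2);
    if (value < boundaries.front() || value >= boundaries.back()) return -1;
    const auto it =
        std::upper_bound(boundaries.begin(), boundaries.end(), value);
    return (it - boundaries.begin()) - 1;
}
```

With boundaries `{0, 1, 5, 10}`, for instance, `custom_bin(3.0, ...)` lands in bin 1 because `std::upper_bound` stops at the boundary `5`.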
But we can calculate the bin indexes for the input data in a SIMD manner.
After that, we can process the result in a serial loop.
No?
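The two-pass idea raised in this comment could look roughly like the sketch below for the even-bin case. The function name, the `float` element type, the temporary index buffer, and the `-1` marker for out-of-range inputs are assumptions for illustration. Whether a compiler actually vectorizes the first loop is not guaranteed, and without OpenMP enabled the pragma is simply ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline std::vector<std::uint64_t> histogram_even_two_pass(
    const std::vector<float>& input, float lo, float hi,
    std::size_t num_bins) {
    assert(num_bins > 0 && lo < hi);
    const std::size_t n = input.size();
    // Multiplying by a precomputed scale avoids a division per element;
    // results exactly on a bin edge may round differently than a division.
    const float scale = static_cast<float>(num_bins) / (hi - lo);
    const std::int32_t last_bin = static_cast<std::int32_t>(num_bins) - 1;

    std::vector<std::int32_t> ids(n);
    const float* in = input.data();
    std::int32_t* out = ids.data();

    // Pass 1: branch-free index computation, identical for every element.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        const float v = in[i];
        const bool in_range = (v >= lo) && (v < hi);
        // Substitute lo for out-of-range (or NaN) values so the cast is safe.
        const float safe = in_range ? v : lo;
        std::int32_t id = static_cast<std::int32_t>((safe - lo) * scale);
        id = id > last_bin ? last_bin : id;
        out[i] = in_range ? id : -1;
    }

    // Pass 2: serial accumulation, since several elements may hit one bin.
    std::vector<std::uint64_t> bins(num_bins, 0);
    for (std::size_t i = 0; i < n; ++i) {
        if (ids[i] >= 0) ++bins[static_cast<std::size_t>(ids[i])];
    }
    return bins;
}
```

The cost of this layout is an extra buffer of `n` indices and a second sweep over memory, which is one reason the benefit is uncertain.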
This applies only to the even-binned case. Without using intrinsic operations, we must do this with omp simd and the ordered structured block. Initial investigation suggested that this did not produce vectorized code, and my suspicion is that it would not really help anyway. I can revisit this and attempt it, but the intention for now is to omit vectorization from this first phase.
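For reference, the single-loop variant using the OpenMP `ordered simd` construct might be written as in this sketch. The function name and types are hypothetical, and the sketch only shows the shape of the code. It makes no claim that a given compiler vectorizes it, which matches the unsuccessful initial investigation mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline void histogram_even_simd_ordered(const std::vector<float>& input,
                                        float lo, float hi,
                                        std::vector<std::uint32_t>& bins) {
    assert(!bins.empty() && lo < hi);
    const std::size_t n = input.size();
    const float scale = static_cast<float>(bins.size()) / (hi - lo);
    const std::int32_t last_bin = static_cast<std::int32_t>(bins.size()) - 1;
    const float* in = input.data();
    std::uint32_t* counts = bins.data();

    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        // Index computation: the part that could run across SIMD lanes.
        const float v = in[i];
        const bool in_range = (v >= lo) && (v < hi);
        const float safe = in_range ? v : lo;
        std::int32_t id = static_cast<std::int32_t>((safe - lo) * scale);
        id = id > last_bin ? last_bin : id;

        // Increment: must execute in iteration order, one lane at a time,
        // because two lanes may target the same bin.
        #pragma omp ordered simd
        {
            if (in_range) ++counts[id];
        }
    }
}
```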
For now I'll ask that we leave it as described in the RFC, which gives some understanding of how this can be improved in the future, but starts without vectorization for this phase.
We can add an issue to explore using simd ordered to get some improvement for histogram even, and leave it out for this RFC and the initial PR implementation.
I see. OK, let's leave it as described.
Agree with Alexey, especially since at least the host patterns "sort" and "merge" already use thresholds for switching to a serial implementation, depending on the number of input elements.
Overall, looks good to me.
Just one comment: please add a note that the implementation may fall back to a serial implementation.
Re-approved.
The questions in my earlier approval are now addressed. The last couple of comments I made do not block the proposal from landing.
Waiting for a second approval to merge, so I added the small asks you made in the meantime, @akukanov.
LGTM
Merging, with approvals from @akukanov and @MikeDvorskiy.
Adds an RFC for histogram CPU implementation.