
RFC for histogram CPU implementation #1930

Merged: 34 commits merged into main on Jan 22, 2025
Conversation

@danhoeflinger (Contributor)

Adds an RFC for histogram CPU implementation.

Signed-off-by: Dan Hoeflinger <[email protected]>
more formatting fixes

@akukanov (Contributor)

akukanov commented Nov 6, 2024

Overall, this all sounds good enough for the "proposed" stage, where it's expected that some details are unknown and need to be determined. I am happy to approve it but will wait for a few days in case @danhoeflinger wants to update the document with some follow-up thoughts on the discussion.

Signed-off-by: Dan Hoeflinger <[email protected]>
@danhoeflinger (Contributor, Author)

One question I have for the group...
If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?

From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

@mmichel11 (Contributor)

mmichel11 commented Jan 13, 2025

One question I have for the group... If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?

From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

I agree that we should honor the user's request for a specific policy rather than using the serial implementation up to some empirically determined cutoff point. I also imagine that the exact cutoff point where the parallel version starts to perform better can vary greatly depending on a user's hardware setup, so giving users the freedom to choose when to switch from the serial to the parallel version may result in better performance than any generic decision we could make.

@mmichel11 (Contributor) left a comment

I've taken another pass through the document. I have a single question regarding how technical we want to get when explaining the algorithm.

The RFC looks ready to me.

rfcs/proposed/host_backend_histogram/README.md (review thread: outdated, resolved)
Signed-off-by: Dan Hoeflinger <[email protected]>
@akukanov (Contributor)

akukanov commented Jan 15, 2025

One question I have for the group...
If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?

From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

I believe yes, we can. Generally, a serial implementation is correct for parallel execution policies, so it's more of a QoI question whether to parallelize or not. While the policies do not mean "provide the fastest version", they do not mean "always use multiple threads" either. It's a permission to use multiple threads, but not an obligation.
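The quality-of-implementation choice described above can be sketched as a small-input fallback in the dispatch layer. This is a minimal, hypothetical sketch: all function names and the cutoff value are illustrative, not oneDPL's actual internals, and the parallel branch is left as a placeholder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative cutoff; in practice it would be empirically tuned and
// hardware-dependent, possibly also depending on the number of bins.
constexpr std::size_t serial_cutoff = 4096;

// Serial even-bin histogram: bin = (x - lo) * num_bins / (hi - lo).
void serial_histogram(const std::vector<float>& in, float lo, float hi,
                      std::vector<std::size_t>& bins)
{
    const float scale = static_cast<float>(bins.size()) / (hi - lo);
    for (float x : in)
    {
        auto b = static_cast<std::size_t>((x - lo) * scale);
        ++bins[std::min(b, bins.size() - 1)]; // clamp the x == hi edge case
    }
}

// Dispatch shim: a parallel policy is a permission to use threads, not an
// obligation, so falling back to the serial path below the cutoff is legal.
// A real implementation would invoke the parallel pattern in the else branch.
void histogram(const std::vector<float>& in, float lo, float hi,
               std::vector<std::size_t>& bins)
{
    if (in.size() < serial_cutoff)
        serial_histogram(in, lo, hi, bins);
    else
        serial_histogram(in, lo, hi, bins); // placeholder for the parallel path
}
```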


### SIMD/openMP SIMD Implementation
Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across
loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is
A contributor commented:
The second sentence may be omitted.
Based on the first sentence we can conclude that "OneDPL does not directly use any intrinsics..."

@danhoeflinger (Contributor, Author) replied:
applied.

increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as
each input has the same mathematical operations applied to each. However, for the custom range API, each input element
uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a
different length and control flow based upon each input element and will be very difficult to vectorize.
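The custom-range lookup described above can be sketched with `std::upper_bound`; this is an illustrative example, not the RFC's implementation. The data-dependent search depth and branching are what make this path hard to vectorize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative custom-range bin lookup: `boundaries` holds num_bins + 1
// ascending edges. Each element requires a binary search whose length and
// control flow depend on the data, which resists SIMD vectorization.
// Returns num_bins as an out-of-range sentinel.
std::size_t find_bin(const std::vector<float>& boundaries, float x)
{
    if (x < boundaries.front() || x >= boundaries.back())
        return boundaries.size() - 1; // sentinel == num_bins
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), x);
    return static_cast<std::size_t>(it - boundaries.begin()) - 1;
}
```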
@MikeDvorskiy (Contributor) commented Jan 15, 2025:
But we can calculate the bin indexes for the input data in a SIMD manner, and then process the results in a serial loop. No?
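The two-phase idea suggested here can be sketched as follows for the even-bin case. This is an illustrative sketch, not oneDPL code: phase 1 computes bin indexes in a vectorizable loop, and phase 2 accumulates serially to avoid conflicting increments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative two-phase even-bin histogram.
void two_phase_histogram(const std::vector<float>& in, float lo, float hi,
                         std::vector<std::size_t>& bins)
{
    const std::size_t n = in.size();
    std::vector<std::size_t> idx(n);
    const float scale = static_cast<float>(bins.size()) / (hi - lo);

    // Phase 1: data-parallel bin-index computation (no loop-carried deps),
    // at the cost of an extra index buffer and a second pass over the data.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        idx[i] = static_cast<std::size_t>((in[i] - lo) * scale);

    // Phase 2: serial accumulation; increments may conflict, so no SIMD here.
    for (std::size_t i = 0; i < n; ++i)
        ++bins[idx[i] < bins.size() ? idx[i] : bins.size() - 1];
}
```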

@danhoeflinger (Contributor, Author) replied Jan 15, 2025:
This is applicable only for the even binned case. Without using intrinsic operations, we must do this with omp simd and the ordered structured block. Initial investigation seemed to indicate that this was unsuccessful for generating vectorized code, and my suspicion is that it will not really help anyway. I can revisit this and attempt it, but the intention for now was to omit vectorizations from this first phase.
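For reference, the `omp simd` with `ordered` structured-block pattern mentioned here would look roughly like the sketch below. This is illustrative only; as noted above, initial investigation suggested compilers did not successfully vectorize this pattern for histogram.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of `#pragma omp simd` with an `ordered simd` block:
// the bin computation may vectorize, while the ordered block serializes only
// the conflicting histogram increments lane-by-lane. Without OpenMP enabled,
// the pragmas are ignored and the loop runs serially with the same result.
void simd_ordered_histogram(const std::vector<float>& in, float lo, float hi,
                            std::vector<std::size_t>& bins)
{
    const float scale = static_cast<float>(bins.size()) / (hi - lo);
    const std::size_t n = in.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
    {
        auto b = static_cast<std::size_t>((in[i] - lo) * scale);
        if (b >= bins.size())
            b = bins.size() - 1; // clamp the x == hi edge case
        #pragma omp ordered simd
        ++bins[b];
    }
}
```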

@danhoeflinger (Contributor, Author):

For now, I'll ask that we leave it as described in the RFC, which gives some understanding of how this can be improved in the future but starts without vectorization for this phase.
We can add an issue to explore using `simd ordered` to get some improvement for the even-bin histogram, and leave it out of this RFC and the initial PR implementation.

A contributor replied:

I see. OK, let's leave it as described.

Signed-off-by: Dan Hoeflinger <[email protected]>
@MikeDvorskiy (Contributor)

MikeDvorskiy commented Jan 16, 2025

I believe yes, we can. Generally, a serial implementation is correct for parallel execution policies, so it's more of a QoI question whether to parallelize or not. While the policies do not mean "provide the fastest version", they do not mean "always use multiple threads" either. It's a permission to use multiple threads, but not an obligation.

Agree with Alexey. Especially since at least the host patterns "sort" and "merge" already use thresholds, based on the number of input elements, for switching to a serial implementation.
So it probably makes sense to add a description stating that the implementation may fall back to a serial implementation for a "small" number of input elements for performance reasons.

@MikeDvorskiy (Contributor) left a comment

Overall, looks good to me.
Just one comment about adding a description that the implementation may fall back to a serial implementation.

@akukanov (Contributor) previously approved these changes on Jan 16, 2025, and left a comment:

Re-approved.

The questions from my earlier approval are now addressed. The last couple of comments I made do not block the proposal from landing.

Signed-off-by: Dan Hoeflinger <[email protected]>
@danhoeflinger (Contributor, Author)

Waiting for a second approval to merge; in the meantime, I have added the small asks you made, @akukanov.

@MikeDvorskiy (Contributor) left a comment

LGTM

@danhoeflinger (Contributor, Author)

Merging, with approvals from @akukanov and @MikeDvorskiy .

danhoeflinger merged commit 3358cf4 into main on January 22, 2025 at 13:31 (2 checks passed) and deleted the dev/dhoeflin/rfc_histogram_cpu branch.