
RFC for histogram CPU implementation #1930

Merged: 34 commits merged into main on Jan 22, 2025
Conversation

@danhoeflinger (Contributor)

Adds an RFC for histogram CPU implementation.

Signed-off-by: Dan Hoeflinger <[email protected]>
more formatting fixes

@akukanov (Contributor)

akukanov commented Nov 6, 2024

Overall, this all sounds good enough for the "proposed" stage, where it's expected that some details are unknown and need to be determined. I am happy to approve it but will wait for a few days in case @danhoeflinger wants to update the document with some follow-up thoughts on the discussion.

Signed-off-by: Dan Hoeflinger <[email protected]>
@danhoeflinger (Contributor, Author)

One question I have for the group...
If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?

From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

@mmichel11 (Contributor)

mmichel11 commented Jan 13, 2025

One question I have for the group... If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?

From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

I agree that we should honor the user's request for a specific policy rather than using the serial implementation up to some empirically determined cutoff point. I also imagine that the exact cutoff point where the parallel version starts to perform better can vary greatly depending on a user's hardware setup, so giving users the freedom to choose when to switch from the serial to the parallel version may result in better performance than any generic decision we could make.

@mmichel11 (Contributor) left a comment

I've taken another pass through the document. I have a single question regarding how technical we want to get when explaining the algorithm.

The RFC looks ready to me.

rfcs/proposed/host_backend_histogram/README.md (review thread: outdated, resolved)
Signed-off-by: Dan Hoeflinger <[email protected]>
@akukanov (Contributor)

akukanov commented Jan 15, 2025

One question I have for the group...
If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?

From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

I believe yes, we can. Generally, a serial implementation is correct for parallel execution policies, so it's more of a QoI question whether to parallelize or not. While the policies do not mean "provide the fastest version", they do not mean "always use multiple threads" either. It's a permission to use multiple threads, but not an obligation.
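The quality-of-implementation choice described above can be sketched as a small-input fallback in the dispatch layer. This is a minimal, hypothetical sketch: all function names and the cutoff value are illustrative, not oneDPL's actual internals, and the parallel branch is left as a placeholder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative cutoff; in practice it would be empirically tuned and
// hardware-dependent, possibly also depending on the number of bins.
constexpr std::size_t serial_cutoff = 4096;

// Serial even-bin histogram: bin = (x - lo) * num_bins / (hi - lo).
void serial_histogram(const std::vector<float>& in, float lo, float hi,
                      std::vector<std::size_t>& bins)
{
    const float scale = static_cast<float>(bins.size()) / (hi - lo);
    for (float x : in)
    {
        auto b = static_cast<std::size_t>((x - lo) * scale);
        ++bins[std::min(b, bins.size() - 1)]; // clamp the x == hi edge case
    }
}

// Dispatch shim: a parallel policy is a permission to use threads, not an
// obligation, so falling back to the serial path below the cutoff is legal.
// A real implementation would invoke the parallel pattern in the else branch.
void histogram(const std::vector<float>& in, float lo, float hi,
               std::vector<std::size_t>& bins)
{
    if (in.size() < serial_cutoff)
        serial_histogram(in, lo, hi, bins);
    else
        serial_histogram(in, lo, hi, bins); // placeholder for the parallel path
}
```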


### SIMD/openMP SIMD Implementation
Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across
loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is
A contributor commented:
The second sentence may be omitted.
Based on the first sentence we can conclude that "OneDPL does not directly use any intrinsics..."

@danhoeflinger (Contributor, Author) replied:
applied.

increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as
each input has the same mathematical operations applied to each. However, for the custom range API, each input element
uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a
different length and control flow based upon each input element and will be very difficult to vectorize.
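The custom-range lookup described above can be sketched with `std::upper_bound`; this is an illustrative example, not the RFC's implementation. The data-dependent search depth and branching are what make this path hard to vectorize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative custom-range bin lookup: `boundaries` holds num_bins + 1
// ascending edges. Each element requires a binary search whose length and
// control flow depend on the data, which resists SIMD vectorization.
// Returns num_bins as an out-of-range sentinel.
std::size_t find_bin(const std::vector<float>& boundaries, float x)
{
    if (x < boundaries.front() || x >= boundaries.back())
        return boundaries.size() - 1; // sentinel == num_bins
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), x);
    return static_cast<std::size_t>(it - boundaries.begin()) - 1;
}
```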
@MikeDvorskiy (Contributor) commented Jan 15, 2025:
But we can calculate the bin indexes for the input data in a SIMD manner, and then process the results in a serial loop. No?
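The two-phase idea suggested here can be sketched as follows for the even-bin case. This is an illustrative sketch, not oneDPL code: phase 1 computes bin indexes in a vectorizable loop, and phase 2 accumulates serially to avoid conflicting increments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative two-phase even-bin histogram.
void two_phase_histogram(const std::vector<float>& in, float lo, float hi,
                         std::vector<std::size_t>& bins)
{
    const std::size_t n = in.size();
    std::vector<std::size_t> idx(n);
    const float scale = static_cast<float>(bins.size()) / (hi - lo);

    // Phase 1: data-parallel bin-index computation (no loop-carried deps),
    // at the cost of an extra index buffer and a second pass over the data.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        idx[i] = static_cast<std::size_t>((in[i] - lo) * scale);

    // Phase 2: serial accumulation; increments may conflict, so no SIMD here.
    for (std::size_t i = 0; i < n; ++i)
        ++bins[idx[i] < bins.size() ? idx[i] : bins.size() - 1];
}
```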

@danhoeflinger (Contributor, Author) replied Jan 15, 2025:
This is applicable only for the even binned case. Without using intrinsic operations, we must do this with omp simd and the ordered structured block. Initial investigation seemed to indicate that this was unsuccessful for generating vectorized code, and my suspicion is that it will not really help anyway. I can revisit this and attempt it, but the intention for now was to omit vectorizations from this first phase.
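For reference, the `omp simd` with `ordered` structured-block pattern mentioned here would look roughly like the sketch below. This is illustrative only; as noted above, initial investigation suggested compilers did not successfully vectorize this pattern for histogram.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of `#pragma omp simd` with an `ordered simd` block:
// the bin computation may vectorize, while the ordered block serializes only
// the conflicting histogram increments lane-by-lane. Without OpenMP enabled,
// the pragmas are ignored and the loop runs serially with the same result.
void simd_ordered_histogram(const std::vector<float>& in, float lo, float hi,
                            std::vector<std::size_t>& bins)
{
    const float scale = static_cast<float>(bins.size()) / (hi - lo);
    const std::size_t n = in.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
    {
        auto b = static_cast<std::size_t>((in[i] - lo) * scale);
        if (b >= bins.size())
            b = bins.size() - 1; // clamp the x == hi edge case
        #pragma omp ordered simd
        ++bins[b];
    }
}
```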

@danhoeflinger (Contributor, Author):

For now, I'll ask that we leave it as described in the RFC, which gives some understanding of how this can be improved in the future but starts without vectorization for this phase.
We can add an issue to explore using `simd ordered` to get some improvement for the even-bin histogram, and leave it out of this RFC and the initial PR implementation.

A contributor replied:

I see. OK, let's leave it as described.

Signed-off-by: Dan Hoeflinger <[email protected]>
@MikeDvorskiy (Contributor)

MikeDvorskiy commented Jan 16, 2025

I believe yes, we can. Generally, a serial implementation is correct for parallel execution policies, so it's more of a QoI question whether to parallelize or not. While the policies do not mean "provide the fastest version", they do not mean "always use multiple threads" either. It's a permission to use multiple threads, but not an obligation.

Agree with Alexey. Especially since at least the host patterns "sort" and "merge" already use thresholds, based on the number of input elements, for switching to a serial implementation.
So it probably makes sense to add a description stating that the implementation may fall back to a serial implementation for a "small" number of input elements for performance reasons.

@MikeDvorskiy (Contributor) left a comment

Overall, looks good to me.
Just one comment about adding a description that the implementation may fall back to a serial implementation.

@akukanov (Contributor) previously approved these changes on Jan 16, 2025, and left a comment:

Re-approved.

The questions from my earlier approval are now addressed. The last couple of comments I made do not block the proposal from landing.

Signed-off-by: Dan Hoeflinger <[email protected]>
@danhoeflinger (Contributor, Author)

Waiting for a second approval to merge; in the meantime, I have added the small asks you made, @akukanov.

@MikeDvorskiy (Contributor) left a comment

LGTM

@danhoeflinger (Contributor, Author)

Merging, with approvals from @akukanov and @MikeDvorskiy .

danhoeflinger merged commit 3358cf4 into main on January 22, 2025 at 13:31 (2 checks passed) and deleted the dev/dhoeflin/rfc_histogram_cpu branch.