Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Open
wants to merge 65 commits into main

Conversation

@mmichel11 (Contributor) commented Dec 19, 2024

High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel_for pattern through two ideas:

  • Process multiple input iterations per work-item, which involves a switch to an nd_range kernel combined with a sub-group / work-group strided indexing approach.
  • To generate wide loads for small types, implement a path that vectorizes loads / stores by processing adjacent indices within a single work-item. This is combined with the above approach to maximize hardware bandwidth utilization. Vectorization is only applied to fundamental types of size less than 4 bytes (e.g. uint16_t, uint8_t) stored in a contiguous container. A sketch of the combined scheme follows this list.
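
A minimal sketch of how the two ideas compose, assuming a work-group strided variant, unified shared memory, and hypothetical names (strided_vector_for, vec_size, iters); this illustrates the technique and is not the oneDPL kernel:

#include <sycl/sycl.hpp>
#include <cstddef>

// Each work-item handles vec_size adjacent elements per iteration and
// strides by the work-group size between iterations, so neighboring
// work-items touch neighboring chunks and the loads can merge into wide
// memory operations.
template <typename T, typename F>
void strided_vector_for(sycl::queue q, T* data, std::size_t n, F f)
{
    constexpr std::size_t vec_size = 4; // adjacent elements per work-item
    constexpr std::size_t iters = 8;    // input iterations per work-item
    const std::size_t wg_size = 256;
    const std::size_t per_wg = wg_size * vec_size * iters;
    const std::size_t num_wgs = (n + per_wg - 1) / per_wg;

    q.parallel_for(
         sycl::nd_range<1>{num_wgs * wg_size, wg_size},
         [=](sycl::nd_item<1> item) {
             const std::size_t wg_offset = item.get_group(0) * per_wg;
             const std::size_t lid = item.get_local_id(0);
             for (std::size_t i = 0; i < iters; ++i)
             {
                 const std::size_t base = wg_offset + (i * wg_size + lid) * vec_size;
                 if (base + vec_size <= n) // full chunk: no per-element bounds checks
                     for (std::size_t k = 0; k < vec_size; ++k)
                         data[base + k] = f(data[base + k]); // adjacent accesses vectorize
                 else // partial tail: scalar fallback with bounds checks
                     for (std::size_t k = base; k < n; ++k)
                         data[k] = f(data[k]);
             }
         })
        .wait();
}

For example, with a 1-byte element type and a vec_size of 4, a 32-wide sub-group touches 128 contiguous bytes per iteration, which is how the wide loads arise.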

Implementation Details

  • Parallel for bricks have been reworked in the following manner:
    • Each brick contains a pack of ranges within its template parameters to define tuning parameters.
    • The following static integral members are defined (implemented with inheritance):
      • __can_vectorize
      • __preferred_vector_size (1 if __can_vectorize is false)
      • __preferred_iters_per_item
    • The following public member functions are defined:
      • __scalar_path (for small input sizes this member function is explicitly called)
      • __vector_path (optional for algorithms that are not vectorizable, e.g. binary_search)
      • An overloaded function call operator which dispatches to the appropriate strategy (sketched just after this list)
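
A rough sketch of that brick shape, with the base's condition and constants as placeholders (the real bricks carry a pack of ranges and compute these values from it; __walk1_brick is a hypothetical name for illustration):

template <typename... _Ranges>
struct walk_vector_or_scalar_base
{
    static constexpr bool __can_vectorize = true; // placeholder: real base inspects the ranges
    static constexpr int __preferred_vector_size = __can_vectorize ? 4 : 1;
    static constexpr int __preferred_iters_per_item = 8; // placeholder
};

template <typename _F, typename _Range>
struct __walk1_brick : walk_vector_or_scalar_base<_Range>
{
    using __base_t = walk_vector_or_scalar_base<_Range>;
    _F __f;

    template <typename _IsFull, typename _ItemId>
    void __scalar_path(_IsFull, _ItemId __idx, _Range __rng) const
    {
        __f(__rng[__idx]);
    }

    template <typename _IsFull, typename _ItemId>
    void __vector_path(_IsFull __is_full, _ItemId __idx, _Range __rng) const
    {
        // Process __base_t::__preferred_vector_size adjacent elements from __idx.
    }

    // Dispatch to the appropriate strategy at compile time.
    template <typename _IsFull, typename _ItemId>
    void operator()(_IsFull __is_full, _ItemId __idx, _Range __rng) const
    {
        if constexpr (__base_t::__can_vectorize)
            __vector_path(__is_full, __idx, __rng);
        else
            __scalar_path(__is_full, __idx, __rng);
    }
};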

To implement this approach, the parallel_for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided-loop utilities have been defined with the intention that they be applicable to other portions of the codebase as well. Tests have been expanded to ensure coverage of the vectorization paths.

This PR supersedes #1870. Initially, the plan was to merge this PR into #1870, but after comparing the diffs, I believe the most straightforward approach is to target this directly to main.

@mmichel11 added this to the 2022.8.0 milestone Dec 19, 2024
@mmichel11 marked this pull request as ready for review December 19, 2024 19:17
@mmichel11 changed the title [Draft] Re-implement SYCL backend parallel_for to improve bandwidth utilization → Re-implement SYCL backend parallel_for to improve bandwidth utilization Dec 19, 2024
After inspecting the assembly, 128-byte memory operations are performed instead of 512-byte operations. Processing 512 bytes per sub-group still seems to be the best value after experimentation.

…ute work for small inputs

This reverts commit e4cbceb. Small sizes are slightly slower, and for horizontal vectorization no "real" benefit is observed.

Small but measurable overheads can be observed for small inputs where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the small input case in the original kernel shows the best performance.

We now flatten the user-provided ranges and find the minimum-sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in the SYCLomatic compatibility headers).
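
A compile-time sketch of that heuristic with hypothetical helper names (flatten, min_sizeof); the actual implementation lives in oneDPL's internals:

#include <algorithm>
#include <cstddef>
#include <tuple>
#include <utility>

// Flatten nested tuples of value types (as produced by zip_iterator-style
// wrappers) into one flat tuple.
template <typename _T>
struct flatten { using type = std::tuple<_T>; };

template <typename... _Ts>
struct flatten<std::tuple<_Ts...>>
{
    using type = decltype(std::tuple_cat(std::declval<typename flatten<_Ts>::type>()...));
};

// Smallest element size across the flattened pack; this can then drive the
// __iters_per_work_item estimate.
template <typename _Tuple>
struct min_sizeof;

template <typename... _Ts>
struct min_sizeof<std::tuple<_Ts...>>
{
    static constexpr std::size_t value = std::min({sizeof(_Ts)...});
};

// A zip of uint8_t and float buffers: the 1-byte type drives the choice.
static_assert(min_sizeof<flatten<std::tuple<unsigned char, float>>::type>::value == 1);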

@SergeyKopienko (Contributor) commented:

One more point: __vector_path and __scalar_path tell me about some path, but not about an implementation. Maybe it would be better to rename them to ..._impl?

@danhoeflinger (Contributor) left a comment:

First round of review. I've not gotten to all the details yet, but this is enough to be interesting.

include/oneapi/dpl/pstl/utils.h (review thread resolved)
Comment on lines +91 to +94:

static inline std::tuple<std::size_t, std::size_t, bool>
__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
                     std::size_t __adj_elements_per_work_item, std::size_t __work_group_size)
{
@danhoeflinger (Contributor):

Is this a general utility which might be useful for other commutative operations beyond just parallel_for, or is there a reason you believe this to be specific to this algorithm / kernel?

If we think it might be useful, we could lift this to a general utility level. Obviously we don't need to incorporate it elsewhere in this PR. An alternative is to add an issue to explore this and only lift it if we find utility.

@mmichel11 (Contributor, author):

@SergeyKopienko and I had a brief discussion regarding this in the first PR: #1870 (comment).

I think making it a static member was the best choice in the initial PR. However, in this new PR I am pushing to expose more general utilities for future use (e.g. strided loops, vectorization paths) to enforce good memory access patterns, and I think the stride recommender can be a good general utility. @SergeyKopienko, what are your thoughts on adding this function as a utility?

@danhoeflinger (Contributor):

Unless there is a clear and easy consensus to make it a utility, let's just leave it here and add an issue to explore its use elsewhere as a utility at a later time. I don't want to derail the PR for this purpose.

struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
@danhoeflinger (Contributor):

Let's fix the naming of this to __custom_brick while we're touching all of its instances.

@mmichel11 (Contributor, author):

It seems that the historical convention within the internal/ directory is to not use leading underscores, although it has changed a bit over time.

I do not have a strong preference whether we make this change or leave it as is, but maybe it fits into a broader discussion regarding the remaining implementations in this directory.

@danhoeflinger (Contributor):

I'm not sure there is a compelling reason to keep a different convention here, other than resistance to making purely cosmetic changes in the changelog. This is why I suggest adjusting it while we are already touching all (or most) instances of it. Perhaps someone with longer historical knowledge of this code could chime in here if there is a reason to keep the different convention.

Not super important to me, so optional nitpick.

@mmichel11 (Contributor, author):

I will wait a bit longer to see if anyone has objections. If not, then I will add this suggestion.

include/oneapi/dpl/pstl/hetero/dpcpp/unseq_backend_sycl.h (review thread resolved)
auto __raw_ptr3 = __rng3.begin();

oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType1> __rng1_vector[__base_t::__preferred_vector_size];
oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType2> __rng2_vector[__base_t::__preferred_vector_size];
@danhoeflinger (Contributor) commented Dec 31, 2024:

I think it should be possible to combine the walk*_vectors_or_scalars structs together with some complicated fold expressions, lambdas, tuples, and std::apply.

Take a look at the first answer of https://stackoverflow.com/questions/7230621/how-can-i-iterate-over-a-packed-variadic-template-argument-list. I think you should do something similar, chaining operations together by returning tuples and then expanding them with std::apply.

Here is an example I was playing with.
https://godbolt.org/z/vc8dK4ed6

In the end, I'm not sure if (1) it's actually possible and (2) it's worth the complexity to consolidate these structs, but it's worth considering...
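
For reference, a generic illustration of the pack-iteration technique being referenced (fold expressions plus std::apply), not taken from the oneDPL code:

#include <tuple>
#include <utility>

// Apply a callable to each element of a parameter pack by folding over the
// comma operator.
template <typename _F, typename... _Args>
void for_each_arg(_F __f, _Args&&... __args)
{
    (__f(std::forward<_Args>(__args)), ...);
}

// Apply a callable to each element of a tuple by expanding it with
// std::apply and folding.
template <typename _F, typename _Tuple>
void for_each_element(_F __f, _Tuple& __t)
{
    std::apply([&](auto&... __elems) { (__f(__elems), ...); }, __t);
}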

* Deleter is now a callable lambda returned from a static function in the class
* Deleter accepts an l-value reference to __lazy_ctor_storage

With uint8_t types, the icpx compiler fails to vectorize even when calling begin() on our range within a kernel to pull out a raw pointer. To work around this issue, begin() needs to be called on the host and the raw pointer passed to the kernel.
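
A minimal sketch of that workaround, assuming a hypothetical range type whose begin() yields a raw device-accessible pointer:

#include <sycl/sycl.hpp>
#include <cstddef>
#include <cstdint>

template <typename _Range>
void launch(sycl::queue __q, _Range __rng, std::size_t __n)
{
    // Host side: pull out the raw pointer here and capture it by value.
    std::uint8_t* __raw = __rng.begin();
    __q.parallel_for(sycl::range<1>{__n}, [=](sycl::id<1> __i) {
         // Accesses through __raw vectorize; calling __rng.begin() inside
         // the kernel inhibited icpx vectorization for uint8_t.
         __raw[__i] = static_cast<std::uint8_t>(__raw[__i] + 1);
     }).wait();
}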

@mmichel11 (Contributor, author) replied, quoting a reviewer:

> So now we have 3 entities that define constexpr static bool __can_vectorize:
>
>   1. class walk_vector_or_scalar_base
>   2. class walk_scalar_base
>   3. struct __brick_shift_left
>
> Do these constexpr variables really have different semantics?
>
> And if the semantics of these entities are the same, maybe it makes sense to do some re-design to have only one entity __can_vectorize?

These three cases are all unique when you consider that they define __can_vectorize, __preferred_vector_size, and __preferred_iters_per_item. These three fields are tightly coupled, so in my opinion it makes sense to define them together for readability. If we were to define a single __can_vectorize, then I think it would need to function more like a trait class dependent on the provided brick, as the brick itself plays a role in whether or not vectorization is possible. This design would not gain us much in my opinion, as we would still need specializations for the different cases.

The three unique cases I mention are the following (a sketch follows the list):

  1. struct walk_vector_or_scalar_base - Vectorization is possible so long as the ranges meet the requirements to be vectorizable. This is then used to determine the iterations per item and the vector size.
  2. struct walk_scalar_base - Vectorization is not possible due to some limitation of the brick. Binary search is a good example since its accesses are non-sequential. The iterations per work-item are still set based on the size of the provided ranges.
  3. struct __brick_shift_left - This brick has a limitation that prevents vectorization and allows only one iteration per item to be processed, so it is a special case.
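
A rough shape of those three cases, with placeholder values where the real bases compute from the range pack (__brick_shift_left_base is a hypothetical name; in the PR the members live on __brick_shift_left itself):

// Case 1: vectorization possible when the ranges qualify.
template <typename... _Ranges>
struct walk_vector_or_scalar_base
{
    static constexpr bool __can_vectorize = true; // placeholder: ranges qualify
    static constexpr int __preferred_vector_size = __can_vectorize ? 4 : 1;
    static constexpr int __preferred_iters_per_item = 8; // placeholder: from range sizes
};

// Case 2: the brick itself cannot vectorize (e.g. binary_search), but
// iterations per work-item are still tuned from the range sizes.
template <typename... _Ranges>
struct walk_scalar_base
{
    static constexpr bool __can_vectorize = false;
    static constexpr int __preferred_vector_size = 1;
    static constexpr int __preferred_iters_per_item = 8; // placeholder: from range sizes
};

// Case 3: shift_left's dependence pattern forces one iteration per item.
struct __brick_shift_left_base
{
    static constexpr bool __can_vectorize = false;
    static constexpr int __preferred_vector_size = 1;
    static constexpr int __preferred_iters_per_item = 1;
};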

@mmichel11 (Contributor, author) commented Jan 9, 2025, quoting a reviewer:

> In some ways the implementation details remind me of the tag-dispatching designed by @rarutyun, but with some differences: for example, walk2_vectors_or_scalars has not only the information about whether vectorization or parallelization should be executed, but also two variants of the functionality and an operator() with a compile-time condition check to run one code path or the other.

> But what if, instead of two different functions

    template <typename _IsFull, typename _ItemId>
    void
    __vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        // This is needed to enable vectorization
        auto __raw_ptr = __rng.begin();
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
                                                                                                 __raw_ptr);
    }

    // _IsFull is ignored here. We assume that boundary checking has already been performed for this index.
    template <typename _IsFull, typename _ItemId>
    void
    __scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
    {
        __f(__rng[__idx]);
    }

> we will have two functions with the same name and signature, differing only in the first parameter type, which would be used as a tag?
>
> Please take a look at __parallel_policy_tag_selector_t for details.

Thanks for the suggestion. I had previously considered this approach by having two function call operators and using
a true_type / false_type as a tag:

    template <typename _IsFull, typename _ItemId>
    void
    operator()(/*__vectorize*/ std::true_type, _IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f, __rng);
    }

    // _IsFull is ignored here. We assume that boundary checking has already been performed for this index.
    template <typename _IsFull, typename _ItemId>
    void
    operator()(/*__vectorize*/ std::false_type, _IsFull, const _ItemId __idx, _Range __rng) const
    {
        __f(__rng[__idx]);
    }

This allows us to remove the if constexpr ... else dispatch that shows up in each brick. The reason I decided against this is that the caller of the brick, oneapi::dpl::__par_backend_hetero::__strided_loop, would also need an additional vectorization tag to pass through to the brick. __strided_loop is intended to be a generic utility, and I do not believe we should require its caller to provide any tag beyond the necessary _IsFull used to avoid OOB accesses.
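
A hypothetical sketch of the __strided_loop shape described here, where only the full/partial information flows down to the brick (not the oneDPL source):

#include <cstddef>
#include <type_traits>

template <int __vec_size, typename _Brick, typename... _Rngs>
void
__strided_loop(std::size_t __n, std::size_t __base, std::size_t __stride,
               std::size_t __iters, _Brick __brick, _Rngs... __rngs)
{
    for (std::size_t __i = 0; __i < __iters; ++__i)
    {
        const std::size_t __idx = __base + __i * __stride;
        if (__idx + __vec_size <= __n)
            __brick(std::true_type{}, __idx, __rngs...);  // full: no bounds checks needed
        else
            __brick(std::false_type{}, __idx, __rngs...); // partial: brick must check bounds
    }
}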

Alternatively, we could replace __scalar_path_impl / __vector_path_impl with a single function __impl that accepts a vector tag as its first parameter. This is less clear in my opinion compared to separate function names.

This is well beyond the cutoff point to invoke the large submitter and
prevents timeouts observed on devices with many compute units when
testing CPU paths.
