Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

Open
SergeyKopienko opened this issue Jul 8, 2024 · 6 comments
Open

Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

SergeyKopienko opened this issue Jul 8, 2024 · 6 comments
Assignees

Comments

@SergeyKopienko
Copy link
Contributor

SergeyKopienko commented Jul 8, 2024

@dmitriy-sobolev dmitriy-sobolev changed the title sycl::group_barrier did not used in __group_barrier(_Item __item) Replace SYCL 1.2.1 group barrier with SYCL 2020 conformant alternative Jul 9, 2024
@dmitriy-sobolev
Copy link
Contributor

dmitriy-sobolev commented Jul 9, 2024

There are two major reasons of using the outdated barrier API:

  1. Performance. You can find the details here: Deprecate and remove nd_item::barrier. intel/llvm#12531.
  2. Non-compatible semantics, as mentioned in the code comments, which may require some work: https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L171-L174

I suppose that the most important reason is 1. However, oneDPL is claimed to be SYCL 2020 conformant, so SYCL 2020 group_barrier should be used, at least for the compilers other than oneAPI DPC++ compiler.

@dmitriy-sobolev dmitriy-sobolev changed the title Replace SYCL 1.2.1 group barrier with SYCL 2020 conformant alternative Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative Jul 9, 2024
@akukanov
Copy link
Contributor

akukanov commented Jul 9, 2024

Let's finally clarify the question of semantics.

The __nd_item.barrier(sycl::access::fence_space::local_space) that is currently in use has the following semantics in SYCL 1.2.1:

Executes a work-group barrier with memory ordering on the local address space ... . The current work-item will wait at the barrier until all work-items in the current work-group have reached the barrier. In addition the barrier performs a fence operation ensuring that all memory accesses in the specified address space issued before the barrier complete before those issued after the barrier.

In other words, it serves as both a barrier for work items and as a memory fence (with unclear ordering semantics, but at least acquire-release as it seems from the description) for operations within local (i.e. work group) memory scope.

The work-group barrier sycl::group_barrier(__nd_item.get_group(), sycl::memory_scope::work_group) has the following semantics in SYCL 2020:

Synchronizes all work-items in a group. The current work-item will wait at the barrier until all work-items in the group have reached the barrier. In addition, the barrier performs mem-fence operations ensuring that memory accesses issued before the barrier are not re-ordered with those issued after the barrier: all work-items in the group execute a release fence prior to synchronizing at the barrier, all work-items in the group execute an acquire fence afterwards, and there is an implicit synchronization of these fences as if provided by an explicit atomic operation on an atomic object. By default, the scope of these fences is set to the narrowest scope including all work-items in the group ... This scope may be optionally overridden with a wider scope, specified by the fence_scope argument.

In other words, it serves as both the barrier and the acquire-release memory fence in the specified memory scope, which is the group scope sycl::memory_scope::work_group (and which is also the default for work-groups, so it can as well be implicit).

The differences I observe are:

  • SYCL 2020 barrier allows to explicitly specify a group, while SYCL 1.2.1 barrier is always for the current work-group of an nd_item. Of course using get_group() aligns the barrier scope with that in 1.2.1.
  • SYCL 2020 barrier uses acquire-release memory ordering semantics, while SYCL 1.2.1 barrier is not clear about it. But as I said, it seems that its semantics is at least acquire-release as well, given the wording of "all memory accesses issued before the barrier complete before those issued after the barrier".

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

I think the comment telling about them being "not quite equivalent" is there either because earlier versions of SYCL 2020 did not provide enough clarity or because the barrier memory ordering semantics of 1.2.1 were confused with those of atomics, for which 1.2.1 only supported relaxed memory ordering. But I do not believe that the relaxed ordering would satisfy the described "complete before" requirement.

@al42and
Copy link

al42and commented Jul 10, 2024

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

As far as I understand, another difference (and the reason for performance drop mentioned in the linked issues) is that the old version "Executes a work-group barrier with memory ordering on the local address space", while the new version affects all memory operations, in both local and global address spaces (but both only within work-group/local scope):

sycl::memory_scope::work_group The ordering constraint applies only to work-items in the same work-group as the calling work-item;

@akukanov
Copy link
Contributor

Thanks @al42and - indeed, this is an important difference that I missed, and that impacts performance.

If that difference is important for oneDPL code, then we should make it visible - either in the oneDPL wrapper name or maybe with a template parameter - that this barrier orders operations only for data in local memory but does not order global data accesses.

@akukanov
Copy link
Contributor

The comment intel/llvm#12531 (comment) refers to a device compiler bug that has been fixed (intel/intel-graphics-compiler@ed639f6) and that should improve performance.

Despite the subtle semantical difference, we still need to drop the use of the outdated API.

@dmitriy-sobolev
Copy link
Contributor

dmitriy-sobolev commented Jan 5, 2025

I tried to understand the impact of the replacement of SYCL 1.2.1 barrier with SYCL 2020 one. Below is a table with speed-ups (times) after switching to SYCL 2020 barrier when using different drivers: LTS - long-term support, and Rolling - "experimental" with the latest features. Less than 1 is a slow down (times). The selected algorithms use group barriers.

Algorithm LTS (2350.61) - Jun 2024 Rolling (2441.21) - Nov 2024 LTS (2350.125) - Dec 2024
reduce 1.00 1.01 0.99
inclusive_scan 0.99 1.00 0.99
sort (merge) 0.97 0.99 0.97
sort (radix) 0.96 0.99 0.98
inclusive_scan_by_segment 0.91 0.99 0.91
reduce_by_segment 0.99 1.00 0.99
histogram (1024 bins) 0.97 1.00 0.98
  • oneAPI 2025.0, Ubuntu 24.04, Intel® Data Center GPU Max 1550, uint32_t

The fix is not available with LTS drivers. I think that it would be better to continue using SYCL 1.2.1 to avoid regression, and switch to SYCL 2020 later, e.g. when all LTS drivers listed here get that fix. We know that with DPC++ compiler still supports SYCL 1.2.1 barrier, but it is not applicable to other compilers, so SYCL 2020 is better to be used by default. This is what #1988 implements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants