Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

SergeyKopienko · 2024-07-08T12:52:42Z

https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L170

Please see details in intel/llvm#12531
Please see details in https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_group_broadcast

dmitriy-sobolev · 2024-07-09T15:12:27Z

There are two major reasons of using the outdated barrier API:

Performance. You can find the details here: Deprecate and remove nd_item::barrier. intel/llvm#12531.
Non-compatible semantics, as mentioned in the code comments, which may require some work: https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L171-L174

I suppose that the most important reason is 1. However, oneDPL is claimed to be SYCL 2020 conformant, so SYCL 2020 group_barrier should be used, at least for the compilers other than oneAPI DPC++ compiler.

akukanov · 2024-07-09T16:36:26Z

Let's finally clarify the question of semantics.

The __nd_item.barrier(sycl::access::fence_space::local_space) that is currently in use has the following semantics in SYCL 1.2.1:

Executes a work-group barrier with memory ordering on the local address space ... . The current work-item will wait at the barrier until all work-items in the current work-group have reached the barrier. In addition the barrier performs a fence operation ensuring that all memory accesses in the specified address space issued before the barrier complete before those issued after the barrier.

In other words, it serves as both a barrier for work items and as a memory fence (with unclear ordering semantics, but at least acquire-release as it seems from the description) for operations within local (i.e. work group) memory scope.

The work-group barrier sycl::group_barrier(__nd_item.get_group(), sycl::memory_scope::work_group) has the following semantics in SYCL 2020:

Synchronizes all work-items in a group. The current work-item will wait at the barrier until all work-items in the group have reached the barrier. In addition, the barrier performs mem-fence operations ensuring that memory accesses issued before the barrier are not re-ordered with those issued after the barrier: all work-items in the group execute a release fence prior to synchronizing at the barrier, all work-items in the group execute an acquire fence afterwards, and there is an implicit synchronization of these fences as if provided by an explicit atomic operation on an atomic object. By default, the scope of these fences is set to the narrowest scope including all work-items in the group ... This scope may be optionally overridden with a wider scope, specified by the fence_scope argument.

In other words, it serves as both the barrier and the acquire-release memory fence in the specified memory scope, which is the group scope sycl::memory_scope::work_group (and which is also the default for work-groups, so it can as well be implicit).

The differences I observe are:

SYCL 2020 barrier allows to explicitly specify a group, while SYCL 1.2.1 barrier is always for the current work-group of an nd_item. Of course using get_group() aligns the barrier scope with that in 1.2.1.
SYCL 2020 barrier uses acquire-release memory ordering semantics, while SYCL 1.2.1 barrier is not clear about it. But as I said, it seems that its semantics is at least acquire-release as well, given the wording of "all memory accesses issued before the barrier complete before those issued after the barrier".

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

I think the comment telling about them being "not quite equivalent" is there either because earlier versions of SYCL 2020 did not provide enough clarity or because the barrier memory ordering semantics of 1.2.1 were confused with those of atomics, for which 1.2.1 only supported relaxed memory ordering. But I do not believe that the relaxed ordering would satisfy the described "complete before" requirement.

al42and · 2024-07-10T12:26:23Z

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

As far as I understand, another difference (and the reason for performance drop mentioned in the linked issues) is that the old version "Executes a work-group barrier with memory ordering on the local address space", while the new version affects all memory operations, in both local and global address spaces (but both only within work-group/local scope):

sycl::memory_scope::work_group The ordering constraint applies only to work-items in the same work-group as the calling work-item;

akukanov · 2024-07-10T12:54:36Z

Thanks @al42and - indeed, this is an important difference that I missed, and that impacts performance.

If that difference is important for oneDPL code, then we should make it visible - either in the oneDPL wrapper name or maybe with a template parameter - that this barrier orders operations only for data in local memory but does not order global data accesses.

akukanov · 2024-09-25T15:08:29Z

The comment intel/llvm#12531 (comment) refers to a device compiler bug that has been fixed (intel/intel-graphics-compiler@ed639f6) and that should improve performance.

Despite the subtle semantical difference, we still need to drop the use of the outdated API.

dmitriy-sobolev · 2025-01-05T17:45:19Z

I tried to understand the impact of the replacement of SYCL 1.2.1 barrier with SYCL 2020 one. Below is a table with speed-ups (times) after switching to SYCL 2020 barrier when using different drivers: LTS - long-term support, and Rolling - "experimental" with the latest features. Less than 1 is a slow down (times). The selected algorithms use group barriers.

Algorithm	LTS (2350.61) - Jun 2024	Rolling (2441.21) - Nov 2024	LTS (2350.125) - Dec 2024
reduce	1.00	1.01	0.99
inclusive_scan	0.99	1.00	0.99
sort (merge)	0.97	0.99	0.97
sort (radix)	0.96	0.99	0.98
inclusive_scan_by_segment	0.91	0.99	0.91
reduce_by_segment	0.99	1.00	0.99
histogram (1024 bins)	0.97	1.00	0.98

oneAPI 2025.0, Ubuntu 24.04, Intel® Data Center GPU Max 1550, uint32_t

The fix is not available with LTS drivers. I think that it would be better to continue using SYCL 1.2.1 to avoid regression, and switch to SYCL 2020 later, e.g. when all LTS drivers listed here get that fix. We know that with DPC++ compiler still supports SYCL 1.2.1 barrier, but it is not applicable to other compilers, so SYCL 2020 is better to be used by default. This is what #1988 implements.

SergeyKopienko mentioned this issue Jul 8, 2024

Remove support of old compilers from oneDPL code #1678

Draft

dmitriy-sobolev changed the title ~~sycl::group_barrier did not used in __group_barrier(_Item __item)~~ Replace SYCL 1.2.1 group barrier with SYCL 2020 conformant alternative Jul 9, 2024

dmitriy-sobolev changed the title ~~Replace SYCL 1.2.1 group barrier with SYCL 2020 conformant alternative~~ Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative Jul 9, 2024

akukanov mentioned this issue Jul 9, 2024

Invalid macro check in __dpl_sycl::__group_barrier implementation #1615

Closed

dmitriy-sobolev self-assigned this Jan 2, 2025

dmitriy-sobolev mentioned this issue Jan 5, 2025

Use SYCL 2020 barrier instead of SYCL 1.2.1 one by default #1988

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

SergeyKopienko commented Jul 8, 2024 •

edited by akukanov

Loading

dmitriy-sobolev commented Jul 9, 2024 •

edited

Loading

akukanov commented Jul 9, 2024 •

edited

Loading

al42and commented Jul 10, 2024

akukanov commented Jul 10, 2024

akukanov commented Sep 25, 2024

dmitriy-sobolev commented Jan 5, 2025 •

edited

Loading

Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

Comments

SergeyKopienko commented Jul 8, 2024 • edited by akukanov Loading

dmitriy-sobolev commented Jul 9, 2024 • edited Loading

akukanov commented Jul 9, 2024 • edited Loading

al42and commented Jul 10, 2024

akukanov commented Jul 10, 2024

akukanov commented Sep 25, 2024

dmitriy-sobolev commented Jan 5, 2025 • edited Loading

SergeyKopienko commented Jul 8, 2024 •

edited by akukanov

Loading

dmitriy-sobolev commented Jul 9, 2024 •

edited

Loading

akukanov commented Jul 9, 2024 •

edited

Loading

dmitriy-sobolev commented Jan 5, 2025 •

edited

Loading