Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679
sycl::group_barrier is not used in __group_barrier(_Item __item).
There are two major reasons for using the outdated barrier API:
I suppose that the most important reason is 1. However, oneDPL claims to be SYCL 2020 conformant, so the SYCL 2020 group_barrier should be used, at least for compilers other than the oneAPI DPC++ compiler.
Let's finally clarify the question of semantics. The
In other words, it serves as both a barrier for work items and as a memory fence (with unclear ordering semantics, but at least acquire-release as it seems from the description) for operations within local (i.e. work group) memory scope. The work-group barrier
In other words, it serves as both the barrier and the acquire-release memory fence in the specified memory scope, which is the group scope. The differences I observe are:
All in all, it seems that the new I think the comment saying they are "not quite equivalent" is there either because earlier versions of SYCL 2020 did not provide enough clarity, or because the barrier memory ordering semantics of 1.2.1 were confused with those of atomics, for which 1.2.1 only supported relaxed memory ordering. But I do not believe that relaxed ordering would satisfy the described "complete before" requirement.
As far as I understand, another difference (and the reason for performance drop mentioned in the linked issues) is that the old version "Executes a work-group barrier with memory ordering on the local address space", while the new version affects all memory operations, in both local and global address spaces (but both only within work-group/local scope):
Thanks @al42and - indeed, this is an important difference that I missed, and one that impacts performance. If that difference matters for oneDPL code, then we should make it visible, either in the oneDPL wrapper name or perhaps with a template parameter: this barrier orders operations only for data in local memory and does not order global data accesses.
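To make the "template parameter" idea concrete, here is a minimal sketch, not oneDPL code. The names barrier_order, orders_global_memory, and group_barrier_sketch are hypothetical and invented for illustration. Only nd_item::barrier (SYCL 1.2.1, deprecated in SYCL 2020) and sycl::group_barrier (SYCL 2020) are real SYCL calls. The device-side part is compiled only when a SYCL header is found, and the mapping of each API to an address space follows the reading given in this thread.

```cpp
#include <cassert>

// Compile the device-side wrapper only when a SYCL header is available.
#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>
#define BARRIER_SKETCH_HAS_SYCL 1
#else
#define BARRIER_SKETCH_HAS_SYCL 0
#endif

// Hypothetical tag (not a oneDPL name): which memory the barrier has to order.
enum class barrier_order
{
    local_only,      // accesses to work-group local memory only
    local_and_global // accesses to both local and global memory
};

// Host-checkable helper: does the requested ordering include global memory?
constexpr bool
orders_global_memory(barrier_order order)
{
    return order == barrier_order::local_and_global;
}

#if BARRIER_SKETCH_HAS_SYCL
// Hypothetical wrapper: the template argument states the ordering the caller relies on.
template <barrier_order Order = barrier_order::local_and_global, int Dims>
void
group_barrier_sketch(sycl::nd_item<Dims> item)
{
    if constexpr (orders_global_memory(Order))
    {
        // SYCL 2020 barrier over the work-group.
        sycl::group_barrier(item.get_group());
    }
    else
    {
        // SYCL 1.2.1 barrier (deprecated) with a fence on the local address space only.
        item.barrier(sycl::access::fence_space::local_space);
    }
}
#endif
```

Inside a kernel this would be called as group_barrier_sketch<barrier_order::local_only>(item), so a reader of the call site can see that global accesses are not expected to be ordered by it.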
The comment intel/llvm#12531 (comment) refers to a device compiler bug that has since been fixed (intel/intel-graphics-compiler@ed639f6); the fix should improve performance. Despite the subtle semantic difference, we still need to drop the use of the outdated API.
I tried to understand the impact of replacing the SYCL 1.2.1 barrier with the SYCL 2020 one. Below is a table of speed-ups (as ratios) after switching to the SYCL 2020 barrier with different drivers: LTS (long-term support) and Rolling ("experimental", with the latest features). Less than
The fix is not available with LTS drivers. I think that it would be better to continue using the SYCL 1.2.1 barrier to avoid a regression, and to switch to SYCL 2020 later, e.g. when all LTS drivers listed here get that fix. We know that the DPC++ compiler still supports the SYCL 1.2.1 barrier, but that does not apply to other compilers, so the SYCL 2020 barrier is the better default. This is what #1988 implements.
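As a rough illustration of that policy, the selection could be sketched as below. This is one reading of the comment above, not the contents of #1988. The names barrier_api, select_barrier_api, and the sketch_* constants are hypothetical. Using the __INTEL_LLVM_COMPILER macro to detect the oneAPI DPC++ compiler is also an assumption about how the check might be written.

```cpp
#include <cassert>

// Hypothetical tag naming the two barrier APIs discussed in this issue.
enum class barrier_api
{
    sycl_1_2_1, // nd_item::barrier with a fence_space argument
    sycl_2020   // sycl::group_barrier
};

// Hypothetical policy: keep the 1.2.1 barrier only where it is known to be
// supported (oneAPI DPC++), and use the SYCL 2020 barrier everywhere else.
constexpr barrier_api
select_barrier_api(bool is_oneapi_dpcpp)
{
    return is_oneapi_dpcpp ? barrier_api::sycl_1_2_1 : barrier_api::sycl_2020;
}

// Assumption: __INTEL_LLVM_COMPILER identifies the oneAPI DPC++ compiler.
#if defined(__INTEL_LLVM_COMPILER)
inline constexpr bool sketch_is_oneapi_dpcpp = true;
#else
inline constexpr bool sketch_is_oneapi_dpcpp = false;
#endif

// The API a wrapper such as __group_barrier would dispatch to under this policy.
inline constexpr barrier_api sketch_default_barrier_api = select_barrier_api(sketch_is_oneapi_dpcpp);
```

Keeping the decision in a single constant means the switch to SYCL 2020 on DPC++ later is a one-line change once the LTS drivers carry the fix.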
https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L170
Please see details in intel/llvm#12531
Please see details in https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_group_broadcast