RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356

arun-chandran-edarath · 2024-12-05T10:46:35Z

Glibc has two tunable parameters that control the use of 'REP MOVSB':

a) __x86_rep_movsb_threshold
b) __x86_rep_movsb_stop_threshold

Ref: https://elixir.bootlin.com/glibc/glibc-2.34/source/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

Users can set the desired value of __x86_rep_movsb_threshold through the tunable glibc.cpu.x86_rep_movsb_threshold and can decide which length range uses ERMS and which range uses
vectorized routines.

Ref: https://www.gnu.org/software/libc/manual/html_node/Tunables.html

These tunable parameters are set based on the x86 CPU type and available hardware features. Instead of replicating this infrastructure in UCX, we can utilize the 'REP MOVSB' mechanism via memcpy().

Glibc has two tunable parameters that control the use of 'REP MOVSB': a) __x86_rep_movsb_threshold b) __x86_rep_movsb_stop_threshold Ref: https://elixir.bootlin.com/glibc/glibc-2.34/source/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Users can set the desired value of __x86_rep_movsb_threshold through the tunable glibc.cpu.x86_rep_movsb_threshold and can decide which length range uses ERMS and which range uses vectorized routines. Ref: https://www.gnu.org/software/libc/manual/html_node/Tunables.html These tunable parameters are set based on the x86 CPU type and available hardware features. Instead of replicating this infrastructure in UCX, we can utilize the 'REP MOVSB' mechanism provided by memcpy().

yosefe · 2024-12-06T00:39:35Z

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

arun-chandran-edarath · 2024-12-06T04:33:08Z

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

I was not suggesting to use global configs of glibc to control the memcpy() behaviour while running ucx. My point is choice of 'REP MOVSB' instruction to copy data depends on lot of hardware factors and glibc does a good job in deciding when to use it. We are using a miniature version of that decision logic in ucx by using only two variables builtin_memcpy_min and builtin_memcpy_max, which may not give the right result on every x86 hardware.

yosefe · 2024-12-08T13:46:22Z

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

I was not suggesting to use global configs of glibc to control the memcpy() behaviour while running ucx. My point is choice of 'REP MOVSB' instruction to copy data depends on lot of hardware factors and glibc does a good job in deciding when to use it. We are using a miniature version of that decision logic in ucx by using only two variables builtin_memcpy_min and builtin_memcpy_max, which may not give the right result on every x86 hardware.

@arun-chandran-edarath we observed that intoroducing the "rep movsb" memory copy does improve the performance in some cases. was the "__x86_rep_movsb_threshold" availbale in older glibc versions as well (rh7/rh8)?

arun-chandran-edarath · 2024-12-09T15:18:51Z

glibc versions as well (rh7/rh8)?

I think rh7 glibc version is 2.17 and rh8 version is 2.28.
Ref: https://superuser.com/questions/18078/which-versions-of-glibc-do-different-versions-of-redhat-ship-with

The tunable "__x86_rep_movsb_threshold" came much later in version 2.32
arun@arun-ubuntu-2310:/glibc$ git show origin/release/2.31/master:sysdeps/x86/dl-tunables.list | grep movsb
arun@arun-ubuntu-2310:/glibc$ git show origin/release/2.32/master:sysdeps/x86/dl-tunables.list | grep movsb
x86_rep_movsb_threshold {

Checking the version history

arun@arun-ubuntu-2310:~/glibc$ git blame sysdeps/x86/dl-tunables.list
..

46b5e98ef6f (Noah Goldstein 46b5e98ef6f (Noah Goldstein 46b5e98ef6f (Noah Goldstein 3f4b61a0b8d (H.J. Lu 3f4b61a0b8d (H.J. Lu 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 475b63702ef (Noah Goldstein 3f4b61a0b8d (H.J. Lu 3f4b61a0b8d (H.J. Lu 2024-05-24 12:38:51 -0500 33) x86_memset_non_temporal_threshold {
2024-05-24 12:38:51 -0500 34) type: SIZE_T
2024-05-24 12:38:51 -0500 35) }
2020-07-06 11:48:09 -0700 36) x86_rep_movsb_threshold {
2020-07-06 11:48:09 -0700 37) type: SIZE_T
2021-11-01 00:49:52 -0500 38) # Since there is overhead to set up REP MOVSB operation, REP
2021-11-01 00:49:52 -0500 39) # MOVSB isn't faster on short data. The memcpy micro benchmark
2021-11-01 00:49:52 -0500 40) # in glibc shows that 2KB is the approximate value above which
2021-11-01 00:49:52 -0500 41) # REP MOVSB becomes faster than SSE2 optimization on processors
2021-11-01 00:49:52 -0500 42) # with Enhanced REP MOVSB. Since larger register size can move
2021-11-01 00:49:52 -0500 43) # more data with a single load and store, the threshold is
2021-11-01 00:49:52 -0500 44) # higher with larger register size. Micro benchmarks show AVX
2021-11-01 00:49:52 -0500 45) # REP MOVSB becomes faster apprximately at 8KB. The AVX512
2021-11-01 00:49:52 -0500 46) # threshold is extrapolated to 16KB. For machines with FSRM the
2021-11-01 00:49:52 -0500 47) # threshold is universally set at 2112 bytes. Note: Since the
2021-11-01 00:49:52 -0500 48) # REP MOVSB threshold must be greater than 8 times of vector
2021-11-01 00:49:52 -0500 49) # size and the default value is 4096 * (vector size / 16), the
2021-11-01 00:49:52 -0500 50) # default value and the minimum value must be updated at
2021-11-01 00:49:52 -0500 51) # run-time. NB: Don't set the default value since we can't tell
2021-11-01 00:49:52 -0500 52) # if the tunable value is set by user or not [BZ #27069].
2020-07-06 11:48:09 -0700 53) minval: 1
2020-07-06 11:48:09 -0700 54) }

arun@arun-ubuntu-2310:~/glibc$ git show 3f4b61a0b8d
commit 3f4b61a0b8de67ef9f20737919c713ddfc4bd620
Author: H.J. Lu [email protected]
Date: Mon Jul 6 11:48:09 2020 -0700 -----> Addition of this tunable appeared first

x86: Add thresholds for "rep movsb/stosb" to tunables
                                         
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.
                                             
Note that the user specified threshold for "rep movsb" smaller than 
the minimum threshold will be ignored.                                                             
                                                                                                   
Reviewed-by: Carlos O'Donell <[email protected]>

yosefe · 2024-12-09T17:45:58Z

So i think that we need to keep ucx internal "rep movsb" code, maybe disabale it depending to glibc version or the presence of x86_rep_stosb_threshold symbol

arun-chandran-edarath · 2024-12-10T09:43:46Z

So i think that we need to keep ucx internal "rep movsb" code, maybe disabale it depending to glibc version or the presence of x86_rep_stosb_threshold symbol

Even though rh8 is using glibc version 2.28 it supports tuning knob 'glibc.cpu.x86_rep_movsb_threshold', that means they have backported the fix to 2.28.

arun-chandran-edarath · 2024-12-10T09:50:53Z

So i think that we need to keep ucx internal "rep movsb" code, maybe disabale it depending to glibc version or the presence of x86_rep_stosb_threshold symbol

Even though rh8 is using glibc version 2.28 it supports tuning knob 'glibc.cpu.x86_rep_movsb_threshold', that means they have backported the fix to 2.28.

rh7 is in it's 'Extended life cycle support'

Ref: https://access.redhat.com/support/policy/updates/errata

Do we need to care for such older os-releases?

arun-chandran-edarath · 2024-12-10T15:58:10Z

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

I was not suggesting to use global configs of glibc to control the memcpy() behaviour while running ucx. My point is choice of 'REP MOVSB' instruction to copy data depends on lot of hardware factors and glibc does a good job in deciding when to use it. We are using a miniature version of that decision logic in ucx by using only two variables builtin_memcpy_min and builtin_memcpy_max, which may not give the right result on every x86 hardware.

Please look at the bug below for a case when the distance between src and dst degrades the 'rep movsb' performance.

https://sourceware.org/bugzilla/show_bug.cgi?id=27130

Commit: https://sourceware.org/git/?p=glibc.git;a=commit;h=3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5

Description:
x86-64: Avoid rep movsb with short distance [BZ #27130]

When copying with "rep movsb", if the distance between source and
destination is N*4GB + [1..63] with N >= 0, performance may be very
slow. This patch updates memmove-vec-unaligned-erms.S for AVX and
AVX512 versions with the distance in RCX:

arun-chandran-edarath · 2024-12-12T06:13:16Z

Glibc has the below logic depending on the cpu features to set default values for 'rep_movsb_threshold' and decide the window of usage for 'rep movsb'

https://codebrowser.dev/glibc/glibc/sysdeps/x86/dl-cacheinfo.h.html#963
I don't think we need to replicate that in ucx.

@yosefe what do you think? keep the current code as it is, and disable compiling it by default with '#undef ENABLE_BUILTIN_MEMCPY' while configuring ucx?

Note: The current value set for builtin_memcpy_min, in UCX for Intel processors is 1KB, it differs from what glibc uses as min value, it is 2112 for glibc for processors with FSRM feature.

yosefe · 2024-12-12T08:14:18Z

Glibc has the below logic depending on the cpu features to set default values for 'rep_movsb_threshold' and decide the window of usage for 'rep movsb'

https://codebrowser.dev/glibc/glibc/sysdeps/x86/dl-cacheinfo.h.html#963 I don't think we need to replicate that in ucx.

@yosefe what do you think? keep the current code as it is, and disable compiling it by default with '#undef ENABLE_BUILTIN_MEMCPY' while configuring ucx?

Note: The current value set for builtin_memcpy_min, in UCX for Intel processors is 1KB, it differs from what glibc uses as min value, it is 2112 for glibc for processors with FSRM feature.

I think that in order to disable by default or remove the existing code we need to ensure it does not make performance worse. Today it is enabled on both Intel and AMD CPUs, so need to check on both I guess. Also, the current code enables inlining the memcpy function in the calling code (though AFAIU some compilers can do it anyway with builtin memcpy)

arun-chandran-edarath · 2024-12-12T09:40:39Z

Glibc has the below logic depending on the cpu features to set default values for 'rep_movsb_threshold' and decide the window of usage for 'rep movsb'
https://codebrowser.dev/glibc/glibc/sysdeps/x86/dl-cacheinfo.h.html#963 I don't think we need to replicate that in ucx.
@yosefe what do you think? keep the current code as it is, and disable compiling it by default with '#undef ENABLE_BUILTIN_MEMCPY' while configuring ucx?
Note: The current value set for builtin_memcpy_min, in UCX for Intel processors is 1KB, it differs from what glibc uses as min value, it is 2112 for glibc for processors with FSRM feature.

I think that in order to disable by default or remove the existing code we need to ensure it does not make performance worse.
Today it is enabled on both Intel and AMD CPUs, so need to check on both I guess. Also, the current code enables inlining the memcpy function in the calling code (though AFAIU some compilers can do it anyway with builtin memcpy)

Currently, it is disabled by default (at runtime) on AMD CPUs in UCX, and I am not sure if any users override this via the UCX command line. Perhaps we could consider a compile-time check and disabling it for AMD CPUs?

From src/ucs/arch/cpu.c:

[UCS_CPU_VENDOR_AMD] = {
    .min = UCS_MEMUNITS_INF,
    .max = UCS_MEMUNITS_INF
},

So I think we need to check performance degradation only on Intel CPUs, if we want to disable/remove it.

Generally, using 'rep movsb' is not recommended on Zen3+ CPUs from AMD due to performance issues observed when 'src' and 'dst' are not aligned (https://sourceware.org/bugzilla/show_bug.cgi?id=30994). However, we see slight performance improvements when | src -dst | = 0 or multiple of 32

yosefe · 2024-12-12T09:53:09Z

Currently, it is disabled by default (at runtime) on AMD CPUs in UCX, and I am not sure if any users override this via the UCX command line. Perhaps we could consider a compile-time check and disabling it for AMD CPUs?

I don't think we can can disable it by default at compile time (at least, not yet), but maybe add configure flag to allow custom builds to disable it?

So I think we need to check performance degradation only on Intel CPUs, if we want to disable/remove it.

Right

Generally, using 'rep movsb' is not recommended on Zen3+ CPUs from AMD due to performance issues observed when 'src' and 'dst' are not aligned (https://sourceware.org/bugzilla/show_bug.cgi?id=30994). However, we see slight performance improvements when | src -dst | = 0 or multiple of 32

So do we need to make further adjustments in the code to make it optimal for AMD CPU?

arun-chandran-edarath · 2024-12-12T10:37:12Z

Currently, it is disabled by default (at runtime) on AMD CPUs in UCX, and I am not sure if any users override this via the UCX command line. Perhaps we could consider a compile-time check and disabling it for AMD CPUs?

I don't think we can can disable it by default at compile time (at least, not yet), but maybe add configure flag to allow custom builds to disable it?

I will explore this further

So I think we need to check performance degradation only on Intel CPUs, if we want to disable/remove it.

Right

Generally, using 'rep movsb' is not recommended on Zen3+ CPUs from AMD due to performance issues observed when 'src' and 'dst' are not aligned (https://sourceware.org/bugzilla/show_bug.cgi?id=30994). However, we see slight performance improvements when | src -dst | = 0 or multiple of 32

So do we need to make further adjustments in the code to make it optimal for AMD CPU?

It helps only the cases where |src-dst| & 0x1f == 0, we have seen the performance improvements in raw memcpy benchmarks.

To incorporate this to ucx we need to add additional branches, other unaligned cases have to go through the regular memcpy() path. So we need a very detailed investigation to check the usefulness of it within ucx framework.

Then another problem will arise, whether to use nt_buffer_transfer or 'rep movsb' for a given 'len' ? The 'rep movsb' code path will win the selection as it uses 'len' for comparison and a transfer of length 'total_len' can get divided into length 'len' that can fall between 'builtin_memcpy_min' & builtin_memcpy_max. The code is given below.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356

RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356

arun-chandran-edarath commented Dec 5, 2024 •

edited

Loading

yosefe commented Dec 6, 2024

arun-chandran-edarath commented Dec 6, 2024

yosefe commented Dec 8, 2024

arun-chandran-edarath commented Dec 9, 2024 •

edited

Loading

yosefe commented Dec 9, 2024

arun-chandran-edarath commented Dec 10, 2024

arun-chandran-edarath commented Dec 10, 2024

arun-chandran-edarath commented Dec 10, 2024

arun-chandran-edarath commented Dec 12, 2024 •

edited

Loading

yosefe commented Dec 12, 2024

arun-chandran-edarath commented Dec 12, 2024

yosefe commented Dec 12, 2024

arun-chandran-edarath commented Dec 12, 2024

RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356

Are you sure you want to change the base?

RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356

Conversation

arun-chandran-edarath commented Dec 5, 2024 • edited Loading

yosefe commented Dec 6, 2024

arun-chandran-edarath commented Dec 6, 2024

yosefe commented Dec 8, 2024

arun-chandran-edarath commented Dec 9, 2024 • edited Loading

Checking the version history

yosefe commented Dec 9, 2024

arun-chandran-edarath commented Dec 10, 2024

arun-chandran-edarath commented Dec 10, 2024

arun-chandran-edarath commented Dec 10, 2024

arun-chandran-edarath commented Dec 12, 2024 • edited Loading

yosefe commented Dec 12, 2024

arun-chandran-edarath commented Dec 12, 2024

yosefe commented Dec 12, 2024

arun-chandran-edarath commented Dec 12, 2024

arun-chandran-edarath commented Dec 5, 2024 •

edited

Loading

arun-chandran-edarath commented Dec 9, 2024 •

edited

Loading

arun-chandran-edarath commented Dec 12, 2024 •

edited

Loading