Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

arun-chandran-edarath
Copy link
Contributor

@arun-chandran-edarath arun-chandran-edarath commented Dec 5, 2024

Glibc has two tunable parameters that control the use of 'REP MOVSB':

a) __x86_rep_movsb_threshold
b) __x86_rep_movsb_stop_threshold

Ref: https://elixir.bootlin.com/glibc/glibc-2.34/source/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

Users can set the desired value of __x86_rep_movsb_threshold through the tunable glibc.cpu.x86_rep_movsb_threshold and can decide which length range uses ERMS and which range uses
vectorized routines.

Ref: https://www.gnu.org/software/libc/manual/html_node/Tunables.html

These tunable parameters are set based on the x86 CPU type and available hardware features. Instead of replicating this infrastructure in UCX, we can utilize the 'REP MOVSB' mechanism via memcpy().

Glibc has two tunable parameters that control the use of 'REP MOVSB':

a) __x86_rep_movsb_threshold
b) __x86_rep_movsb_stop_threshold

Ref: https://elixir.bootlin.com/glibc/glibc-2.34/source/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

Users can set the desired value of __x86_rep_movsb_threshold through
the tunable glibc.cpu.x86_rep_movsb_threshold and can decide
which length range uses ERMS and which range uses
vectorized routines.

Ref: https://www.gnu.org/software/libc/manual/html_node/Tunables.html

These tunable parameters are set based on the x86 CPU type and
available hardware features.  Instead of replicating this infrastructure
in UCX, we can utilize the 'REP MOVSB' mechanism provided by memcpy().
@yosefe
Copy link
Contributor

yosefe commented Dec 6, 2024

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

@arun-chandran-edarath
Copy link
Contributor Author

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

I was not suggesting to use global configs of glibc to control the memcpy() behaviour while running ucx. My point is choice of 'REP MOVSB' instruction to copy data depends on lot of hardware factors and glibc does a good job in deciding when to use it. We are using a miniature version of that decision logic in ucx by using only two variables builtin_memcpy_min and builtin_memcpy_max, which may not give the right result on every x86 hardware.

@yosefe
Copy link
Contributor

yosefe commented Dec 8, 2024

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

I was not suggesting to use global configs of glibc to control the memcpy() behaviour while running ucx. My point is choice of 'REP MOVSB' instruction to copy data depends on lot of hardware factors and glibc does a good job in deciding when to use it. We are using a miniature version of that decision logic in ucx by using only two variables builtin_memcpy_min and builtin_memcpy_max, which may not give the right result on every x86 hardware.

@arun-chandran-edarath we observed that intoroducing the "rep movsb" memory copy does improve the performance in some cases. was the "__x86_rep_movsb_threshold" availbale in older glibc versions as well (rh7/rh8)?

@arun-chandran-edarath
Copy link
Contributor Author

arun-chandran-edarath commented Dec 9, 2024

glibc versions as well (rh7/rh8)?

I think rh7 glibc version is 2.17 and rh8 version is 2.28.
Ref: https://superuser.com/questions/18078/which-versions-of-glibc-do-different-versions-of-redhat-ship-with

The tunable "__x86_rep_movsb_threshold" came much later in version 2.32
arun@arun-ubuntu-2310:/glibc$ git show origin/release/2.31/master:sysdeps/x86/dl-tunables.list | grep movsb
arun@arun-ubuntu-2310:
/glibc$ git show origin/release/2.32/master:sysdeps/x86/dl-tunables.list | grep movsb
x86_rep_movsb_threshold {

Checking the version history

arun@arun-ubuntu-2310:~/glibc$ git blame sysdeps/x86/dl-tunables.list
..

46b5e98ef6f (Noah Goldstein 2024-05-24 12:38:51 -0500 33) x86_memset_non_temporal_threshold {
46b5e98ef6f (Noah Goldstein 2024-05-24 12:38:51 -0500 34) type: SIZE_T
46b5e98ef6f (Noah Goldstein 2024-05-24 12:38:51 -0500 35) }
3f4b61a0b8d (H.J. Lu 2020-07-06 11:48:09 -0700 36) x86_rep_movsb_threshold {
3f4b61a0b8d (H.J. Lu 2020-07-06 11:48:09 -0700 37) type: SIZE_T
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 38) # Since there is overhead to set up REP MOVSB operation, REP
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 39) # MOVSB isn't faster on short data. The memcpy micro benchmark
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 40) # in glibc shows that 2KB is the approximate value above which
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 41) # REP MOVSB becomes faster than SSE2 optimization on processors
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 42) # with Enhanced REP MOVSB. Since larger register size can move
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 43) # more data with a single load and store, the threshold is
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 44) # higher with larger register size. Micro benchmarks show AVX
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 45) # REP MOVSB becomes faster apprximately at 8KB. The AVX512
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 46) # threshold is extrapolated to 16KB. For machines with FSRM the
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 47) # threshold is universally set at 2112 bytes. Note: Since the
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 48) # REP MOVSB threshold must be greater than 8 times of vector
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 49) # size and the default value is 4096 * (vector size / 16), the
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 50) # default value and the minimum value must be updated at
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 51) # run-time. NB: Don't set the default value since we can't tell
475b63702ef (Noah Goldstein 2021-11-01 00:49:52 -0500 52) # if the tunable value is set by user or not [BZ #27069].
3f4b61a0b8d (H.J. Lu 2020-07-06 11:48:09 -0700 53) minval: 1
3f4b61a0b8d (H.J. Lu 2020-07-06 11:48:09 -0700 54) }

arun@arun-ubuntu-2310:~/glibc$ git show 3f4b61a0b8d
commit 3f4b61a0b8de67ef9f20737919c713ddfc4bd620
Author: H.J. Lu [email protected]
Date: Mon Jul 6 11:48:09 2020 -0700 -----> Addition of this tunable appeared first

x86: Add thresholds for "rep movsb/stosb" to tunables
                                         
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.
                                             
Note that the user specified threshold for "rep movsb" smaller than 
the minimum threshold will be ignored.                                                             
                                                                                                   
Reviewed-by: Carlos O'Donell <[email protected]>

@yosefe
Copy link
Contributor

yosefe commented Dec 9, 2024

So i think that we need to keep ucx internal "rep movsb" code, maybe disabale it depending to glibc version or the presence of x86_rep_stosb_threshold symbol

@arun-chandran-edarath
Copy link
Contributor Author

So i think that we need to keep ucx internal "rep movsb" code, maybe disabale it depending to glibc version or the presence of x86_rep_stosb_threshold symbol

Even though rh8 is using glibc version 2.28 it supports tuning knob 'glibc.cpu.x86_rep_movsb_threshold', that means they have backported the fix to 2.28.

@arun-chandran-edarath
Copy link
Contributor Author

So i think that we need to keep ucx internal "rep movsb" code, maybe disabale it depending to glibc version or the presence of x86_rep_stosb_threshold symbol

Even though rh8 is using glibc version 2.28 it supports tuning knob 'glibc.cpu.x86_rep_movsb_threshold', that means they have backported the fix to 2.28.

rh7 is in it's 'Extended life cycle support'

image
Ref: https://access.redhat.com/support/policy/updates/errata

Do we need to care for such older os-releases?

@arun-chandran-edarath
Copy link
Contributor Author

@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX.

I was not suggesting to use global configs of glibc to control the memcpy() behaviour while running ucx. My point is choice of 'REP MOVSB' instruction to copy data depends on lot of hardware factors and glibc does a good job in deciding when to use it. We are using a miniature version of that decision logic in ucx by using only two variables builtin_memcpy_min and builtin_memcpy_max, which may not give the right result on every x86 hardware.

Please look at the bug below for a case when the distance between src and dst degrades the 'rep movsb' performance.

https://sourceware.org/bugzilla/show_bug.cgi?id=27130

Commit: https://sourceware.org/git/?p=glibc.git;a=commit;h=3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5

Description:
x86-64: Avoid rep movsb with short distance [BZ #27130]

When copying with "rep movsb", if the distance between source and
destination is N*4GB + [1..63] with N >= 0, performance may be very
slow. This patch updates memmove-vec-unaligned-erms.S for AVX and
AVX512 versions with the distance in RCX:

@arun-chandran-edarath
Copy link
Contributor Author

arun-chandran-edarath commented Dec 12, 2024

Glibc has the below logic depending on the cpu features to set default values for 'rep_movsb_threshold' and decide the window of usage for 'rep movsb'

https://codebrowser.dev/glibc/glibc/sysdeps/x86/dl-cacheinfo.h.html#963
I don't think we need to replicate that in ucx.

@yosefe what do you think? keep the current code as it is, and disable compiling it by default with '#undef ENABLE_BUILTIN_MEMCPY' while configuring ucx?

Note: The current value set for builtin_memcpy_min, in UCX for Intel processors is 1KB, it differs from what glibc uses as min value, it is 2112 for glibc for processors with FSRM feature.

@yosefe
Copy link
Contributor

yosefe commented Dec 12, 2024

Glibc has the below logic depending on the cpu features to set default values for 'rep_movsb_threshold' and decide the window of usage for 'rep movsb'

https://codebrowser.dev/glibc/glibc/sysdeps/x86/dl-cacheinfo.h.html#963 I don't think we need to replicate that in ucx.

@yosefe what do you think? keep the current code as it is, and disable compiling it by default with '#undef ENABLE_BUILTIN_MEMCPY' while configuring ucx?

Note: The current value set for builtin_memcpy_min, in UCX for Intel processors is 1KB, it differs from what glibc uses as min value, it is 2112 for glibc for processors with FSRM feature.

I think that in order to disable by default or remove the existing code we need to ensure it does not make performance worse. Today it is enabled on both Intel and AMD CPUs, so need to check on both I guess. Also, the current code enables inlining the memcpy function in the calling code (though AFAIU some compilers can do it anyway with builtin memcpy)

@arun-chandran-edarath
Copy link
Contributor Author

Glibc has the below logic depending on the cpu features to set default values for 'rep_movsb_threshold' and decide the window of usage for 'rep movsb'
https://codebrowser.dev/glibc/glibc/sysdeps/x86/dl-cacheinfo.h.html#963 I don't think we need to replicate that in ucx.
@yosefe what do you think? keep the current code as it is, and disable compiling it by default with '#undef ENABLE_BUILTIN_MEMCPY' while configuring ucx?
Note: The current value set for builtin_memcpy_min, in UCX for Intel processors is 1KB, it differs from what glibc uses as min value, it is 2112 for glibc for processors with FSRM feature.

I think that in order to disable by default or remove the existing code we need to ensure it does not make performance worse.
Today it is enabled on both Intel and AMD CPUs, so need to check on both I guess. Also, the current code enables inlining the memcpy function in the calling code (though AFAIU some compilers can do it anyway with builtin memcpy)

Currently, it is disabled by default (at runtime) on AMD CPUs in UCX, and I am not sure if any users override this via the UCX command line. Perhaps we could consider a compile-time check and disabling it for AMD CPUs?

From src/ucs/arch/cpu.c:

[UCS_CPU_VENDOR_AMD] = {
    .min = UCS_MEMUNITS_INF,
    .max = UCS_MEMUNITS_INF
},

So I think we need to check performance degradation only on Intel CPUs, if we want to disable/remove it.

Generally, using 'rep movsb' is not recommended on Zen3+ CPUs from AMD due to performance issues observed when 'src' and 'dst' are not aligned (https://sourceware.org/bugzilla/show_bug.cgi?id=30994). However, we see slight performance improvements when | src -dst | = 0 or multiple of 32

@yosefe
Copy link
Contributor

yosefe commented Dec 12, 2024

Currently, it is disabled by default (at runtime) on AMD CPUs in UCX, and I am not sure if any users override this via the UCX command line. Perhaps we could consider a compile-time check and disabling it for AMD CPUs?

I don't think we can can disable it by default at compile time (at least, not yet), but maybe add configure flag to allow custom builds to disable it?

So I think we need to check performance degradation only on Intel CPUs, if we want to disable/remove it.

Right

Generally, using 'rep movsb' is not recommended on Zen3+ CPUs from AMD due to performance issues observed when 'src' and 'dst' are not aligned (https://sourceware.org/bugzilla/show_bug.cgi?id=30994). However, we see slight performance improvements when | src -dst | = 0 or multiple of 32

So do we need to make further adjustments in the code to make it optimal for AMD CPU?

@arun-chandran-edarath
Copy link
Contributor Author

Currently, it is disabled by default (at runtime) on AMD CPUs in UCX, and I am not sure if any users override this via the UCX command line. Perhaps we could consider a compile-time check and disabling it for AMD CPUs?

I don't think we can can disable it by default at compile time (at least, not yet), but maybe add configure flag to allow custom builds to disable it?

I will explore this further

So I think we need to check performance degradation only on Intel CPUs, if we want to disable/remove it.

Right

Generally, using 'rep movsb' is not recommended on Zen3+ CPUs from AMD due to performance issues observed when 'src' and 'dst' are not aligned (https://sourceware.org/bugzilla/show_bug.cgi?id=30994). However, we see slight performance improvements when | src -dst | = 0 or multiple of 32

So do we need to make further adjustments in the code to make it optimal for AMD CPU?

It helps only the cases where |src-dst| & 0x1f == 0, we have seen the performance improvements in raw memcpy benchmarks.

To incorporate this to ucx we need to add additional branches, other unaligned cases have to go through the regular memcpy() path. So we need a very detailed investigation to check the usefulness of it within ucx framework.

Then another problem will arise, whether to use nt_buffer_transfer or 'rep movsb' for a given 'len' ? The 'rep movsb' code path will win the selection as it uses 'len' for comparison and a transfer of length 'total_len' can get divided into length 'len' that can fall between 'builtin_memcpy_min' & builtin_memcpy_max. The code is given below.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants