Harley Seal AVX-512 implementations #138
base: main-dev
This commit adds the optimized Harley Seal kernel from the `WojciechMula/sse-popcount` library to the benchmarking suite to investigate optimization opportunities on Intel Sapphire Rapids and AMD Genoa chips.
@@ -223,6 +221,182 @@ void vdot_f64c_blas(simsimd_f64_t const* a, simsimd_f64_t const* b, simsimd_size

#endif

namespace AVX512_harley_seal {

uint8_t lookup8bit[256] = {
Does it help to add `const`/`constexpr`? I wonder if it would encourage the table to be cached. It might also help to run a loop over it to pre-load it into cache (although I figure prefetching would most likely pull in the whole table on the first access).
In my own past experiments, however, I found the built-in instructions to be faster than LUTs.
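A minimal sketch of the `constexpr` suggestion above, for comparison with a built-in popcount. The names `byte_popcount_table_t`, `lookup8bit_sketch`, `popcount_lut`, and `popcount_builtin` are hypothetical and not part of this PR; the table is generated at compile time rather than spelled out, and `__builtin_popcountll` is assumed to be available only on GCC and Clang.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: fills a 256-entry byte-popcount table at compile time.
struct byte_popcount_table_t {
    std::uint8_t counts[256];
    constexpr byte_popcount_table_t() : counts() {
        for (int i = 0; i < 256; ++i) {
            int bits = 0;
            for (int v = i; v != 0; v >>= 1) bits += v & 1;
            counts[i] = static_cast<std::uint8_t>(bits);
        }
    }
};

// `constexpr` makes the table an immutable compile-time constant.
// It does not, by itself, guarantee that the table stays in cache.
static constexpr byte_popcount_table_t lookup8bit_sketch{};

// Table-driven popcount: one lookup per input byte.
inline std::uint64_t popcount_lut(std::uint8_t const* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += lookup8bit_sketch.counts[data[i]];
    return total;
}

// Built-in popcount: eight bytes per step, with a byte-wise tail.
inline std::uint64_t popcount_builtin(std::uint8_t const* data, std::size_t n) {
    std::uint64_t total = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, sizeof(word)); // avoids unaligned access
#if defined(__GNUC__) || defined(__clang__)
        total += static_cast<std::uint64_t>(__builtin_popcountll(word));
#else
        for (; word != 0; word &= word - 1) ++total; // clear lowest set bit
#endif
    }
    for (; i < n; ++i) total += lookup8bit_sketch.counts[data[i]];
    return total;
}
```

Both functions should return the same count for any buffer, so timing them side by side on the target machine is one way to check whether the LUT or the built-in wins there.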
Given the size of the inputs, the tail will never be evaluated separately. I've just copied that part of the code for completeness.
uint64_t lower_qword(const __m128i v) { return _mm_cvtsi128_si64(v); }

uint64_t higher_qword(const __m128i v) { return lower_qword(_mm_srli_si128(v, 8)); }

uint64_t simd_sum_epu64(const __m128i v) { return lower_qword(v) + higher_qword(v); }

uint64_t simd_sum_epu64(const __m256i v) {
I think modern compilers might do this without being asked in some cases, but using `inline` might encourage it (and could help with these small functions).
The changes I've suggested so far are just low-hanging fruit, though. Have you used profiling tools to find which lines of code each approach spends the most time in?
Most time is spent in the main loop computing CSAs. Sadly, I can't access hardware performance counters on those machines.
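For readers unfamiliar with the CSA step mentioned above, here is a scalar sketch of a carry-save adder and a Harley-Seal popcount built on it. It is not the kernel from this PR: it works on plain 64-bit words with blocks of 8 (the vectorized variants are usually written with wider registers and larger blocks), and the names `popcount64`, `csa`, and `popcount_harley_seal` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Portable SWAR popcount of one 64-bit word, a stand-in for a hardware instruction.
inline std::uint64_t popcount64(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;
    return (x * 0x0101010101010101ull) >> 56;
}

// Carry-save adder: at every bit position, a + b + c == 2 * high + low.
inline void csa(std::uint64_t& high, std::uint64_t& low,
                std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    std::uint64_t const u = a ^ b;
    high = (a & b) | (u & c);
    low = u ^ c;
}

// Harley-Seal popcount: CSAs compress 8 words so that only one full
// popcount (on `eights`) is needed per block instead of eight.
inline std::uint64_t popcount_harley_seal(std::uint64_t const* d, std::size_t n) {
    std::uint64_t total = 0, ones = 0, twos = 0, fours = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t twos_a, twos_b, fours_a, fours_b, eights;
        csa(twos_a, ones, ones, d[i + 0], d[i + 1]);
        csa(twos_b, ones, ones, d[i + 2], d[i + 3]);
        csa(fours_a, twos, twos, twos_a, twos_b);
        csa(twos_a, ones, ones, d[i + 4], d[i + 5]);
        csa(twos_b, ones, ones, d[i + 6], d[i + 7]);
        csa(fours_b, twos, twos, twos_a, twos_b);
        csa(eights, fours, fours, fours_a, fours_b);
        total += popcount64(eights);
    }
    // Each bit in `eights` stood for 8 input bits; fold in the leftover accumulators.
    total = 8 * total + 4 * popcount64(fours) + 2 * popcount64(twos) + popcount64(ones);
    for (; i < n; ++i) total += popcount64(d[i]); // tail shorter than one block
    return total;
}
```

The CSA chain is a sequence of dependent bitwise operations, which is consistent with the observation that the main loop dominates the runtime once the per-block popcount has been amortized.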
Force-pushed from b0bc0da to b816617.
I'm interested in experimenting with this, but I don't have a CPU supporting AVX-512. Do you test all these different instruction sets on cloud machines, or do you have many CPUs? 😄 Maybe I could do some comparative experiments emulating with QEMU, but this most likely won't give enough info for fine-tuning.
@Wyctus, QEMU is a nightmare; I recommend avoiding it. I used to have some CPUs, but cloud is the way to go for R&D of such kernels. I recommend r7iz instances for x86 and r8g for Arm on AWS. 2-4 vCPUs should be enough 😉
Thank you, I'll try AWS! 🙂 You are right, I messed with QEMU for a few hours, and it has already made me sick... The reason I picked this issue is that I used to work on popcount stuff in the past, so I'm planning to dig up what I did and see if it's competitive enough; I don't remember. But if I have time, I'll try to look into the other mentioned issues as well!
Hi @Wyctus! Any luck with this?
Force-pushed from 54ae495 to fb1e864.
Force-pushed from 679a813 to 252fba7.
More context for this.
Optimizing for Genoa and Turin, we may want to combine the first and second approaches.
Force-pushed from 48ac9e4 to 5d9a219.
More context. We can use the lookup table with
Binary representations are becoming increasingly popular in Machine Learning, and I'd love to explore the opportunity for faster Hamming and Jaccard distance calculations. I've looked into several benchmarks, most importantly the `WojciechMula/sse-popcount` library, which compares several optimizations for population counts, the most expensive part of the Hamming/Jaccard kernel.
Extensive benchmarks and the design itself suggest that the AVX-512 Harley Seal variant should be the fastest on long inputs beyond 1 KB. Here is a sample of the most recent results obtained on an i3 Cannonlake Intel CPU:
I've tried copying the best solution into the SimSIMD benchmarking suite and sadly didn't achieve similar improvements on more recent CPUs. On Intel Sapphire Rapids CPUs:
On AMD Genoa:
`_mm_popcnt_u64`.
`_mm512_popcnt_epi64`.
`icehs` is an adaptation of the Harley Seal transform that "zip"-s two input streams with `xor`.
To reproduce the results:
Please let me know if there is a better way to accelerate this kernel 🤗