Using Mat-Mul Instructions, like Arm SME and Intel AMX #2

Open
ashvardanian opened this issue Dec 30, 2024 · 4 comments
Labels
good first issue Good for newcomers

Comments

@ashvardanian
Owner

Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.

Let's say our hardware supports fast $16 \times 16$ matrix multiplications with a single instruction. We can reshape the input array of length $N$ into a matrix of $16$ rows and $N/16$ columns, then slide a tiled matrix-multiplication instruction along that wide matrix, multiplying each tile by a $16$-element vector of ones and accumulating the row sums into $16$ other floats. The second stage then sums those $16$ accumulators into the final scalar.

In reality, we can't use Intel AMX with float32 inputs, but we can use Arm SME, and later apply similar techniques to SimSIMD.
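For reference, the reshaping idea above can be sketched in portable C++ without any matrix hardware. This is only a scalar emulation under stated assumptions: the names `tile_k`, `tile_times_ones`, and `reduce_via_matvec` are hypothetical, the input is viewed as a column-major matrix with $16$ rows, and the inner loop stands in for what a single tile instruction would do. It is not SME or AMX code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical tile edge, standing in for a 16x16 hardware mat-mul tile.
constexpr std::size_t tile_k = 16;

// Scalar stand-in for one tile instruction: multiply a 16x16 tile by a
// vector of ones and add the 16 row sums into the accumulators.
// The tile is column-major, so column `col` occupies 16 contiguous floats.
inline void tile_times_ones(float const *tile, std::array<float, tile_k> &acc) {
    assert(tile != nullptr);
    for (std::size_t col = 0; col != tile_k; ++col)
        for (std::size_t row = 0; row != tile_k; ++row)
            acc[row] += tile[col * tile_k + row] * 1.0f;
}

inline float reduce_via_matvec(std::vector<float> const &input) {
    std::array<float, tile_k> acc{};
    std::size_t const tile_size = tile_k * tile_k;
    std::size_t const full_tiles = input.size() / tile_size;

    // Stage 1: slide the tile along the wide 16 x (N/16) matrix.
    for (std::size_t t = 0; t != full_tiles; ++t)
        tile_times_ones(input.data() + t * tile_size, acc);

    // Stage 2: horizontal sum of the 16 accumulators.
    float sum = std::accumulate(acc.begin(), acc.end(), 0.0f);

    // Elements that do not fill a whole tile are added one by one.
    for (std::size_t i = full_tiles * tile_size; i != input.size(); ++i)
        sum += input[i];
    return sum;
}
```

Calling `reduce_via_matvec` on any `std::vector<float>` returns its sum. The order of additions differs from a sequential loop, so results on non-integer data may differ in the last bits.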

@alexbarev

@ashvardanian I’m picking this up.

@alexbarev

@ashvardanian Apparently, neither AWS Graviton 4 nor GCS analogs support SME, so this task might face significant delays.

Running cat /proc/cpuinfo shows support only for sve and sve2.
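The same check can be done programmatically. Below is a minimal sketch, assuming the arm64 Linux layout of `/proc/cpuinfo`, where each core has a `Features` line listing space-separated flags. The helper name `has_cpu_feature` is hypothetical, and it takes a generic stream so it can be tried on any text.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns true if any "Features" line in the stream lists `feature`
// as a whole whitespace-separated token (so "sve" does not match "sve2").
inline bool has_cpu_feature(std::istream &cpuinfo, std::string const &feature) {
    std::string line;
    while (std::getline(cpuinfo, line)) {
        // Only lines that start with "Features" are of interest.
        if (line.rfind("Features", 0) != 0) continue;
        std::size_t const colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::string token;
        while (tokens >> token)
            if (token == feature) return true;
    }
    return false;
}
```

On a real machine one would open `/proc/cpuinfo` with `std::ifstream` and call `has_cpu_feature(file, "sme")`. On the instances described here that call would be expected to return `false`.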

This can also be verified directly: none of the following instructions work when built with clang 18.1.3:

// Streaming mode
asm("smstart SM"); 
asm("smstop SM");

// ZA storage
asm("smstart ZA");
asm("smstop ZA");

And according to https://arxiv.org/pdf/2409.18779, the Apple M4 is the first chip to support SME.

@ashvardanian
Owner Author

Then we are out of luck for now, @alexbarev. Let's wait for the next-gen CPUs.

ashvardanian added a commit to ashvardanian/less_slow.cpp that referenced this issue Jan 12, 2025
The next steps would include more
AVX-512, AVX2, AMX, and SME on Arm.

ashvardanian/ParallelReductionsBenchmark#2
@ashvardanian
Owner Author

Just wanted to link Linux kernel docs on SME for future use 🤗

@ashvardanian ashvardanian added the good first issue Good for newcomers label Jan 13, 2025