Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.
Let's say our hardware supports fast 16 by 16 matrix multiplications with a single instruction. We can reshape the input array of length $N$ as a matrix of $16$ rows and $N/16$ columns, and use a tiled matrix-multiplication instruction sliding through that wide matrix, multiplying it by a $16$-element vector of ones, and accumulating into $16$ other floats.
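A minimal scalar sketch of the idea, assuming $N$ is a multiple of 16; the inner loops stand in for the hardware tile instruction, and the function name is just illustrative, not part of SimSIMD's API:

```c
#include <stddef.h>

float reduce_as_matvec(float const *data, size_t n) {
    float accumulators[16] = {0};
    size_t const columns = n / 16;
    /* Stage 1: view `data` as a 16 x (n/16) matrix and multiply it by a
     * vector of ones, i.e. accumulate each row into its own partial sum.
     * On real hardware each column step would be one tiled instruction. */
    for (size_t col = 0; col != columns; ++col)
        for (size_t row = 0; row != 16; ++row)
            accumulators[row] += data[col * 16 + row];
    /* Stage 2: fold the 16 partial sums into the final scalar. */
    float sum = 0;
    for (size_t row = 0; row != 16; ++row) sum += accumulators[row];
    return sum;
}
```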
In reality, we can't use Intel AMX with `float32` inputs, but we can use Arm SME, and later apply similar techniques to SimSIMD.