Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.
Let's say our hardware supports fast 16 by 16 matrix multiplications with a single instruction. We can reshape the input array of length $N$ as a matrix of $16$ rows and $N/16$ columns, and use a tiled matrix-multiplication instruction sliding through that wide matrix, multiplying it by a $16$-element vector of ones, and accumulating into $16$ other floats.
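A minimal scalar sketch of the idea, assuming $N$ is a multiple of 16; the inner loops stand in for the hardware tile instruction, and the function name is just illustrative, not part of SimSIMD's API:

```c
#include <stddef.h>

float reduce_as_matvec(float const *data, size_t n) {
    float accumulators[16] = {0};
    size_t const columns = n / 16;
    /* Stage 1: view `data` as a 16 x (n/16) matrix and multiply it by a
     * vector of ones, i.e. accumulate each row into its own partial sum.
     * On real hardware each column step would be one tiled instruction. */
    for (size_t col = 0; col != columns; ++col)
        for (size_t row = 0; row != 16; ++row)
            accumulators[row] += data[col * 16 + row];
    /* Stage 2: fold the 16 partial sums into the final scalar. */
    float sum = 0;
    for (size_t row = 0; row != 16; ++row) sum += accumulators[row];
    return sum;
}
```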
In reality, we can't use Intel AMX with `float32` inputs, but we can use Arm SME, and later apply similar techniques to SimSIMD.