[ARM CPU] hgemm optimized for gqa #23107
base: main
Conversation
Force-pushed from 9557efc to 1214402.
You can commit the suggested changes from lintrunner.
size_t k = CountK;
constexpr size_t step = 8 * 16;  // pack 8 * 16
for (; k >= 8; k -= 8, b += 8, PackedB_data += step) {
    float16x8_t v0 = MlasLoadFloat16x8(b);
Should #pragma unroll be used with this loop?
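For reference, a sketch of what that suggestion might look like applied to the loop above. This is illustrative only: the pragma spelling is compiler-dependent (Clang accepts #pragma unroll, GCC uses #pragma GCC unroll N), it is only a hint for a runtime trip count, and whether it actually helps would need benchmarking. The loop body is abbreviated.

```cpp
size_t k = CountK;
constexpr size_t step = 8 * 16;  // pack an 8 x 16 fp16 block per iteration
#pragma unroll                   // hint only; compilers may ignore it for runtime trip counts
for (; k >= 8; k -= 8, b += 8, PackedB_data += step) {
    float16x8_t v0 = MlasLoadFloat16x8(b);
    // ... remaining loads/stores for the 16-column block ...
}
```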
Description
Add fp16 kernels for GQA matmul on ARM CPU.
The kernels implement MLAS hgemm for C = alpha * A * B' + beta * C.
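For clarity, a naive reference of the computation the kernels implement. This is a minimal sketch in plain C++, not the MLAS API: the function name, parameters, and use of _Float16 (a GCC/Clang extension available on AArch64) are illustrative assumptions; accumulation is done in float for readability.

```cpp
#include <cstddef>

// Reference for C = alpha * A * B' + beta * C, where A is M x K,
// B is N x K (so B' is K x N), and C is M x N. Hypothetical signature,
// not the MLAS hgemm interface.
void hgemm_reference(size_t M, size_t N, size_t K,
                     float alpha, const _Float16* A, size_t lda,
                     const _Float16* B, size_t ldb,
                     float beta, _Float16* C, size_t ldc) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                // B'(k, n) == B(n, k) because B is stored untransposed (N x K).
                acc += static_cast<float>(A[m * lda + k]) *
                       static_cast<float>(B[n * ldb + k]);
            }
            C[m * ldc + n] = static_cast<_Float16>(
                alpha * acc + beta * static_cast<float>(C[m * ldc + n]));
        }
    }
}
```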
Motivation and Context
Add fp16 support for GQA to speed up the operator and reduce memory usage.
Token Generation
Prompting