
[ARM CPU] hgemm optimized for gqa #23107

Open · wants to merge 22 commits into base: main
Conversation

fajin-corp
Contributor

@fajin-corp fajin-corp commented Dec 14, 2024

Description

Adds fp16 kernels for the GQA matmul on ARM CPU.
The kernels implement MLAS HGEMM, computing C = alpha * A * B' + beta * C.
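For reference, the operation the kernels compute can be sketched as a naive scalar GEMM with B accessed transposed. This is only an illustration of the math, not the optimized MLAS kernel; `ReferenceHgemm` is a hypothetical name, and `float` stands in for fp16 for clarity:

```cpp
#include <cstddef>
#include <vector>

// Naive reference for C = alpha * A * B' + beta * C.
// A is M x K (row-major), B is N x K (row-major), so B' is K x N.
// The real MLAS HGEMM operates on fp16; float is used here for clarity.
void ReferenceHgemm(size_t M, size_t N, size_t K,
                    float alpha, const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (size_t k = 0; k < K; ++k) {
        acc += A[m * K + k] * B[n * K + k];  // B read transposed
      }
      C[m * N + n] = alpha * acc + beta * C[m * N + n];
    }
  }
}
```

With B the identity, alpha = 1, and beta = 0, the result is simply a copy of A, which makes the convention easy to sanity-check.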

Motivation and Context

Adds fp16 support for GQA, speeding up the operator and reducing memory usage.

Token Generation

| Shape | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
| --- | --- | --- | --- |
| M:1/N:4096/K:4096 | 251551 | 1775905 | 85.84 |
| M:1/N:11008/K:4096 | 892507 | 4649145 | 80.80 |
| M:1/N:4096/K:11008 | 866860 | 3240015 | 73.25 |
| M:1/N:11008/K:11008 | 2631615 | 8783877 | 70.04 |

Prompting

| Shape | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
| --- | --- | --- | --- |
| M:1024/N:4096/K:4096 | 90508701 | 111283029 | 18.67 |
| M:2048/N:4096/K:4096 | 181307522 | 240211107 | 24.52 |
| M:1024/N:11008/K:4096 | 241120234 | 307707933 | 21.64 |
| M:2048/N:11008/K:4096 | 481091232 | 648921367 | 25.86 |
| M:1024/N:4096/K:11008 | 241736343 | 310129880 | 22.05 |
| M:2048/N:4096/K:11008 | 480456703 | 644814999 | 25.49 |
| M:1024/N:11008/K:11008 | 642121440 | 847925766 | 24.27 |
| M:2048/N:11008/K:11008 | 1276097154 | 1731314509 | 26.29 |

Contributor

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h
@fajin-corp fajin-corp marked this pull request as ready for review January 23, 2025 23:06
@fajin-corp fajin-corp requested a review from a team as a code owner January 23, 2025 23:06
Contributor

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

onnxruntime/test/mlas/bench/bench_hgemm.cpp
```cpp
size_t k = CountK;
constexpr size_t step = 8 * 16;  // pack 8 * 16
for (; k >= 8; k -= 8, b += 8, PackedB_data += step) {
  float16x8_t v0 = MlasLoadFloat16x8(b);
```
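The excerpt above advances `PackedB_data` by `step = 8 * 16` per iteration, i.e. one 8 x 16 tile of B per 8 rows of K. A plain scalar sketch of that kind of tile packing, under the assumption of an 8-row by 16-column, row-major tile layout, might look like the following (`PackPanel8x16` is a hypothetical helper, and `float` stands in for the fp16 NEON loads in the real code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar sketch: pack a CountK x 16 panel of B into
// contiguous 8 x 16 tiles, mirroring step = 8 * 16 in the excerpt.
// The panel is addressed as b[k * ldb + n] for n in [0, 16).
std::vector<float> PackPanel8x16(const float* b, size_t CountK, size_t ldb) {
  constexpr size_t kTile = 8, nTile = 16;
  std::vector<float> packed;
  packed.reserve(CountK * nTile);
  size_t k = CountK;
  for (; k >= kTile; k -= kTile, b += kTile * ldb) {
    for (size_t kk = 0; kk < kTile; ++kk) {
      for (size_t n = 0; n < nTile; ++n) {
        packed.push_back(b[kk * ldb + n]);  // emit one 8 x 16 tile row-major
      }
    }
  }
  return packed;
}
```

Packing into fixed-size contiguous tiles like this keeps the inner GEMM loop streaming through memory sequentially, which is the usual reason MLAS pre-packs B.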
Should `#pragma unroll` be used with this loop?
