Tianxing/moe gemm #685
base: main_perf
Conversation
Force-pushed from cd06c06 to b9a9504.
The GEMM supports routed weights, and so does the benchmark. You can tune the GEMM with the benchmark option -tune.
Force-pushed from 3d044a2 to 5adb971.
@@ -0,0 +1,211 @@
from typing import TypedDict, List, Optional
Hmm I think I should have clarified this a bit more - my bad. I don't think we want to take the benchmarking stuff. We want this to remain with autotune like the other kernels here.
We do eventually want to go down the route of tuning like this, but not this comprehensively.
Is it possible to revert these and have just the moe kernel and the friends it needs to set up the inputs and autotune?
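For reference, a minimal sketch of the autotune-style approach used by the other perf-kernels, assuming an illustrative config list and key (none of the values or parameter names below are from this PR):

```python
import triton
import triton.language as tl


# Minimal sketch: let triton.autotune pick block sizes at launch time instead
# of reading them from a pre-tuned config file. Configs and key are
# illustrative placeholders, not tuned values.
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE_M': 16, 'BLOCK_SIZE_N': 32, 'BLOCK_SIZE_K': 64}, num_warps=4),
        triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32}, num_warps=8),
    ],
    key=['EM', 'N', 'K'],  # re-tune when the padded token count or weight shape changes
)
@triton.jit
def moe_gemm_kernel(a_ptr, b_ptr, c_ptr, EM, N, K,
                    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                    BLOCK_SIZE_K: tl.constexpr):
    # ...kernel body unchanged; block sizes now arrive from the autotuner...
    pass
```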
{
    "small_M": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 16,
Was this the best config it picked?
import argparse
import sys

M_THRESHOLD = 1024
I think the threshold should be much smaller. Like 128.
I will change it and check for the best config.
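As a point of comparison, a hedged sketch of what the threshold-based selection might look like with the smaller cutoff (the 128 value, the bucket names, and the config contents are placeholders pending re-tuning):

```python
# Illustrative only: choose a tuned config bucket from the token count M.
# The cutoff and block sizes are placeholders until re-tuning confirms them.
M_THRESHOLD = 128

TUNED_CONFIGS = {
    "small_M": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 16},
    "large_M": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64},
}


def get_config(M: int) -> dict:
    # Few tokens favour small tiles; large batches amortize bigger tiles.
    return TUNED_CONFIGS["small_M"] if M < M_THRESHOLD else TUNED_CONFIGS["large_M"]
```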
return args


arg_to_torch_dtype = {'fp16': torch.float16, 'bf16': torch.bfloat16, 'fp32': torch.float32}
Can we also add int8/fp8 similar to FA? Maybe in a follow-up PR?
Yes, there is a draft PR #693. I will open it once this is merged.
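For the follow-up, the dtype map could grow along these lines; the exact fp8 dtype attribute is an assumption and differs between PyTorch/ROCm builds:

```python
import torch

# Sketch for the follow-up PR. The fp8 entry is an assumption: some ROCm
# builds expose torch.float8_e4m3fnuz, others torch.float8_e4m3fn.
arg_to_torch_dtype = {
    'fp16': torch.float16,
    'bf16': torch.bfloat16,
    'fp32': torch.float32,
    'int8': torch.int8,
    'fp8': torch.float8_e4m3fnuz if hasattr(torch, 'float8_e4m3fnuz') else torch.float8_e4m3fn,
}
```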
allow_abbrev=False,
)
parser.add_argument("-M", type=int, default=0, help="M dimension")
parser.add_argument("-K", type=int, default=0, help="K dimension")
Since Juuso checked in his PR for the models, can you also add Mixtral 22B and 7B? You can find their configs online.
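A hedged sketch of what the extra model presets might look like; the Mixtral shapes below are taken from the publicly available configs and should be verified against the checked-in model table before landing:

```python
# Illustrative presets only; verify against the public Mixtral configs.
# K is the hidden size, N the per-expert intermediate size of the first
# GEMM in the MoE block, E the number of experts, top_k the experts per token.
MOE_MODEL_CONFIGS = {
    'mixtral-8x7B':  {'N': 14336, 'K': 4096, 'E': 8, 'top_k': 2},
    'mixtral-8x22B': {'N': 16384, 'K': 6144, 'E': 8, 'top_k': 2},
}


def model_benchmark_shapes(model: str, M: int):
    # Returns the (M, N, K, top_k, E) tuple the benchmark would feed the kernel.
    cfg = MOE_MODEL_CONFIGS[model]
    return M, cfg['N'], cfg['K'], cfg['top_k'], cfg['E']
```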
M, N, K, top_k, E, routed_weight=routed_weight, dtype=dtype)

flops = 2.0 * M * top_k * K * N
if routed_weight:
Maybe add a comment here on why this increases. Also add one below for the bytes calculation, since it is not intuitive.
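Something along these lines could serve as the explanatory comment. This is only a sketch of one plausible flop/byte model, assuming fp16 tensors, fp32 routing weights, and that every expert's weight matrix is read once; the example shapes are illustrative:

```python
# Example shapes (Mixtral-8x7B-like), purely illustrative.
M, N, K, top_k, E, routed_weight = 1024, 14336, 4096, 2, 8, True

# Each output element of the expanded (M * top_k) x N result costs one
# multiply-add against the expert weights, hence 2 * M * top_k * K * N flops.
flops = 2.0 * M * top_k * K * N
if routed_weight:
    # Scaling every output element by its routing weight adds one extra
    # multiply per element.
    flops += M * top_k * N

# One plausible byte model, assuming fp16 (2 bytes/element): each token row
# is gathered once per selected expert, every expert's weights are touched,
# and the expanded output is written back.
bytes_moved = 2 * (M * top_k * K      # A: gathered activations
                   + E * K * N        # B: expert weight matrices
                   + M * top_k * N)   # C: expanded output
if routed_weight:
    # One routing weight per (token, selected expert); fp32 assumed.
    bytes_moved += 4 * M * top_k
```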
Implemented MoE GEMM, test, and benchmarking.

Run
python python/perf-kernels/flash-attention.py -model all
to benchmark with the mistral models.

Run
python python/perf-kernels/flash-attention.py -routed_weight
to benchmark with the routed weight.

The benchmark shows the memory bandwidth.
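For readers of the numbers below, a sketch of how such a bandwidth figure is typically derived from a measured time; the plain matmul stands in for the MoE GEMM launch, and the byte count reuses the model discussed in the review comments:

```python
import torch
from triton.testing import do_bench

# Sketch: convert a measured kernel time into an effective bandwidth figure.
a = torch.randn((1024, 4096), device='cuda', dtype=torch.float16)
b = torch.randn((4096, 14336), device='cuda', dtype=torch.float16)

ms = do_bench(lambda: a @ b)                          # measured time in milliseconds
bytes_moved = 2 * (a.numel() + b.numel() + 1024 * 14336)
gbps = bytes_moved * 1e-9 / (ms * 1e-3)               # bytes / seconds -> GB/s
print(f"{gbps:.1f} GB/s")
```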
benchmark result:
mistral benchmark result:
references:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/fused_moe.py#L261
https://github.com/vllm-project/vllm/blob/main/csrc/moe/moe_align_sum_kernels.cu
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - I have added tests:
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - I have not added any `lit` tests.
  - The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
#435