Load mma operands to shared memory with TMA #3320

Merged: 14 commits into main from multi_matmul_tma on Nov 8, 2024

Conversation

rdspring1 (Collaborator) commented Oct 31, 2024

This PR modifies schedulePrologues to use TMA loads to move mma operands to shared memory. Stacked on #3324 and #3310.

Details

  1. Load input operands into shared memory via the CpAsyncBulkTensorTile LoadStoreOp (see the sketch after this list).
  2. Replace the LdMatrix operation with a basic set.
  3. Modify scheduleOperandSmemStores to apply swizzling to avoid bank conflicts.
  4. Refactor swizzleSharedMemory by moving the analysis component into a separate function named analyzeSwizzleSharedMemory.
  5. Create a tmaSwizzleSharedMemory function that uses analyzeSwizzleSharedMemory and then selects the appropriate TMA swizzle format.
  6. Disable loop rotation. There is an issue with TMA loads and circular buffering; it is unclear whether loop rotation is required for Hopper matmul.
  7. Expect Hopper matmul tests to give incorrect results for now.
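
A minimal sketch of the idea behind items 1 and 2, assuming nvFuser's `cacheAfter` and `setMemoryType` APIs; the helper name `scheduleOperandTmaLoad` is hypothetical, and this is not the PR's actual `schedulePrologues` code:

```cpp
// Stage one mma operand through shared memory with a TMA load
// (cp.async.bulk.tensor.tile) instead of LdMatrix.
TensorView* scheduleOperandTmaLoad(TensorView* operand) {
  // The global->shared copy is realized as a CpAsyncBulkTensorTile
  // LoadStoreOp, i.e. a TMA load.
  TensorView* operand_smem =
      operand->cacheAfter(LoadStoreOpType::CpAsyncBulkTensorTile);
  operand_smem->setMemoryType(MemoryType::Shared);
  // Downstream reads of the shared-memory buffer then become a basic
  // set rather than an LdMatrix (item 2 above).
  return operand_smem;
}
```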

rdspring1 force-pushed the hopper_matmul_tests branch from d8bc1a6 to 7c8f375 on November 1, 2024 02:54
rdspring1 marked this pull request as ready for review on November 1, 2024 03:08
rdspring1 changed the title from "Loading operands with TMA" to "Load mma operands to shared memory with TMA" on Nov 1, 2024
rdspring1 force-pushed the multi_matmul_tma branch 2 times, most recently from a04848d to 3002112, on November 2, 2024 00:41
rdspring1 (Collaborator, Author) commented:

!test

jacobhinkle (Collaborator) left a comment:

First pass. Looks good so far. One question: how will we handle partial vectorization?

Review thread on csrc/scheduler/hopper_multi_matmul.cpp (outdated, resolved):
```cpp
  return MmaInputSmemSwizzle::None; // No need to swizzle in this case.
}

// 128B swizzle results in 8 x 8 matrix given half precision inputs.
```
A collaborator commented:

Are we only using 128B swizzle or do we plan to support the smaller swizzles as well?

rdspring1 (Collaborator, Author) replied:

I need to update this comment.
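
For reference, Hopper TMA also supports 32B and 64B swizzle periods in addition to 128B. A minimal sketch of the selection idea behind tmaSwizzleSharedMemory, assuming an `MmaInputSmemSwizzle` enum with `B32`/`B64`/`B128` members; this illustrates the concept and is not the function's actual analysis:

```cpp
// Pick the widest TMA swizzle whose byte period evenly tiles the inner
// dimension of the shared-memory box.
MmaInputSmemSwizzle chooseTmaSwizzle(int64_t inner_dim_bytes) {
  if (inner_dim_bytes % 128 == 0) {
    return MmaInputSmemSwizzle::B128; // 64 half-precision elements per period
  } else if (inner_dim_bytes % 64 == 0) {
    return MmaInputSmemSwizzle::B64;
  } else if (inner_dim_bytes % 32 == 0) {
    return MmaInputSmemSwizzle::B32;
  }
  return MmaInputSmemSwizzle::None; // Cannot swizzle this box shape.
}
```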

csrc/scheduler/hopper_multi_matmul.cpp (resolved)
tests/cpp/test_matmul_scheduler.cpp (resolved)
tests/cpp/test_matmul.cpp (resolved)
```
@@ -1332,6 +1396,8 @@ void HopperMultipleMatmulScheduler::setUpCircularBuffering() {
  }
}

/*
// TODO Investigate. Disable loop rotation with tma circular buffering
```
A collaborator commented:

My first thought was: can we just disable this parameter for loop rotation? But I realized that we do not actually respect the rotate_ldmatrix_out_of_main_loop parameter in MatmulParams. In fact, I just ran `git log --patch csrc/scheduler/ | grep -C20 rotate_` and sifted through the results. I don't think we've ever used that parameter :-D cc @zasdfgbnm.

Base automatically changed from hopper_matmul_tests to main November 3, 2024 17:26
rdspring1 (Collaborator, Author) replied:

> how will we handle partial vectorization?

Do you mean when the tensor is not 16B aligned? You can overcopy with TMA, cp.async, or regular LDG + STS.

jacobhinkle (Collaborator) commented Nov 5, 2024:

> > how will we handle partial vectorization?
>
> Do you mean when the tensor is not 16B aligned? You can overcopy with TMA, cp.async, or regular LDG + STS.

Yeah, exactly. So if we had K=60 as the inner dimension of each of the operands, the Ampere scheduler would need to handle them differently when we generate the kernel, since we can only do 4-element reads for the cp.async call instead of 8-element reads. But I don't see where that kind of alignment analysis comes in when using TMA; will TMA handle misaligned boxes dynamically using the same compiled kernel as for fully-aligned inputs?

EDIT: is this computed on the host side in the TMA descriptor?

rdspring1 (Collaborator, Author) commented Nov 5, 2024:

TMA should automatically handle the case when K=60 by filling the out-of-bounds accesses.
If the tensor is not 16B aligned, TMA will fail and you need to use regular LDG + STS accesses.
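
A minimal sketch of the host-side part of this, using the CUDA driver API's `cuTensorMapEncodeTiled`; the tile shape, data type, and swizzle choice below are illustrative, not what nvFuser actually emits:

```cpp
#include <cuda.h>
#include <cstdint>

// Encode a 2D TMA descriptor for a row-major [M, K] half-precision operand.
// The global shape is baked into the descriptor, so the hardware clamps each
// box against it at run time and zero-fills the out-of-bounds remainder
// (e.g. columns 60..63 of a 64-wide box when K = 60). The base address must
// be 16B aligned and strides must be multiples of 16 bytes; otherwise
// encoding fails and a fallback LDG + STS path is needed.
CUresult encodeOperandTmaMap(
    CUtensorMap* tmap, void* gmem_base, uint64_t M, uint64_t K) {
  const cuuint64_t global_dim[2] = {K, M};      // innermost dimension first
  const cuuint64_t global_strides[1] = {K * 2}; // row stride in bytes (fp16)
  const cuuint32_t box_dim[2] = {64, 64};       // one smem tile per load
  const cuuint32_t elem_strides[2] = {1, 1};    // dense boxes
  return cuTensorMapEncodeTiled(
      tmap,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
      /*tensorRank=*/2,
      gmem_base,
      global_dim,
      global_strides,
      box_dim,
      elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_128B,
      CU_TENSOR_MAP_L2_PROMOTION_NONE,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE); // OOB elements are zero-filled
}
```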

jacobhinkle (Collaborator) left a comment:

LGTM

rdspring1 (Collaborator, Author) commented:

!test

jacobhinkle (Collaborator) commented:

Looks like you just need to guard AmpereMatmulBroadcastBatch. I noticed I needed this in #3278, but I was too lazy to merge that upstream to this PR for you: https://github.com/NVIDIA/Fuser/pull/3278/files#diff-64fc4e7bfbc5b9f95ac3dc5823bd99b683b048926805c13310ce6a8ef8032289R147-R148
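
A sketch of what that guard could look like, assuming the `NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD` macro from the test utilities; the exact guard added in #3278 may differ:

```cpp
TEST_F(MatmulTest, AmpereMatmulBroadcastBatch) {
  // Skip on devices outside [sm_80, sm_90): the Ampere-specific
  // scheduling in this test is invalid elsewhere (e.g. on Hopper).
  NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(8, 0, 9, 0);
  // ... existing test body ...
}
```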

rdspring1 (Collaborator, Author) commented:

!test

rdspring1 merged commit 114e9a1 into main on Nov 8, 2024 (47 checks passed)
rdspring1 deleted the multi_matmul_tma branch on November 8, 2024 15:33
jacobhinkle added a commit that referenced this pull request Nov 13, 2024
Stacked on #3320

This PR:
* Schedules the MMA instruction result for the HopperMultiMatmulScheduler.
* Removes some unused methods that are no longer necessary.
* Checks that there is "no prologue"; specifically, that we have `gmem -LoadStoreOp-> smem -MmaOp->`. This currently cannot be done unless we create the MmaOp at definition using `fusedMultiplySum` (see #1628 and the sketch below).
* Checks that the MmaOp output has logical order MNK. If not, a root->logical reorder should have been created at definition (maybe this should be made easier as an option in `fusedMultiplySum`).

This PR does not schedule split-K or TMA stores of the output.
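
A minimal sketch of creating the MmaOp at definition via `fusedMultiplySum`, assuming the test helpers `makeContigTensor` and `broadcast`; the shapes are illustrative:

```cpp
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* a = makeContigTensor(2, DataType::Half); // [M, K]
TensorView* b = makeContigTensor(2, DataType::Half); // [N, K]
fusion.addInput(a);
fusion.addInput(b);

// Broadcast to [M, 1, K] and [1, N, K] so M, N, K line up.
TensorView* a_b = broadcast(a, {false, true, false});
TensorView* b_b = broadcast(b, {true, false, false});

// Fused multiply + sum over K (axis 2) creates a single MmaOp whose
// output has logical order MNK, satisfying the checks above.
TensorView* mma_result = fusedMultiplySum(a_b, b_b, {2});
fusion.addOutput(mma_result);
```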

---------

Co-authored-by: Ryan Spring <[email protected]>