Switch dynamic FP8 grouped gemm to accept tensor inputs #3552
Summary:
As we continue the long march towards optimal MoE performance, we've identified that in prefill, having to artificially split inputs and then check that each group is valid introduces non-trivial overhead. Since all the inputs must be in consecutive memory anyway, it's better to require them to be contiguous tensors rather than TensorLists. A rough sketch of what that interface change looks like is shown below.
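The following sketch illustrates the shape of the change using hypothetical op names, argument names, and tensor layouts (the actual FBGEMM signatures may differ): per-group TensorLists are replaced by single contiguous tensors plus per-group sizes, so the caller no longer splits inputs and the op no longer validates each group on the hot path.

```cpp
// Hypothetical interface sketch; names and layouts are illustrative,
// not the actual FBGEMM op signatures.
#include <ATen/ATen.h>

// Before: one tensor per group, which forces the caller to split inputs
// and the op to check every group for validity.
at::Tensor f8f8bf16_grouped_list(
    at::TensorList XQ,        // G tensors of FP8 activations
    at::TensorList WQ,        // G tensors of FP8 weights
    at::TensorList x_scale,   // G tensors of rowwise activation scales
    at::TensorList w_scale);  // G tensors of rowwise weight scales

// After: single contiguous tensors; group boundaries come from sizes.
at::Tensor f8f8bf16_grouped_dynamic(
    at::Tensor XQ,       // [total_M, K] stacked FP8 activations
    at::Tensor WQ,       // [G, N, K] FP8 weights
    at::Tensor x_scale,  // [total_M] rowwise activation scales
    at::Tensor w_scale,  // [G, N] rowwise weight scales
    at::Tensor m_sizes); // [G] rows per group
```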
While this change may sound simple, it required switching the kernels to a templated implementation. Although that makes it a large refactor, templating is the cleanest way to support various input and output types with shared kernels.
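To make the templating idea concrete, here is a minimal sketch assuming a naive row-per-block grouped GEMM: one kernel body parameterized over the element types, with a thin host-side dispatcher picking the instantiation. The kernel and dispatcher names, the rowwise-scale layout, and the `group_start` offset convention are all illustrative assumptions, not the actual FBGEMM kernels.

```cuda
// Minimal sketch of a shared, type-templated grouped GEMM kernel.
// All names and layouts here are hypothetical.
#include <cstdint>
#include <cuda_fp8.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// One shared kernel body, parameterized over element types so the same
// code serves FP8 inputs with either BF16 or FP16 outputs.
template <typename InputType, typename OutputType>
__global__ void grouped_gemm_naive_kernel(
    const InputType* XQ,         // [total_M, K] activations, groups stacked
    const InputType* WQ,         // [G, N, K] weights
    const float* x_scale,        // [total_M] rowwise activation scales
    const float* w_scale,        // [G * N] rowwise weight scales
    OutputType* out,             // [total_M, N] outputs
    const int64_t* group_start,  // [G + 1] row offsets into XQ/out
    int64_t K,
    int64_t N) {
  const int g = blockIdx.y;                        // group index
  const int64_t row0 = group_start[g];
  const int64_t rows = group_start[g + 1] - row0;  // rows in this group
  const int64_t m = blockIdx.x;                    // row within the group
  if (m >= rows) {
    return;  // grid is sized for the largest group; skip padding rows
  }
  for (int64_t n = threadIdx.x; n < N; n += blockDim.x) {
    float acc = 0.0f;
    const InputType* x_row = XQ + (row0 + m) * K;
    const InputType* w_row = WQ + (static_cast<int64_t>(g) * N + n) * K;
    for (int64_t k = 0; k < K; ++k) {
      acc += static_cast<float>(x_row[k]) * static_cast<float>(w_row[k]);
    }
    acc *= x_scale[row0 + m] * w_scale[static_cast<int64_t>(g) * N + n];
    out[(row0 + m) * N + n] = static_cast<OutputType>(acc);
  }
}

// Thin dispatcher: the output dtype selects the instantiation; everything
// else is shared between the two cases.
void launch_grouped_gemm(
    const __nv_fp8_e4m3* XQ,
    const __nv_fp8_e4m3* WQ,
    const float* x_scale,
    const float* w_scale,
    void* out,
    const int64_t* group_start,
    int num_groups,
    int64_t max_rows_per_group,
    int64_t K,
    int64_t N,
    bool bf16_out,
    cudaStream_t stream) {
  const dim3 grid(static_cast<unsigned>(max_rows_per_group), num_groups);
  const dim3 block(128);
  if (bf16_out) {
    grouped_gemm_naive_kernel<__nv_fp8_e4m3, __nv_bfloat16>
        <<<grid, block, 0, stream>>>(
            XQ, WQ, x_scale, w_scale,
            static_cast<__nv_bfloat16*>(out), group_start, K, N);
  } else {
    grouped_gemm_naive_kernel<__nv_fp8_e4m3, __half>
        <<<grid, block, 0, stream>>>(
            XQ, WQ, x_scale, w_scale,
            static_cast<__half*>(out), group_start, K, N);
  }
}
```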
I also removed some of the now-outdated fbgemm profiling scripts. They likely aren't useful going forward anyway.
Differential Revision: D67881909