Couple of FA optimizations #608

Merged
merged 11 commits into from
Jul 19, 2024

vgokhale
Collaborator

Made the softmax (SM) scale multiplication use a constexpr. Minor asm improvement.

Changed the acc scaling that applies the softmax normalization from a division to a multiplication by the reciprocal. ~10% perf improvement.
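For readers unfamiliar with the second change, here is a minimal sketch of the idea in plain Python. It is not the Triton kernel from this PR: the function names (`attend_with_division`, `attend_with_reciprocal`, `_unnormalized_row`) and the single-query-row setup are assumptions made for illustration only. It shows that dividing every accumulator element by the softmax denominator `l` can be replaced by one reciprocal per row followed by multiplications, with the same result up to floating-point rounding. It does not reproduce the reported speedup.

```python
import math


def _unnormalized_row(q, keys, values, sm_scale):
    """Return (acc, l): the unnormalized output row and the softmax denominator."""
    scores = [sm_scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the row max for numerical stability
    p = [math.exp(s - m) for s in scores]
    l = sum(p)
    head_dim = len(values[0])
    acc = [sum(p_j * v[d] for p_j, v in zip(p, values)) for d in range(head_dim)]
    return acc, l


def attend_with_division(q, keys, values, sm_scale):
    """Normalize by dividing: one division per output element."""
    acc, l = _unnormalized_row(q, keys, values, sm_scale)
    return [a / l for a in acc]


def attend_with_reciprocal(q, keys, values, sm_scale):
    """Normalize by reciprocal: one division per row, then multiplications."""
    acc, l = _unnormalized_row(q, keys, values, sm_scale)
    l_recip = 1.0 / l
    return [a * l_recip for a in acc]


if __name__ == "__main__":
    # Hypothetical toy inputs: one query, three keys/values, head dimension 2.
    q = [0.5, -1.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    sm_scale = 1.0 / math.sqrt(len(q))
    print(attend_with_division(q, keys, values, sm_scale))
    print(attend_with_reciprocal(q, keys, values, sm_scale))
```

The two variants can differ in the last bit of each element, since `a * (1 / l)` rounds twice where `a / l` rounds once, so comparisons should use a tolerance rather than exact equality.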

micmelesse and others added 5 commits June 19, 2024 08:21
Add Perf Kernels

This is a combination of 2 commits.

Add Perf Kernels

Add Perf Kernels

This is a combination of 6 commits.

add perf-kernels

fix formatting issues

fix unused variables and other bugs

fix other issues

remove scripts

save

check changes

format

save

save

try

pre-commit check

save
Change all block pointers to tensor pointers

Block pointers are for nvidia TMAs. They are useful for regular loads as well but not well supported.

Also cleaned up some code I came across along the way and updated comment at the top.
Add support for layouts commonly used by users.

Add option for varlen / thd layout to specify equal context lengths for all batches. Also often used by users.
Set SM scale multiplication to a constexpr. Minor asm improvement.

Changed acc scaling to adjust for softmax division to
multiplication with reciprocal. ~10% perf improvement.
@vgokhale vgokhale requested a review from micmelesse June 27, 2024 19:33
@vgokhale vgokhale self-assigned this Jun 27, 2024
@vgokhale vgokhale merged commit df4c4d3 into main_perf Jul 19, 2024
4 checks passed
@vgokhale vgokhale deleted the fa_optims branch July 19, 2024 22:50
micmelesse added a commit that referenced this pull request Oct 28, 2024
Couple of FA optimizations

Set SM scale multiplication to a constexpr. Minor asm improvement.

Changed acc scaling to adjust for softmax division to
multiplication with reciprocal. ~10% perf improvement.

---------

Co-authored-by: Michael Melesse <[email protected]>
3 participants