Allow FP16 accumulation with --fast #6453

Open
wants to merge 1 commit into master
Conversation

@catboxanon (Contributor) commented Jan 13, 2025

This PR enables a new PyTorch flag, currently only available in nightly releases as of 2025-01-12, that turns on FP16 accumulation in matmul ops for NVIDIA GPUs. On my system with an RTX 3090, running the default provided image generation workflow at a resolution of 1024x1024, this gives a bump from 4 it/s to 5 it/s.

I've opted to enable this only when --fast is used, since it seems to be "potentially quality deteriorating", as the --fast arg's description suggests.

Note that the performance improvement only really applies to the RTX 3090 and newer GPUs (e.g. the 4000 series); older cards will likely see no improvement.


For reference:
pytorch/pytorch#144441
https://docs-preview.pytorch.org/pytorch/pytorch/144441/notes/cuda.html#full-fp16-accmumulation-in-fp16-gemms
pytorch/pytorch#144441 (comment)

3090/4090 users would also likely benefit from FP16 accumulation inside attention and convolution operations. With this, perhaps the 4090 could approximate the speed of an A100.

For future work it may also be worth looking at this (to retain the performance improvement without the quality deterioration):
pytorch/pytorch#144441 (comment)

Currently this only applies to PyTorch nightly releases (>=20250112).
@liesened

PyTorch version checks are probably needed so this patch doesn't break old PyTorch installations?
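A minimal sketch of what such a guard could look like, probing for the attribute instead of comparing version strings (the helper name here is just for illustration, not part of the PR):

```python
import torch

def try_enable_fp16_accumulation() -> bool:
    # allow_fp16_accumulation only exists in sufficiently new PyTorch builds
    # (nightlies >= 20250112 at the time of writing), so probe for the
    # attribute rather than assuming it is there.
    matmul = torch.backends.cuda.matmul
    if hasattr(matmul, "allow_fp16_accumulation"):
        matmul.allow_fp16_accumulation = True
        return True
    return False
```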

@ao899 commented Jan 13, 2025

pytorch version: 2.7.0.dev20250112+cu126
I'm not entirely confident that it's working properly, but I measured a 12.09% speed improvement.
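For a quick sanity check outside of a full workflow, a rough matmul-only timing sketch (sizes and iteration counts are arbitrary, and it assumes a build where the flag exists):

```python
import time
import torch

def bench_fp16_matmul(n: int = 8192, iters: int = 50) -> float:
    # Time n x n FP16 matmuls on the current CUDA device; not a rigorous benchmark.
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return time.perf_counter() - start

baseline = bench_fp16_matmul()
torch.backends.cuda.matmul.allow_fp16_accumulation = True
fast = bench_fp16_matmul()
print(f"improvement: {(baseline / fast - 1) * 100:.1f}%")
```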

@Panchovix

I have applied this on reforge and the speed bump seems to vary with batch size.

On my 3090/4090, with batch size 1 I see about a 10-15% improvement, and then 25-33% at batch size 2-3 or more.

I guess with single images you can hit a RAM/CPU bottleneck.

try:
    if is_nvidia() and args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
except:
    pass
Contributor

You can catch AttributeError here; it doesn't need to be a wildcard except.
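For example, keeping the rest of the hunk as-is (is_nvidia() and args come from the surrounding module):

```python
try:
    if is_nvidia() and args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
except AttributeError:
    # Older PyTorch builds don't expose allow_fp16_accumulation yet.
    pass
```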

@catboxanon (Contributor, Author) commented Jan 14, 2025

I was just following the style of the other code in this file (which could use this type of improvement too...)

Contributor

Definitely. For whatever my opinion is worth (which probably isn't much!), I think it would be good if new changes avoided the "code smells" that linters will complain about. It seems like the ComfyUI codebase is slowly trying to modernize by adding type annotations and such.

@Dahvikiin

Isn't the --fast argument only for Ampere+? The PyTorch documentation ("Full FP16 Accumulation in FP16 GEMMs") says it works on Turing and Volta too (compute capability 7.0+).
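A sketch of what a capability-based gate could look like, taking the docs' 7.0+ requirement at face value (purely illustrative, not what the PR does):

```python
import torch

# Volta is compute capability 7.0, Turing 7.5, Ampere 8.0/8.6, Ada 8.9.
major, minor = torch.cuda.get_device_capability()
supports_fp16_accumulation = (major, minor) >= (7, 0)
```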

@comfyanonymous (Owner)

Looks like the change got reverted in PyTorch, so I'll merge this once they add it back.

@Panchovix commented Jan 15, 2025

It got merged again: pytorch/pytorch@a6763b7

So this PR will work if building from source now, or with the nightly build in the next 1-2 days.

EDIT: It got reverted again.

@ao899 commented Jan 23, 2025

pytorch/pytorch@de945d7
