
Native FP8 Support Training For 40XX and the Future 50XX Cards #43

Open

Nottlespike opened this issue Dec 29, 2024 · 0 comments

Hi @tdrussell!

I’ve been exploring the QLoRA pipeline parallel trainer and noticed that the project currently doesn’t support native FP8 training. With the advent of NVIDIA’s 40XX series GPUs and the upcoming 50XX series, FP8 precision is becoming increasingly relevant for efficient and high-performance training. FP8 offers significant memory savings and computational efficiency, which could greatly benefit this project, especially for large-scale models.
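To make the memory claim concrete, here is a quick illustration using PyTorch's native FP8 storage dtypes (the dtype names are real, available since PyTorch 2.1; the snippet itself is just a sketch, not code from this project):

```python
import torch

# PyTorch ships native FP8 storage dtypes and casts; FP8 matmuls
# additionally require hardware support (compute capability 8.9+,
# i.e. 40XX/Ada, or 9.0+ on Hopper).
w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)  # 1 byte/element vs 2 for BF16

print(w_bf16.element_size(), w_fp8.element_size())  # -> 2 1
```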

Motivation:

- Hardware Advancements: NVIDIA's 40XX and 50XX GPUs natively support FP8 in their Tensor Cores, making it a natural fit for leveraging their full potential.
- Efficiency Gains: FP8 halves the memory footprint and bandwidth demand relative to FP16/BF16, enabling larger models or batch sizes within the same hardware constraints.
- Future-Proofing: Adding FP8 support ensures compatibility with next-generation hardware and keeps the project at the forefront of efficient training techniques.

Proposal:
I’d like to contribute to implementing native FP8 support in the QLoRA pipeline parallel trainer. This would involve:

- FP8 Data Type Integration: Adding support for FP8 data types in the training pipeline (a rough sketch of one possible approach follows this list).
- Mixed Precision Training: Ensuring compatibility with mixed precision workflows, combining FP8 with higher precisions (e.g., FP16, BF16) where numerical stability requires it.
- Hardware Optimization: Leveraging NVIDIA's FP8-capable libraries (e.g., Transformer Engine, cuDNN) that target the Tensor Cores on 40XX and 50XX GPUs.
- Testing and Validation: Thoroughly testing FP8 training to verify numerical stability and measure the actual performance gains.
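As a starting point for the data-type and mixed-precision items above, here is a minimal sketch of what FP8 training could look like using NVIDIA's Transformer Engine, one candidate library (not currently a project dependency). The layer sizes and integration point are hypothetical; the `te` and `recipe` APIs are Transformer Engine's own:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID recipe: E4M3 for the forward pass (weights/activations),
# E5M2 for gradients, with delayed scaling factors tracked across steps.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# TE modules are drop-in replacements for their torch.nn counterparts;
# master weights stay in higher precision, FP8 is used inside the GEMMs.
layer = te.Linear(4096, 4096, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()  # backward GEMMs also run in FP8 under the recipe
```

On the trainer side, the open questions would be which layers to swap (e.g., only the large attention/MLP projections) and how FP8 compute interacts with the QLoRA quantized base weights; that is part of what the Testing and Validation item would need to settle.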

Questions for Discussion:

1. Are there any existing plans or roadblocks related to FP8 support?
2. Are there specific areas of the codebase or workflows that would benefit most from FP8 integration?
3. Are there any preferred libraries or tools that should be used for this implementation?

I’m excited to collaborate on this and contribute to making the QLoRA pipeline parallel trainer even more efficient and future-ready. Let me know your thoughts!
