[doc] Benchmark Fast-LLM Pretraining Throughput Against Contemporary Frameworks #108

tscholak opened this issue Jan 8, 2025 · Labels: documentation

🧐 Problem Description

Fast-LLM aims to be a competitive pretraining framework for large-scale language models. To understand its performance relative to other contemporary frameworks, we need a comprehensive and reproducible benchmarking effort.

Pretraining large language models, particularly with decoder-only architectures, is computationally intensive and requires frameworks optimized for scalability and efficiency. Comparing Fast-LLM with leading pretraining frameworks will provide insights into its strengths, areas for improvement, and best practices for users.

Frameworks like NVIDIA Megatron-LM, MosaicML Composer, Hugging Face Nanotron, Levanter, and Colossal-AI offer distinct advantages in this domain. To make the comparison fair and actionable, we must benchmark these frameworks under identical conditions and clearly document the results.

💡 Proposed Solution

Develop and execute a benchmarking methodology that measures and compares pretraining throughput (tokens processed per second) across Fast-LLM and other contemporary frameworks. The proposed plan includes the following:

1. Select Frameworks for Comparison

  • NVIDIA Megatron-LM
  • Hugging Face Nanotron
  • MosaicML Composer
  • Levanter
  • Colossal-AI
  • Additional frameworks, if identified, that focus on large-scale pretraining of decoder-only models.

2. Define Model Architectures and Sizes

  • Use decoder-only architectures such as Llama 3, with parameter sizes ranging from 3B to 405B.
  • Allocate the number of DGX nodes dynamically based on model size (e.g., 1 node for 3B, multiple nodes for larger models).
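
As a rough starting point for the node allocation above, a back-of-the-envelope sizing sketch could look like the following. All constants (80 GB GPUs, 8 GPUs per node, ~16 bytes of fully sharded model state per parameter, a fixed activation headroom) are illustrative assumptions, not figures from this proposal.

```python
# Rough sizing sketch: lower bound on DGX nodes needed to hold sharded model state
# under mixed-precision Adam. Constants are assumptions for illustration only.
import math

GPU_MEM_GIB = 80           # e.g. 80 GB A100/H100; adjust to the actual hardware
GPUS_PER_NODE = 8          # DGX form factor
BYTES_PER_PARAM = 16       # bf16 weights + fp32 master weights + Adam moments, fully sharded
ACTIVATION_HEADROOM = 0.6  # fraction of GPU memory left for activations and buffers

def min_dgx_nodes(params_billion: float) -> int:
    """Lower bound on nodes needed to hold sharded model state."""
    state_gib = params_billion * 1e9 * BYTES_PER_PARAM / 2**30
    usable_gib_per_gpu = GPU_MEM_GIB * (1 - ACTIVATION_HEADROOM)
    gpus = math.ceil(state_gib / usable_gib_per_gpu)
    return max(1, math.ceil(gpus / GPUS_PER_NODE))

for size_b in (3, 8, 70, 405):
    print(f"{size_b}B params -> at least {min_dgx_nodes(size_b)} DGX node(s)")
```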

3. Standardize the Benchmark Environment

  • Use identical hardware for all frameworks (e.g., DGX nodes with consistent GPU, CPU, and network configurations).
  • Account for hardware variability (e.g., flaky GPUs, fabric bottlenecks) by running multiple trials.
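
For the repeated trials, per-trial throughput could be aggregated along these lines so that a single bad run (flaky GPU, congested fabric) does not dominate the reported number. Reporting the median with spread is one option, not a requirement of this proposal.

```python
# Illustrative only: summarize throughput across repeated trials of one configuration.
from statistics import median, stdev

def summarize_trials(tokens_per_sec: list[float]) -> dict:
    """Median throughput across trials, plus spread and trial count."""
    return {
        "median_tok_s": median(tokens_per_sec),
        "stdev_tok_s": stdev(tokens_per_sec) if len(tokens_per_sec) > 1 else 0.0,
        "n_trials": len(tokens_per_sec),
    }

# One slow outlier (e.g. a throttling GPU) barely moves the median.
print(summarize_trials([41_200.0, 40_950.0, 27_800.0]))
```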

4. Optimize Framework Settings

  • Tune each framework independently to achieve its best-case performance using:
    • Batch size
    • ZeRO strategy
    • Parallelism strategy (data, tensor, pipeline, sequence).
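
A per-framework tuning sweep could be enumerated roughly as follows; the candidate values are placeholders, and each framework would map these knobs onto its own configuration format.

```python
# Sketch of a tuning sweep: enumerate candidate configurations and keep only
# those whose parallel degrees fit the GPU budget. Values are placeholders.
from itertools import product

GPUS = 64  # e.g. 8 DGX nodes x 8 GPUs

micro_batch_sizes = [1, 2, 4]
zero_stages = [1, 2, 3]
tensor_parallel = [1, 2, 4, 8]
pipeline_parallel = [1, 2, 4]

candidates = []
for mbs, zero, tp, pp in product(micro_batch_sizes, zero_stages, tensor_parallel, pipeline_parallel):
    if GPUS % (tp * pp) != 0:
        continue  # the data-parallel degree must be an integer
    dp = GPUS // (tp * pp)
    candidates.append({"micro_batch": mbs, "zero": zero, "tp": tp, "pp": pp, "dp": dp})

print(f"{len(candidates)} candidate configurations per framework")
```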

5. Measure Throughput

  • Focus on pretraining throughput (tokens processed per second) and FLOPS.
  • Log additional metrics, such as:
    • GPU utilization
    • Memory efficiency
    • Communication overhead (e.g., NCCL performance).
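
The core metrics could be computed as in the sketch below, using the common ~6 × parameters × tokens approximation for dense decoder-only training FLOPs. The peak-FLOPS constant and the example numbers are assumptions, not measurements.

```python
# Tokens/s, achieved model FLOP/s, and hardware utilization for one training step.
def throughput_metrics(global_batch_tokens: int, step_time_s: float,
                       n_params: float, n_gpus: int,
                       peak_flops_per_gpu: float = 989e12) -> dict:  # assumed H100 BF16 dense peak
    tokens_per_s = global_batch_tokens / step_time_s
    model_flops_per_s = 6 * n_params * tokens_per_s  # forward + backward, no activation recomputation
    mfu = model_flops_per_s / (n_gpus * peak_flops_per_gpu)
    return {"tokens_per_s": tokens_per_s,
            "model_tflops_per_s": model_flops_per_s / 1e12,
            "mfu": mfu}

# Made-up example: 8B-parameter model, 1M-token global batch, 2.0 s/step on 64 GPUs.
print(throughput_metrics(1_048_576, 2.0, 8e9, 64))
```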

6. Data Loading Strategy

  • Benchmark with synthetic data generated on the fly to remove storage bottlenecks.
  • Include a real data benchmark to evaluate storage-dependent performance for frameworks optimized for slower storage environments.
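
For the synthetic-data path, a minimal on-the-fly generator might look like this (PyTorch shown only as an example; vocabulary size and sequence length are placeholders):

```python
# Endless stream of random token IDs with the right shape; never touches storage.
import torch
from torch.utils.data import DataLoader, IterableDataset

class SyntheticTokens(IterableDataset):
    def __init__(self, vocab_size: int = 128_256, seq_len: int = 8192):
        self.vocab_size = vocab_size
        self.seq_len = seq_len

    def __iter__(self):
        while True:
            yield torch.randint(0, self.vocab_size, (self.seq_len,), dtype=torch.long)

loader = DataLoader(SyntheticTokens(), batch_size=4, num_workers=2)
batch = next(iter(loader))  # shape: (4, 8192)
```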

7. Deliverables

  • Blog Post: Publish a comprehensive analysis detailing the benchmark results, including:
    • Performance graphs (e.g., throughput vs. model size or number of nodes).
    • Scalability curves for each framework.
    • Insights into where Fast-LLM excels or could improve.
  • Documentation Page: Create a new section in Fast-LLM's documentation with:
    • Benchmark results.
    • Best practices for achieving optimal performance with Fast-LLM.
  • Reproducibility Package: Provide all configuration files, scripts, and setup instructions used for the benchmarks. Ensure full transparency and openness in the methodology to allow replication by others.
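
For the performance graphs, a small plotting sketch like the one below could produce the throughput-vs-nodes scalability curves; the `results.csv` schema (framework, model, nodes, tokens_per_s) is hypothetical.

```python
# One scalability curve per framework: throughput vs. number of DGX nodes.
import csv
from collections import defaultdict
import matplotlib.pyplot as plt

series = defaultdict(list)
with open("results.csv") as f:  # hypothetical schema: framework, model, nodes, tokens_per_s
    for row in csv.DictReader(f):
        series[row["framework"]].append((int(row["nodes"]), float(row["tokens_per_s"])))

for framework, points in sorted(series.items()):
    nodes, tok_s = zip(*sorted(points))
    plt.plot(nodes, tok_s, marker="o", label=framework)

plt.xlabel("DGX nodes")
plt.ylabel("Throughput (tokens/s)")
plt.title("Pretraining throughput vs. cluster size")
plt.legend()
plt.savefig("scalability.png", dpi=150)
```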

🔄 Alternatives Considered

  • Excluding Real Data Benchmarks: Synthetic data ensures consistent performance across frameworks but may disadvantage frameworks designed to handle slow storage. Including both synthetic and real data benchmarks adds fairness.
  • Manual Configuration: Manually tuning each framework keeps the comparison even-handed, but it can miss optimizations unless each framework's documented best practices are followed.

📈 Potential Benefits

  • Performance Insights: Identify areas where Fast-LLM outperforms or lags behind competing frameworks.
  • Reproducible Results: Foster trust and credibility by providing clear methodologies and open configuration files.
  • Scalability Analysis: Understand how Fast-LLM scales with increasing hardware resources.
  • User Empowerment: Equip users with actionable benchmarks and best practices to make informed decisions.

📝 Additional Context

  • Focus exclusively on pretraining with a language modeling objective. Fine-tuning (e.g., SFT) is out of scope for this benchmarking effort.
  • Frameworks like Hugging Face Transformers + Accelerate, Fairseq, and T5X are excluded as they are more relevant for fine-tuning or encoder-decoder training.