[doc] Benchmark Fast-LLM Pretraining Throughput Against Contemporary Frameworks #108

tscholak opened this issue Jan 8, 2025 · Labels: documentation

🧐 Problem Description

Fast-LLM aims to be a competitive pretraining framework for large-scale language models. To understand its performance relative to other contemporary frameworks, we need a comprehensive and reproducible benchmarking effort.

Pretraining large language models, particularly with decoder-only architectures, is computationally intensive and requires frameworks optimized for scalability and efficiency. Comparing Fast-LLM with leading pretraining frameworks will provide insights into its strengths, areas for improvement, and best practices for users.

Frameworks like NVIDIA Megatron-LM, MosaicML Composer, Hugging Face Nanotron, Levanter, and Colossal-AI offer distinct advantages in this domain. To make the comparison fair and actionable, we must benchmark these frameworks under identical conditions and clearly document the results.

💡 Proposed Solution

Develop and execute a benchmarking methodology that measures and compares pretraining throughput (tokens processed per second) across Fast-LLM and other contemporary frameworks. The proposed plan includes the following:

1. Select Frameworks for Comparison

  • NVIDIA Megatron-LM
  • Hugging Face Nanotron
  • MosaicML Composer
  • Levanter
  • Colossal-AI
  • Additional frameworks, if identified, that focus on large-scale pretraining of decoder-only models.

2. Define Model Architectures and Sizes

  • Use decoder-only architectures such as Llama 3, with parameter sizes ranging from 3B to 405B.
  • Allocate the number of DGX nodes dynamically based on model size (e.g., 1 node for 3B, multiple nodes for larger models).
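
As a rough starting point for the node allocation above, a back-of-the-envelope sizing sketch could look like the following. All constants (80 GB GPUs, 8 GPUs per node, ~16 bytes of fully sharded model state per parameter, a fixed activation headroom) are illustrative assumptions, not figures from this proposal.

```python
# Rough sizing sketch: lower bound on DGX nodes needed to hold sharded model state
# under mixed-precision Adam. Constants are assumptions for illustration only.
import math

GPU_MEM_GIB = 80           # e.g. 80 GB A100/H100; adjust to the actual hardware
GPUS_PER_NODE = 8          # DGX form factor
BYTES_PER_PARAM = 16       # bf16 weights + fp32 master weights + Adam moments, fully sharded
ACTIVATION_HEADROOM = 0.6  # fraction of GPU memory left for activations and buffers

def min_dgx_nodes(params_billion: float) -> int:
    """Lower bound on nodes needed to hold sharded model state."""
    state_gib = params_billion * 1e9 * BYTES_PER_PARAM / 2**30
    usable_gib_per_gpu = GPU_MEM_GIB * (1 - ACTIVATION_HEADROOM)
    gpus = math.ceil(state_gib / usable_gib_per_gpu)
    return max(1, math.ceil(gpus / GPUS_PER_NODE))

for size_b in (3, 8, 70, 405):
    print(f"{size_b}B params -> at least {min_dgx_nodes(size_b)} DGX node(s)")
```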

3. Standardize the Benchmark Environment

  • Use identical hardware for all frameworks (e.g., DGX nodes with consistent GPU, CPU, and network configurations).
  • Account for hardware variability (e.g., flaky GPUs, fabric bottlenecks) by running multiple trials.
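
For the repeated trials, per-trial throughput could be aggregated along these lines so that a single bad run (flaky GPU, congested fabric) does not dominate the reported number. Reporting the median with spread is one option, not a requirement of this proposal.

```python
# Illustrative only: summarize throughput across repeated trials of one configuration.
from statistics import median, stdev

def summarize_trials(tokens_per_sec: list[float]) -> dict:
    """Median throughput across trials, plus spread and trial count."""
    return {
        "median_tok_s": median(tokens_per_sec),
        "stdev_tok_s": stdev(tokens_per_sec) if len(tokens_per_sec) > 1 else 0.0,
        "n_trials": len(tokens_per_sec),
    }

# One slow outlier (e.g. a throttling GPU) barely moves the median.
print(summarize_trials([41_200.0, 40_950.0, 27_800.0]))
```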

4. Optimize Framework Settings

  • Tune each framework independently to achieve its best-case performance using:
    • Batch size
    • ZeRO strategy
    • Parallelism strategy (data, tensor, pipeline, sequence).
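
A per-framework tuning sweep could be enumerated roughly as follows; the candidate values are placeholders, and each framework would map these knobs onto its own configuration format.

```python
# Sketch of a tuning sweep: enumerate candidate configurations and keep only
# those whose parallel degrees fit the GPU budget. Values are placeholders.
from itertools import product

GPUS = 64  # e.g. 8 DGX nodes x 8 GPUs

micro_batch_sizes = [1, 2, 4]
zero_stages = [1, 2, 3]
tensor_parallel = [1, 2, 4, 8]
pipeline_parallel = [1, 2, 4]

candidates = []
for mbs, zero, tp, pp in product(micro_batch_sizes, zero_stages, tensor_parallel, pipeline_parallel):
    if GPUS % (tp * pp) != 0:
        continue  # the data-parallel degree must be an integer
    dp = GPUS // (tp * pp)
    candidates.append({"micro_batch": mbs, "zero": zero, "tp": tp, "pp": pp, "dp": dp})

print(f"{len(candidates)} candidate configurations per framework")
```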

5. Measure Throughput

  • Focus on pretraining throughput (tokens processed per second) and FLOPS.
  • Log additional metrics, such as:
    • GPU utilization
    • Memory efficiency
    • Communication overhead (e.g., NCCL performance).
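
The core metrics could be computed as in the sketch below, using the common ~6 × parameters × tokens approximation for dense decoder-only training FLOPs. The peak-FLOPS constant and the example numbers are assumptions, not measurements.

```python
# Tokens/s, achieved model FLOP/s, and hardware utilization for one training step.
def throughput_metrics(global_batch_tokens: int, step_time_s: float,
                       n_params: float, n_gpus: int,
                       peak_flops_per_gpu: float = 989e12) -> dict:  # assumed H100 BF16 dense peak
    tokens_per_s = global_batch_tokens / step_time_s
    model_flops_per_s = 6 * n_params * tokens_per_s  # forward + backward, no activation recomputation
    mfu = model_flops_per_s / (n_gpus * peak_flops_per_gpu)
    return {"tokens_per_s": tokens_per_s,
            "model_tflops_per_s": model_flops_per_s / 1e12,
            "mfu": mfu}

# Made-up example: 8B-parameter model, 1M-token global batch, 2.0 s/step on 64 GPUs.
print(throughput_metrics(1_048_576, 2.0, 8e9, 64))
```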

6. Data Loading Strategy

  • Benchmark with synthetic data generated on the fly to remove storage bottlenecks.
  • Include a real data benchmark to evaluate storage-dependent performance for frameworks optimized for slower storage environments.
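
For the synthetic-data path, a minimal on-the-fly generator might look like this (PyTorch shown only as an example; vocabulary size and sequence length are placeholders):

```python
# Endless stream of random token IDs with the right shape; never touches storage.
import torch
from torch.utils.data import DataLoader, IterableDataset

class SyntheticTokens(IterableDataset):
    def __init__(self, vocab_size: int = 128_256, seq_len: int = 8192):
        self.vocab_size = vocab_size
        self.seq_len = seq_len

    def __iter__(self):
        while True:
            yield torch.randint(0, self.vocab_size, (self.seq_len,), dtype=torch.long)

loader = DataLoader(SyntheticTokens(), batch_size=4, num_workers=2)
batch = next(iter(loader))  # shape: (4, 8192)
```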

7. Deliverables

  • Blog Post: Publish a comprehensive analysis detailing the benchmark results, including:
    • Performance graphs (e.g., throughput vs. model size or number of nodes).
    • Scalability curves for each framework.
    • Insights into where Fast-LLM excels or could improve.
  • Documentation Page: Create a new section in Fast-LLM's documentation with:
    • Benchmark results.
    • Best practices for achieving optimal performance with Fast-LLM.
  • Reproducibility Package: Provide all configuration files, scripts, and setup instructions used for the benchmarks. Ensure full transparency and openness in the methodology to allow replication by others.
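
For the performance graphs, a small plotting sketch like the one below could produce the throughput-vs-nodes scalability curves; the `results.csv` schema (framework, model, nodes, tokens_per_s) is hypothetical.

```python
# One scalability curve per framework: throughput vs. number of DGX nodes.
import csv
from collections import defaultdict
import matplotlib.pyplot as plt

series = defaultdict(list)
with open("results.csv") as f:  # hypothetical schema: framework, model, nodes, tokens_per_s
    for row in csv.DictReader(f):
        series[row["framework"]].append((int(row["nodes"]), float(row["tokens_per_s"])))

for framework, points in sorted(series.items()):
    nodes, tok_s = zip(*sorted(points))
    plt.plot(nodes, tok_s, marker="o", label=framework)

plt.xlabel("DGX nodes")
plt.ylabel("Throughput (tokens/s)")
plt.title("Pretraining throughput vs. cluster size")
plt.legend()
plt.savefig("scalability.png", dpi=150)
```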

🔄 Alternatives Considered

  • Excluding Real Data Benchmarks: Synthetic data ensures consistent performance across frameworks but may disadvantage frameworks designed to handle slow storage. Including both synthetic and real data benchmarks adds fairness.
  • Manual Configuration: Manually tuning each framework keeps the comparison even-handed, but it can miss optimizations unless each framework's documented best practices are followed.

📈 Potential Benefits

  • Performance Insights: Identify areas where Fast-LLM outperforms or lags behind competing frameworks.
  • Reproducible Results: Foster trust and credibility by providing clear methodologies and open configuration files.
  • Scalability Analysis: Understand how Fast-LLM scales with increasing hardware resources.
  • User Empowerment: Equip users with actionable benchmarks and best practices to make informed decisions.

📝 Additional Context

  • Focus exclusively on pretraining with a language modeling objective. Fine-tuning (e.g., SFT) is out of scope for this benchmarking effort.
  • Frameworks like Hugging Face Transformers + Accelerate, Fairseq, and T5X are excluded as they are more relevant for fine-tuning or encoder-decoder training.