This repository provides a training pipeline for fine-tuning language models using the Transformers library. It includes scripts for dataset preparation, model training, and monitoring. Special thanks to my teammates for their contributions to the Capstone Project. For more details, please refer to the original repository.
This project enables efficient fine-tuning of large language models using Hugging Face Transformers and DeepSpeed. It also supports:
- Dynamic dataset preparation and caching.
- Training with advanced optimization techniques like DeepSpeed Zero Init.
- Real-time monitoring via WandB.
- Dataset Preparation: Tokenize and preprocess large datasets for model training.
- DeepSpeed Integration: Efficient training of large models with Zero optimization.
- WandB Integration: Tracks training metrics and visualizations.
- Configuration-Driven: Define training parameters and paths via JSON files or CLI arguments.
- Clone this repository:

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install additional Linux tools (optional, e.g., for GPU monitoring):

  ```bash
  bash tool_install.sh
  ```
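If you want to confirm the environment before launching anything heavy, a quick check like the following can help (a minimal sketch; it assumes `torch`, `transformers`, and `deepspeed` are pulled in by `requirements.txt`):

```python
# Optional environment check: confirm the core libraries import and a GPU is visible.
import torch
import transformers
import deepspeed

print(f"transformers {transformers.__version__}, deepspeed {deepspeed.__version__}")
print(f"CUDA available: {torch.cuda.is_available()} "
      f"({torch.cuda.device_count()} device(s))")
```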
Use `load_data.py` to preprocess and tokenize your dataset. The script supports custom tokenizers and sequence packing.

```bash
python load_data.py \
  --tokenizer_path /workspace/ML_team/llama_tokenizer_1b \
  --dataset_name DKYoon/SlimPajama-6B \
  --output_dir ./datasets_pack_full/tokenized_data \
  --num_proc 16
```
- `--tokenizer_path`: Path to your custom tokenizer directory.
- `--dataset_name`: Dataset name from the Hugging Face Hub (e.g., `DKYoon/SlimPajama-6B`).
- `--output_dir`: Directory to save the tokenized dataset.
- `--max_seq_len`: Maximum sequence length for tokenization (default: 1024).
- `--num_proc`: Number of processes for parallel tokenization (default: 8).
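For orientation, the snippet below sketches the tokenize-and-pack pattern that a script like `load_data.py` typically implements with the `datasets` library: tokenize the raw text, concatenate the token streams, and split them into fixed-length blocks. The `text` column name, the labels handling, and the exact packing logic are assumptions, so treat this as an illustration rather than the script's actual code.

```python
# Sketch of tokenize-and-pack with the `datasets` library (illustrative only).
from datasets import load_dataset
from transformers import AutoTokenizer

max_seq_len = 1024
tokenizer = AutoTokenizer.from_pretrained("/workspace/ML_team/llama_tokenizer_1b")
raw = load_dataset("DKYoon/SlimPajama-6B", split="train")

def tokenize(batch):
    # Assumes the dataset exposes its documents in a "text" column.
    return tokenizer(batch["text"])

def pack(batch):
    # Concatenate all token ids, then cut them into max_seq_len-sized blocks.
    ids = [tok for ex in batch["input_ids"] for tok in ex]
    usable = (len(ids) // max_seq_len) * max_seq_len
    chunks = [ids[i:i + max_seq_len] for i in range(0, usable, max_seq_len)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}

tokenized = raw.map(tokenize, batched=True, num_proc=16, remove_columns=raw.column_names)
packed = tokenized.map(pack, batched=True, num_proc=16, remove_columns=tokenized.column_names)
packed.save_to_disk("./datasets_pack_full/tokenized_data")
```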
Once the dataset is prepared, use `train.py` to fine-tune your model. Training parameters and paths are specified in a configuration file.

DeepSpeed is used to optimize large-scale model training. To run training with DeepSpeed, use the following command:

```bash
deepspeed train.py --config config.json
```
- `deepspeed`: Invokes the DeepSpeed runtime to manage distributed training and memory optimizations.
- `train.py`: The training script, which initializes the model, tokenizer, and datasets based on the `config.json` file.
- `--config config.json`: Specifies the path to the configuration file containing all training parameters.
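Internally, `train.py` is expected to translate `config.json` into standard Hugging Face objects. The sketch below shows roughly how that wiring looks; the field names follow the example configuration further down, but the real script may load pretrained weights, add callbacks, or handle more keys, so read it as an assumption-laden outline rather than the actual implementation.

```python
# Rough outline of a config-driven training entry point (illustrative sketch).
import json
from datasets import load_from_disk
from transformers import (AutoConfig, AutoTokenizer, LlamaForCausalLM,
                          Trainer, TrainingArguments, default_data_collator)

with open("config.json") as f:
    cfg = json.load(f)

tokenizer = AutoTokenizer.from_pretrained(cfg["tokenizer_config"])
model = LlamaForCausalLM(AutoConfig.from_pretrained(cfg["model_path"]))
train_dataset = load_from_disk(cfg["data_path"])

args = TrainingArguments(
    output_dir=cfg["checkpoint_output_dir"],
    per_device_train_batch_size=cfg["batch_size"],
    gradient_accumulation_steps=cfg["gradient_accumulation"],
    learning_rate=cfg["learning_rate"],
    num_train_epochs=cfg["num_epoch"],
    warmup_steps=cfg["warmup_steps"],
    lr_scheduler_type=cfg["lr_scheduler_type"],
    fp16=cfg["fp16"],
    bf16=cfg["bf16"],
    save_steps=cfg["save_steps"],
    logging_steps=cfg["logging_steps"],
    deepspeed=cfg["deepspeed_config"],   # Zero-2/Zero-3 settings live in this file
    report_to=cfg["vis_app"],            # "wandb" enables WandB logging
    seed=cfg["seed"],
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                  data_collator=default_data_collator)
trainer.train()
```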
During training, expect the following output:

- Initialization Logs:
  - DeepSpeed initialization with `Zero Init` (if enabled).
  - GPU allocation and compute type (`fp16` or `bf16`).
- Real-Time Metrics:
  - Progress logs and metrics will be reported to WandB, if configured.
- Model Checkpoints:
  - Checkpoints will be saved periodically, as defined by `save_steps` and `checkpoint_output_dir`.
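If a run is interrupted, the Hugging Face helper below can locate the newest checkpoint under `checkpoint_output_dir`; passing that path to `Trainer.train(resume_from_checkpoint=...)` resumes training. The directory path here simply mirrors the example configuration.

```python
# Find the most recent checkpoint written by the Trainer (e.g. after an interruption).
from transformers.trainer_utils import get_last_checkpoint

last_ckpt = get_last_checkpoint("./model_checkpoints")
print(f"Latest checkpoint: {last_ckpt}")  # e.g. ./model_checkpoints/checkpoint-50000
# Resume with: trainer.train(resume_from_checkpoint=last_ckpt)
```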
Customize via CLI arguments:
- `--tokenizer_path`: Path to tokenizer.
- `--dataset_name`: Hugging Face dataset name.
- `--max_seq_len`: Maximum token length (default: 1024).
- `--num_proc`: Number of processes for tokenization.
```bash
python load_data.py \
  --tokenizer_path /workspace/ML_team/llama_tokenizer_1b \
  --dataset_name DKYoon/SlimPajama-6B \
  --output_dir ./datasets_pack_full/tokenized_data \
  --num_proc 16
```
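After the command finishes, a quick sanity check can confirm the output before training starts. This is a small sketch that assumes the script saves with `Dataset.save_to_disk`; adjust the split handling if the on-disk layout differs.

```python
# Reload the tokenized output and inspect one example (sanity check sketch).
from datasets import Dataset, load_from_disk

ds = load_from_disk("./datasets_pack_full/tokenized_data")
print(ds)  # rows and columns (or named splits, if a DatasetDict was saved)
first = ds[0] if isinstance(ds, Dataset) else ds["train"][0]
print(len(first["input_ids"]))  # should equal --max_seq_len when packing is enabled
```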
Edit `config.json` to:

- Set model paths, optimizer settings, and training hyperparameters.
- Configure DeepSpeed optimizations (`use_zero3`, gradient accumulation).
- Enable WandB for real-time monitoring.
Example `config.json`:
```json
{
"seed": 42,
"cuda_visible_devices": "0,1",
"master_addr": "localhost",
"master_port": "9994",
"data_path": "./datasets_pack_full/tokenized_data",
"model_path": "./configs/model_configs/llama_190M_config.json",
"checkpoint_output_dir": "./model_checkpoints",
"deepspeed_config": "./configs/deepspeed_configs/test_ds_zero2_config.json",
"tokenizer_config": "./configs/llama_tokenizer_configs",
"logging_dir": "./logs",
"eval_strategy": "steps",
"save_strategy": "steps",
"save_steps": 50000,
"logging_steps": 20,
"eval_steps": 10000,
"num_epoch": 1,
"batch_size": 16,
"gradient_checkpointing": false,
"fp16": true,
"bf16": false,
"learning_rate": 0.0003,
"gradient_accumulation": 1,
"weight_decay": 0.00003,
"save_total_limit": 2,
"warmup_steps": 500,
"use_zero3": true,
"lr_scheduler_type": "cosine",
"vis_app": "wandb",
"final_model": {
"path": "./final_model/target_model_config",
"tokenizer_path": "./final_model/target_tokenizer_config"
},
"wandb": {
"key": "<your-wandb-key>",
"project_name": "llama-training",
"entity_name": "<your-wandb-entity>"
}
}
```
Then launch training:

```bash
deepspeed train.py --config config.json
```
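The process-level keys in the configuration (`cuda_visible_devices`, `master_addr`, `master_port`, `seed`) are normally applied before any model code runs, typically along the lines of the sketch below; how `train.py` actually consumes them may differ.

```python
# Sketch: applying the process-level config keys via environment variables and the seed helper.
import json
import os
from transformers import set_seed

with open("config.json") as f:
    cfg = json.load(f)

os.environ["CUDA_VISIBLE_DEVICES"] = cfg["cuda_visible_devices"]  # e.g. "0,1"
os.environ["MASTER_ADDR"] = cfg["master_addr"]                    # rendezvous address for distributed init
os.environ["MASTER_PORT"] = cfg["master_port"]                    # rendezvous port
set_seed(cfg["seed"])                                             # reproducible runs
```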
Install and use tools like `nvtop`:

```bash
sudo apt install nvtop
```
Track training progress with WandB:

- Log in with the API key set in the `wandb` section of the configuration.
- View real-time metrics on the WandB dashboard.
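For reference, the WandB fields in the configuration map onto the standard `wandb` client calls roughly as follows (a sketch; `train.py` may instead rely on environment variables such as `WANDB_API_KEY` and `WANDB_PROJECT`):

```python
# Sketch: authenticating and opening a WandB run from the config's "wandb" block.
import json
import wandb

with open("config.json") as f:
    wb = json.load(f)["wandb"]

wandb.login(key=wb["key"])  # alternatively, export WANDB_API_KEY
run = wandb.init(project=wb["project_name"], entity=wb["entity_name"])
# With "vis_app": "wandb", the Hugging Face Trainer reports loss, learning rate,
# and evaluation metrics to this project automatically.
```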
Install Python dependencies via `requirements.txt`:

```bash
pip install -r requirements.txt
```
Run `tool_install.sh` to install:

- `nvtop`: Monitor GPU usage.
- Other tools: Extend the script as needed.
Feel free to fork the repository and submit PRs for improvements. Issues and feature requests are welcome!