Language Model Training Pipeline

This repository provides a training pipeline for fine-tuning language models using the Transformers library. It includes scripts for dataset preparation, model training, and monitoring. Special thanks to my teammates for their contributions to the Capstone Project. For more details, please refer to the original repository.

Table of Contents

  • Project Overview
  • Features
  • Installation
  • Usage
  • Configuration
  • Monitoring and Debugging
  • Dependencies
  • Contribution

Project Overview

This project enables efficient fine-tuning of large language models with Hugging Face Transformers and DeepSpeed. It supports:

  • Dynamic dataset preparation and caching.
  • Training with advanced optimization techniques like DeepSpeed Zero Init.
  • Real-time monitoring via WandB.

Features

  • Dataset Preparation: Tokenize and preprocess large datasets for model training.
  • DeepSpeed Integration: Efficient training of large models with Zero optimization.
  • WandB Integration: Track training metrics and visualize runs in real time.
  • Configuration-Driven: Define training parameters and paths via JSON files or CLI arguments.

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd <repository-directory>
  2. Install Python dependencies:

    pip install -r requirements.txt
  3. Install additional Linux tools (optional, e.g., for GPU monitoring):

    bash tool_install.sh

Usage

1. Prepare and Cache Dataset

Use load_data.py to preprocess and tokenize your dataset. The script supports custom tokenizers and sequence packing.

Example:

python load_data.py \
  --tokenizer_path /workspace/ML_team/llama_tokenizer_1b \
  --dataset_name DKYoon/SlimPajama-6B \
  --output_dir ./datasets_pack_full/tokenized_data \
  --num_proc 16

Key Arguments:

  • --tokenizer_path: Path to your custom tokenizer directory.
  • --dataset_name: Dataset name from Hugging Face hub (e.g., DKYoon/SlimPajama-6B).
  • --output_dir: Directory to save the tokenized dataset.
  • --max_seq_len: Maximum sequence length for tokenization (default: 1024).
  • --num_proc: Number of processes for parallel tokenization (default: 8).
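
For reference, "sequence packing" here means concatenating tokenized documents and slicing them into fixed-length blocks of --max_seq_len tokens. The sketch below illustrates that standard concatenate-and-chunk pattern with Hugging Face datasets; it is an assumption about what load_data.py does internally, and the function names and the "text" column are illustrative, not the script's actual code.

# Illustrative sketch of tokenize-and-pack preprocessing (assumption: load_data.py
# follows the common concatenate-and-chunk pattern; names here are hypothetical).
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

max_seq_len = 1024
tokenizer = AutoTokenizer.from_pretrained("/workspace/ML_team/llama_tokenizer_1b")
raw = load_dataset("DKYoon/SlimPajama-6B", split="train")  # assumes a "text" column

def tokenize(batch):
    return tokenizer(batch["text"])

def pack(batch):
    # Concatenate all token ids, then slice into fixed-length blocks.
    ids = list(chain.from_iterable(batch["input_ids"]))
    total = (len(ids) // max_seq_len) * max_seq_len
    blocks = [ids[i : i + max_seq_len] for i in range(0, total, max_seq_len)]
    return {"input_ids": blocks, "labels": [b[:] for b in blocks]}

tokenized = raw.map(tokenize, batched=True, num_proc=16, remove_columns=raw.column_names)
packed = tokenized.map(pack, batched=True, num_proc=16, remove_columns=tokenized.column_names)
packed.save_to_disk("./datasets_pack_full/tokenized_data")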

2. Train the Model

Once the dataset is prepared, use train.py to fine-tune your model. Training parameters and paths are specified in a configuration file.

Running Training with DeepSpeed:

DeepSpeed is used to optimize large-scale model training. To run the training with DeepSpeed, use the following command:

deepspeed train.py --config config.json

Explanation of the Command:

  • deepspeed: Invokes the DeepSpeed runtime to manage distributed training and memory optimizations.
  • train.py: The training script, which initializes the model, tokenizer, and datasets based on the config.json file.
  • --config config.json: Specifies the path to the configuration file containing all training parameters.
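
As a rough illustration of what happens with the configuration file, the sketch below maps config.json fields onto transformers.TrainingArguments and hands the DeepSpeed JSON path to the Trainer integration. This is an assumption about train.py's internals; the real field names and wiring may differ.

# Hypothetical helper showing how config.json fields could feed the HF Trainer.
import json

from transformers import TrainingArguments

def build_training_args(config_path: str) -> TrainingArguments:
    with open(config_path) as f:
        cfg = json.load(f)
    return TrainingArguments(
        output_dir=cfg["checkpoint_output_dir"],
        per_device_train_batch_size=cfg["batch_size"],
        gradient_accumulation_steps=cfg["gradient_accumulation"],
        learning_rate=cfg["learning_rate"],
        weight_decay=cfg["weight_decay"],
        warmup_steps=cfg["warmup_steps"],
        num_train_epochs=cfg["num_epoch"],
        lr_scheduler_type=cfg["lr_scheduler_type"],
        fp16=cfg["fp16"],
        bf16=cfg["bf16"],
        eval_strategy=cfg["eval_strategy"],  # "evaluation_strategy" on older transformers versions
        eval_steps=cfg["eval_steps"],
        save_strategy=cfg["save_strategy"],
        save_steps=cfg["save_steps"],
        save_total_limit=cfg["save_total_limit"],
        logging_steps=cfg["logging_steps"],
        logging_dir=cfg["logging_dir"],
        gradient_checkpointing=cfg["gradient_checkpointing"],
        deepspeed=cfg["deepspeed_config"],  # path to the ZeRO config JSON
        report_to=cfg["vis_app"],
        seed=cfg["seed"],
    )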

What to Expect:

  • Initialization Logs:
    • DeepSpeed initialization with Zero Init, if enabled (see the sketch after this list).
    • GPU allocation and compute type (fp16 or bf16).
  • Real-Time Metrics:
    • Progress logs and metrics will be reported to WandB, if configured.
  • Model Checkpoints:
    • Checkpoints will be saved periodically as defined in save_steps and checkpoint_output_dir.
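
The "Zero Init" mentioned above refers to constructing the model with its parameters already partitioned across ranks under ZeRO-3, rather than materializing a full copy on every GPU. A minimal sketch of that pattern with raw DeepSpeed follows; train.py may instead rely on the Trainer's built-in DeepSpeed integration, so treat this as illustrative only.

# Minimal Zero Init sketch (assumption: not necessarily how train.py builds the model).
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

model_config = AutoConfig.from_pretrained("./configs/model_configs/llama_190M_config.json")
with deepspeed.zero.Init():
    # Parameters are sharded across GPUs at construction time.
    model = AutoModelForCausalLM.from_config(model_config)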

Configuration

load_data.py Configuration:

Customize via CLI arguments:

  • --tokenizer_path: Path to tokenizer.
  • --dataset_name: Hugging Face dataset name.
  • --max_seq_len: Maximum token length (default: 1024).
  • --num_proc: Number of processes for tokenization.

Example:

python load_data.py \
  --tokenizer_path /workspace/ML_team/llama_tokenizer_1b \
  --dataset_name DKYoon/SlimPajama-6B \
  --output_dir ./datasets_pack_full/tokenized_data \
  --num_proc 16

train.py Configuration:

Edit config.json to:

  • Set model paths, optimizer settings, and training hyperparameters.
  • Configure DeepSpeed optimizations (use_zero3, gradient accumulation).
  • Enable WandB for real-time monitoring.

Example config.json:

{
    "seed": 42,
    "cuda_visible_devices": "0,1",
    "master_addr": "localhost",
    "master_port": "9994",
    "data_path": "./datasets_pack_full/tokenized_data",
    "model_path": "./configs/model_configs/llama_190M_config.json",
    "checkpoint_output_dir": "./model_checkpoints",
    "deepspeed_config": "./configs/deepspeed_configs/test_ds_zero2_config.json",
    "tokenizer_config": "./configs/llama_tokenizer_configs",
    "logging_dir": "./logs",
    "eval_strategy": "steps",
    "save_strategy": "steps",
    "save_steps": 50000,
    "logging_steps": 20,
    "eval_steps": 10000,
    "num_epoch": 1,
    "batch_size": 16,
    "gradient_checkpointing": false,
    "fp16": true,
    "bf16": false,
    "learning_rate": 0.0003,
    "gradient_accumulation": 1,
    "weight_decay": 0.00003,
    "save_total_limit": 2,
    "warmup_steps": 500,
    "use_zero3": true,
    "lr_scheduler_type": "cosine",
    "vis_app": "wandb",
    "final_model": {
        "path": "./final_model/target_model_config",
        "tokenizer_path": "./final_model/target_tokenizer_config"
    },
    "wandb": {
        "key": "<your-wandb-key>",
        "project_name": "llama-training",
        "entity_name": "<your-wandb-entity>"
    }
}

Then launch training with this configuration:

deepspeed train.py --config config.json

Monitoring and Debugging

GPU Monitoring

Install and use tools like nvtop:

sudo apt install nvtop

WandB Metrics

Track training progress:

  • Authenticate with your WandB API key, supplied via the wandb block in config.json.
  • View real-time metrics on the WandB dashboard.
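
For reference, the wandb block in config.json likely maps onto calls along these lines (an assumption; the actual script may initialize the run differently):

# Illustrative sketch: read the wandb section of config.json and start a run.
import json

import wandb

with open("config.json") as f:
    wb = json.load(f)["wandb"]

wandb.login(key=wb["key"])
wandb.init(project=wb["project_name"], entity=wb["entity_name"])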

Dependencies

Python Dependencies

Install via requirements.txt:

pip install -r requirements.txt

Linux Tools

Run tool_install.sh to install:

  • nvtop: Monitor GPU usage.
  • Other tools: Extend as needed.

Contribution

Feel free to fork the repository and submit PRs for improvements. Issues and feature requests are welcome!
