
Support GPU Training #64

Open
wants to merge 3 commits into base: main

Conversation

TideDra

@TideDra TideDra commented Jul 26, 2024

This PR supports training Cambrian-8B on GPUs with DeepSpeed ZeRO-2. The main modifications include:

  1. We remove all the .float() casts that were added to satisfy TPU precision and use bf16 uniformly.
  2. We fix the Llama-3 chat template bug and the tokenizer bug that adds an extra BOS token (see the sketch below).
  3. We revert the checkpoint loading and resuming implementation to the default Hugging Face Trainer implementation.
  4. We optimize data and model loading.
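
For item 2, a minimal sketch of the kind of BOS-token guard this refers to, assuming the Hugging Face transformers tokenizer API (illustrative only, not the PR's exact code):

from transformers import AutoTokenizer

# The Llama-3 chat template already emits <|begin_of_text|>, so tokenizing its output
# with the default add_special_tokens=True would prepend a second BOS token.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Describe this image."}],
    tokenize=False,
    add_generation_prompt=True,
)
input_ids = tokenizer(prompt, add_special_tokens=False).input_ids

# Defensive check: drop a duplicated BOS if one slipped in anyway.
if len(input_ids) >= 2 and input_ids[0] == input_ids[1] == tokenizer.bos_token_id:
    input_ids = input_ids[1:]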

@ellisbrown
Member

@TideDra @flyskywalkerlby thanks so much for your contribution!

I'll look through the code and add any comments/suggestions. We will also have to fire off some TPU training runs to verify that nothing impacts our TPU training before merging.

Comment on lines 15 to +19
  dependencies = [
-     "torch==2.2.0", "torchvision==0.17.0",
-     "transformers==4.37.0", "tokenizers==0.15.0", "sentencepiece==0.1.99", "shortuuid",
-     "accelerate==0.23.0", "peft==0.4.0",
-     "pydantic", "markdown2[all]", "numpy==1.26.4", "scikit-learn==1.2.2",
+     "torch==2.3.1", "torchvision==0.18.1",
+     "transformers==4.42.4", "tokenizers==0.19.1", "sentencepiece==0.2.0", "shortuuid",
+     "accelerate==0.32.1", "peft==0.11.1",
+     "pydantic", "markdown2[all]", "numpy==1.26.4", "scikit-learn==1.5.1",
Member

@tsb0601 @penghao-wu

we need to check if the most recent version of accelerate still has the TPU bugs we encountered

inference.py (outdated review thread, resolved)
@flyskywalkerlby

flyskywalkerlby commented Jul 29, 2024 via email

@wufeim

wufeim commented Oct 24, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

@TideDra
Author

TideDra commented Oct 24, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

@wufeim

wufeim commented Oct 24, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

Got it, thanks!

@dfan

dfan commented Oct 28, 2024

Is there a reason why the checkpoint saving uses torch.save()? It seems that the full model weights are stored per rank instead of the sharded model weights, so the overall size of the checkpoints is huge
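
For illustration, a minimal sketch of the alternative this points at: consolidating the save on rank 0 instead of writing the full weights from every rank. This is a hypothetical helper, not the PR's implementation, and it assumes ZeRO-2, where each rank still holds a full copy of the bf16 parameters:

import torch
import torch.distributed as dist

def save_checkpoint_rank0(model, path):
    # `model` is the unwrapped nn.Module (e.g. the DeepSpeed engine's .module).
    # Under ZeRO-2 the module parameters are replicated, so rank 0's state_dict
    # already contains the full model; writing it once avoids N duplicate copies.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    if dist.is_initialized():
        dist.barrier()  # keep the other ranks from racing ahead of the save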

@wufeim

wufeim commented Oct 28, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA, or with models trained with this Cambrian+GPU code, I couldn't reproduce the results reported in the LLaVA-1.5 paper. I'm not sure what the difference between the evaluations is, but we probably need to modify the evaluation code from LLaVA to reproduce the exact results.

@ellisbrown
Member

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA, or with models trained with this Cambrian+GPU code, I couldn't reproduce the results reported in the LLaVA-1.5 paper. I'm not sure what the difference between the evaluations is, but we probably need to modify the evaluation code from LLaVA to reproduce the exact results.

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

@wufeim

wufeim commented Oct 28, 2024

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

Oh I see it now. Thanks so much! I will check it out.

I was looking at the documentation here and thought they were not out yet. Maybe update the link in the README?

@wufeim

wufeim commented Oct 29, 2024

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

Hi @ellisbrown , quick questions on the evaluation code:

  1. It seems that eval/requirements.txt is missing. I guess mainly the datasets package?
  2. When I was evaluating cambrian-8b with the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?
    bash scripts/run_benchmark.sh --benchmark textvqa --ckpt nyu-visionx/cambrian-8b --conv_mode llama_3
    

@dfan

dfan commented Oct 30, 2024

+1 the eval/requirements.txt is missing. It'd be nice to know if a specific version of datasets is needed

@ellisbrown
Member

@wufeim @dfan sorry, the requirements file was masked by .gitignore; added in #82.

2. When I was evaluating cambrian-8b with the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?

@wufeim have a read through run_benchmark.sh: the questions are chunked and each GPU handles one chunk.
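
For reference, a minimal sketch of that chunking pattern (LLaVA-style helpers; the exact function names in the eval scripts may differ):

import math

def split_list(lst, n):
    # Split lst into n roughly equal chunks.
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    # Return the k-th of n chunks; each GPU process is launched with its own k.
    return split_list(lst, n)[k]

# e.g. the process bound to GPU 2 of 4 evaluates only its quarter of the questions:
# questions = get_chunk(questions, n=4, k=2)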


Let's move further discussion unrelated to this GPU training PR (#64) to separate issues, please.

@wufeim

wufeim commented Nov 1, 2024

Hi @TideDra,

I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?

Thanks!

@TideDra
Author

TideDra commented Nov 1, 2024

Hi @TideDra,

I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?

Thanks!

In general, zero3 reduces GPU memory usage but increases training time compared with zero2; in theory it does not affect model performance. So zero2 is preferred if memory is sufficient. I didn't try zero3, so I didn't encounter your issue :). But in practice, zero3 does have more bugs than zero2.
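
As a point of comparison, a rough sketch of what the stage switch looks like in a DeepSpeed config. The field names follow the DeepSpeed JSON schema, but the file name and exact values here are placeholders, not the repo's config:

import json

# Relative to zero2.json, the key change is "stage": 3, which additionally shards
# the parameters themselves (not just optimizer states and gradients) across GPUs.
zero3_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("scripts/zero3_example.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
# then pass it to the launcher: --deepspeed scripts/zero3_example.json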

@nku-zhichengzhang

nku-zhichengzhang commented Nov 7, 2024

OOM when finetuning Cambrian on an 80GB A100 cluster, even with a batch size of 1

Description:
I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists.

Environment:

Model: Cambrian
GPU: A100
DeepSpeed Version: 0.14.4
CUDA Version: 12.1
PyTorch Version: 2.3.1

Python Packages:
accelerate 0.32.1
aiofiles 23.2.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
altair 5.4.1
annotated-types 0.7.0
anyio 4.6.2.post1
asttokens 2.4.1
async-timeout 4.0.3
attrs 24.2.0
bitsandbytes 0.43.1
cachetools 5.5.0
cambrian 1.0.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
colorama 0.4.6
coloredlogs 15.0.1
contourpy 1.3.0
cos-python-sdk-v5 1.9.32
crcmod 1.7
cycler 0.12.1
decorator 5.1.1
deepspeed 0.14.4
diffusers 0.31.0
docker-pycreds 0.4.0
einops 0.8.0
einops-exts 0.0.4
exceptiongroup 1.2.2
executing 2.1.0
EzColorLog 1.0.3
fastapi 0.115.4
ffmpy 0.4.0
filelock 3.16.1
flash-attn 2.6.3
fonttools 4.54.1
frozenlist 1.5.0
fsspec 2024.10.0
ftfy 6.3.1
gcsfs 2024.10.0
gitdb 4.0.11
GitPython 3.1.43
google-api-core 2.22.0
google-auth 2.36.0
google-auth-oauthlib 1.2.1
google-cloud-core 2.4.1
google-cloud-storage 2.18.2
google-crc32c 1.6.0
google-resumable-media 2.7.2
googleapis-common-protos 1.65.0
gradio 4.16.0
gradio_client 0.8.1
h11 0.14.0
hf_transfer 0.1.8
hjson 3.1.0
httpcore 0.17.3
httpx 0.24.0
huggingface-hub 0.26.2
humanfriendly 10.0
idna 3.10
importlib_metadata 8.5.0
importlib_resources 6.4.5
ipython 8.29.0
jedi 0.19.1
Jinja2 3.1.4
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
kiwisolver 1.4.7
latex2mathml 3.77.0
markdown-it-py 3.0.0
markdown2 2.5.1
MarkupSafe 2.1.5
matplotlib 3.9.2
matplotlib-inline 0.1.7
mdurl 0.1.2
mpmath 1.3.0
multidict 6.1.0
narwhals 1.13.2
networkx 3.4.2
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
open_clip_torch 2.29.0
orjson 3.10.11
packaging 24.1
pandas 2.2.3
parso 0.8.4
peewee 3.17.7
peft 0.11.1
pexpect 4.9.0
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
prompt_toolkit 3.0.48
propcache 0.2.0
proto-plus 1.25.0
protobuf 5.28.3
psutil 6.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pycryptodome 3.21.0
pydantic 2.9.2
pydantic_core 2.23.4
pydub 0.25.1
Pygments 2.18.0
pynvml 11.5.3
pyparsing 3.2.0
python-dateutil 2.9.0.post0
python-multipart 0.0.17
pytz 2024.2
PyYAML 6.0.2
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
requests-oauthlib 2.0.0
rich 13.9.4
rpds-py 0.21.0
rsa 4.9
ruff 0.7.2
safetensors 0.4.5
scikit-learn 1.5.1
scipy 1.14.1
semantic-version 2.10.0
sentencepiece 0.2.0
sentry-sdk 2.18.0
setproctitle 1.3.3
setuptools 75.1.0
shellingham 1.5.4
shortuuid 1.0.13
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
stack-data 0.6.3
starlette 0.41.2
svgwrite 1.4.3
swanboard 0.1.4b2
swankit 0.1.1b3
swanlab 0.3.23
sympy 1.13.3
threadpoolctl 3.5.0
timm 1.0.7
tokenizers 0.19.1
tomlkit 0.12.0
torch 2.3.1
torchtext 0.18.0
torchvision 0.18.1
tqdm 4.67.0
traitlets 5.14.3
transformers 4.42.4
triton 2.3.1
typer 0.12.5
typing_extensions 4.12.2
tzdata 2024.2
ujson 5.10.0
urllib3 2.2.3
uvicorn 0.32.0
wandb 0.18.6
wavedrom 2.0.3.post3
wcwidth 0.2.13
websockets 11.0.3
wheel 0.44.0
xmltodict 0.14.2
yarl 1.17.1
zipp 3.20.2

Configuration:

DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible)
Batch Size: 1
Gradient Accumulation Steps: 1
Mixed Precision: bf16

Steps to Reproduce:

Run the training script with the following command:

deepspeed \
    --num_nodes $SLURM_JOB_NUM_NODES \
    --num_gpus $SLURM_GPUS_PER_NODE \
    --master_addr localhost \
    --master_port 12345 \
    --hostfile hostfile_temp \
    --no_ssh_check \
    cambrian/train/train_gpu.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $ROOT_DIR/LLM/llama3-llava-next-8b \
    --version llama_v3 \
    --data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl" \
    --image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/" \
    --pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin" \
    --vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]' \
    --vision_tower_aux_token_len_list '[576, 576, 576, 9216]' \
    --image_token_len 576 \
    --num_query_group 1 \
    --query_num_list '[576]' \
    --connector_depth 3 \
    --image_position 91 \
    --vision_hidden_size 1024 \
    --connector_only False \
    --num_of_vision_sampler_layers 10 \
    --start_of_vision_sampler_layers 0 \
    --stride_of_vision_sampler_layers 3 \
    --mm_projector_type sva \
    --unfreeze_mm_vision_tower False \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir $CKPT_DIR \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 5 \
    --learning_rate 4e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --run_name $CKPT_NAME \
    --report_to wandb

Behavior:
The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.

[screenshot: CUDA out-of-memory traceback]

@TideDra
Author

TideDra commented Nov 7, 2024

OOM when finetuning Cambrian on an 80GB A100 cluster, even with a batch size of 1
(full environment, package list, and reproduction command quoted above)

How many GPUs do you use? Besides, we only tested with Vicuna-7B.

@nku-zhichengzhang

nku-zhichengzhang commented Nov 7, 2024

OOM when finetuning Cambrian on an 80GB A100 cluster, even with a batch size of 1
(full report quoted above)

How many GPUs do you use? Besides, we only tested with Vicuna-7B.

2 A100s for now, for Llama3-8B.

But I only assign 1 sample per GPU. It really confuses me. Could you give me any suggestions?

@TideDra
Author

TideDra commented Nov 7, 2024

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.
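
Concretely, the two mitigations mentioned look roughly like this (a sketch, assuming PyTorch 2.x; not a guaranteed fix for the OOM):

# 1) Reduce allocator fragmentation, as the OOM message itself suggests,
#    by exporting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before launching.
# 2) Release cached-but-unallocated blocks at a known point, e.g. right after
#    model loading and before the first training step:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()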

@nku-zhichengzhang

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

@TideDra
Author

TideDra commented Nov 7, 2024

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

We use at least 8 GPUs for pretraining and 32 GPUs for finetuning. You may try zero3, which requires less memory.

@nku-zhichengzhang

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

We use at least 8 GPUs for pretraining and 32 GPUs for finetuning. You may try zero3, which requires less memory.

Okay, thanks for the reply.

Labels: None yet
Projects: None yet
6 participants