
Support GPU Training #64

Open
wants to merge 3 commits into base: main

Conversation

TideDra

@TideDra TideDra commented Jul 26, 2024

This PR supports training Cambrian-8B on GPUs with DeepSpeed ZeRO-2. The main modifications include:

  1. We remove all the .float() casts that were added to satisfy TPU precision and use bf16 uniformly.
  2. We fix the Llama-3 chat template bug and the tokenizer bug that adds an extra BOS token (see the sketch below).
  3. We revert the checkpoint loading and resuming implementation to the default Hugging Face Trainer implementation.
  4. We optimize data and model loading.
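
For item 2, a minimal sketch of the kind of BOS-token guard this refers to, assuming the Hugging Face transformers tokenizer API (illustrative only, not the PR's exact code):

from transformers import AutoTokenizer

# The Llama-3 chat template already emits <|begin_of_text|>, so tokenizing its output
# with the default add_special_tokens=True would prepend a second BOS token.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Describe this image."}],
    tokenize=False,
    add_generation_prompt=True,
)
input_ids = tokenizer(prompt, add_special_tokens=False).input_ids

# Defensive check: drop a duplicated BOS if one slipped in anyway.
if len(input_ids) >= 2 and input_ids[0] == input_ids[1] == tokenizer.bos_token_id:
    input_ids = input_ids[1:]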

@ellisbrown
Member

@TideDra @flyskywalkerlby thanks so much for your contribution!

I'll look through the code and add any comments/suggestions. We will also have to fire off some TPU training runs to verify that nothing impacts our TPU training before merging.

Comment on lines 15 to +19
  dependencies = [
-     "torch==2.2.0", "torchvision==0.17.0",
-     "transformers==4.37.0", "tokenizers==0.15.0", "sentencepiece==0.1.99", "shortuuid",
-     "accelerate==0.23.0", "peft==0.4.0",
-     "pydantic", "markdown2[all]", "numpy==1.26.4", "scikit-learn==1.2.2",
+     "torch==2.3.1", "torchvision==0.18.1",
+     "transformers==4.42.4", "tokenizers==0.19.1", "sentencepiece==0.2.0", "shortuuid",
+     "accelerate==0.32.1", "peft==0.11.1",
+     "pydantic", "markdown2[all]", "numpy==1.26.4", "scikit-learn==1.5.1",
Member

@tsb0601 @penghao-wu

we need to check if the most recent version of accelerate still has the TPU bugs we encountered

inference.py (outdated review thread, resolved)
@flyskywalkerlby

flyskywalkerlby commented Jul 29, 2024 via email

@wufeim

wufeim commented Oct 24, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

@TideDra
Author

TideDra commented Oct 24, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

@wufeim

wufeim commented Oct 24, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

Got it, thanks!

@dfan

dfan commented Oct 28, 2024

Is there a reason why the checkpoint saving uses torch.save()? It seems that the full model weights are stored per rank instead of the sharded model weights, so the overall size of the checkpoints is huge
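
For illustration, a minimal sketch of the alternative this points at: consolidating the save on rank 0 instead of writing the full weights from every rank. This is a hypothetical helper, not the PR's implementation, and it assumes ZeRO-2, where each rank still holds a full copy of the bf16 parameters:

import torch
import torch.distributed as dist

def save_checkpoint_rank0(model, path):
    # `model` is the unwrapped nn.Module (e.g. the DeepSpeed engine's .module).
    # Under ZeRO-2 the module parameters are replicated, so rank 0's state_dict
    # already contains the full model; writing it once avoids N duplicate copies.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    if dist.is_initialized():
        dist.barrier()  # keep the other ranks from racing ahead of the save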

@wufeim

wufeim commented Oct 28, 2024

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA, or with models trained with this Cambrian+GPU code, I couldn't reproduce the results reported in the LLaVA-1.5 paper. I'm not sure what the difference between the evaluations is, but we probably need to modify the evaluation code from LLaVA to reproduce the exact results.

@ellisbrown
Member

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports evaluating Cambrian.

It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA, or with models trained with this Cambrian+GPU code, I couldn't reproduce the results reported in the LLaVA-1.5 paper. I'm not sure what the difference between the evaluations is, but we probably need to modify the evaluation code from LLaVA to reproduce the exact results.

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

@wufeim

wufeim commented Oct 28, 2024

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

Oh I see it now. Thanks so much! I will check it out.

I was looking at the documentation here and thought they were not out yet. Maybe update the link in the README?

@wufeim

wufeim commented Oct 29, 2024

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

Hi @ellisbrown , quick questions on the evaluation code:

  1. It seems that eval/requirements.txt is missing. I guess mainly the datasets package?
  2. When I was evaluating cambrian-8b with the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?
    bash scripts/run_benchmark.sh --benchmark textvqa --ckpt nyu-visionx/cambrian-8b --conv_mode llama_3
    

@dfan

dfan commented Oct 30, 2024

+1 the eval/requirements.txt is missing. It'd be nice to know if a specific version of datasets is needed

@ellisbrown
Member

@wufeim @dfan sorry, the requirements file was masked by .gitignore; added in #82.

2. When I was evaluating cambrian-8b with the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?

@wufeim have a read through run_benchmark.sh: the questions are chunked and each GPU handles one chunk.
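
For reference, a minimal sketch of that chunking pattern (LLaVA-style helpers; the exact function names in the eval scripts may differ):

import math

def split_list(lst, n):
    # Split lst into n roughly equal chunks.
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    # Return the k-th of n chunks; each GPU process is launched with its own k.
    return split_list(lst, n)[k]

# e.g. the process bound to GPU 2 of 4 evaluates only its quarter of the questions:
# questions = get_chunk(questions, n=4, k=2)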


Let's move further discussion unrelated to this GPU training PR (#64) to separate issues, please.

@wufeim

wufeim commented Nov 1, 2024

Hi @TideDra,

I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?

Thanks!

@TideDra
Author

TideDra commented Nov 1, 2024

Hi @TideDra,

I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?

Thanks!

In general, zero3 reduces GPU memory usage but increases training time compared with zero2; in theory it does not affect model performance. So zero2 is preferred if memory is sufficient. I didn't try zero3, so I didn't encounter your issue :). But in practice, zero3 does have more bugs than zero2.
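
As a point of comparison, a rough sketch of what the stage switch looks like in a DeepSpeed config. The field names follow the DeepSpeed JSON schema, but the file name and exact values here are placeholders, not the repo's config:

import json

# Relative to zero2.json, the key change is "stage": 3, which additionally shards
# the parameters themselves (not just optimizer states and gradients) across GPUs.
zero3_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("scripts/zero3_example.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
# then pass it to the launcher: --deepspeed scripts/zero3_example.json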

@nku-zhichengzhang

nku-zhichengzhang commented Nov 7, 2024

OOM when finetuning Cambrian on an 80GB A100 cluster, even with a batch size of 1

Description:
I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists.

Environment:

Model: Cambrian
GPU: A100
DeepSpeed Version: 0.14.4
CUDA Version: 12.1
PyTorch Version: 2.3.1

Python Packages:
accelerate 0.32.1
aiofiles 23.2.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
altair 5.4.1
annotated-types 0.7.0
anyio 4.6.2.post1
asttokens 2.4.1
async-timeout 4.0.3
attrs 24.2.0
bitsandbytes 0.43.1
cachetools 5.5.0
cambrian 1.0.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
colorama 0.4.6
coloredlogs 15.0.1
contourpy 1.3.0
cos-python-sdk-v5 1.9.32
crcmod 1.7
cycler 0.12.1
decorator 5.1.1
deepspeed 0.14.4
diffusers 0.31.0
docker-pycreds 0.4.0
einops 0.8.0
einops-exts 0.0.4
exceptiongroup 1.2.2
executing 2.1.0
EzColorLog 1.0.3
fastapi 0.115.4
ffmpy 0.4.0
filelock 3.16.1
flash-attn 2.6.3
fonttools 4.54.1
frozenlist 1.5.0
fsspec 2024.10.0
ftfy 6.3.1
gcsfs 2024.10.0
gitdb 4.0.11
GitPython 3.1.43
google-api-core 2.22.0
google-auth 2.36.0
google-auth-oauthlib 1.2.1
google-cloud-core 2.4.1
google-cloud-storage 2.18.2
google-crc32c 1.6.0
google-resumable-media 2.7.2
googleapis-common-protos 1.65.0
gradio 4.16.0
gradio_client 0.8.1
h11 0.14.0
hf_transfer 0.1.8
hjson 3.1.0
httpcore 0.17.3
httpx 0.24.0
huggingface-hub 0.26.2
humanfriendly 10.0
idna 3.10
importlib_metadata 8.5.0
importlib_resources 6.4.5
ipython 8.29.0
jedi 0.19.1
Jinja2 3.1.4
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
kiwisolver 1.4.7
latex2mathml 3.77.0
markdown-it-py 3.0.0
markdown2 2.5.1
MarkupSafe 2.1.5
matplotlib 3.9.2
matplotlib-inline 0.1.7
mdurl 0.1.2
mpmath 1.3.0
multidict 6.1.0
narwhals 1.13.2
networkx 3.4.2
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
open_clip_torch 2.29.0
orjson 3.10.11
packaging 24.1
pandas 2.2.3
parso 0.8.4
peewee 3.17.7
peft 0.11.1
pexpect 4.9.0
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
prompt_toolkit 3.0.48
propcache 0.2.0
proto-plus 1.25.0
protobuf 5.28.3
psutil 6.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pycryptodome 3.21.0
pydantic 2.9.2
pydantic_core 2.23.4
pydub 0.25.1
Pygments 2.18.0
pynvml 11.5.3
pyparsing 3.2.0
python-dateutil 2.9.0.post0
python-multipart 0.0.17
pytz 2024.2
PyYAML 6.0.2
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
requests-oauthlib 2.0.0
rich 13.9.4
rpds-py 0.21.0
rsa 4.9
ruff 0.7.2
safetensors 0.4.5
scikit-learn 1.5.1
scipy 1.14.1
semantic-version 2.10.0
sentencepiece 0.2.0
sentry-sdk 2.18.0
setproctitle 1.3.3
setuptools 75.1.0
shellingham 1.5.4
shortuuid 1.0.13
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
stack-data 0.6.3
starlette 0.41.2
svgwrite 1.4.3
swanboard 0.1.4b2
swankit 0.1.1b3
swanlab 0.3.23
sympy 1.13.3
threadpoolctl 3.5.0
timm 1.0.7
tokenizers 0.19.1
tomlkit 0.12.0
torch 2.3.1
torchtext 0.18.0
torchvision 0.18.1
tqdm 4.67.0
traitlets 5.14.3
transformers 4.42.4
triton 2.3.1
typer 0.12.5
typing_extensions 4.12.2
tzdata 2024.2
ujson 5.10.0
urllib3 2.2.3
uvicorn 0.32.0
wandb 0.18.6
wavedrom 2.0.3.post3
wcwidth 0.2.13
websockets 11.0.3
wheel 0.44.0
xmltodict 0.14.2
yarl 1.17.1
zipp 3.20.2

Configuration:

DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible)
Batch Size: 1
Gradient Accumulation Steps: 1
Mixed Precision: bf16

Steps to Reproduce:

Run the training script with the following command:

deepspeed \
    --num_nodes $SLURM_JOB_NUM_NODES \
    --num_gpus $SLURM_GPUS_PER_NODE \
    --master_addr localhost \
    --master_port 12345 \
    --hostfile hostfile_temp \
    --no_ssh_check \
    cambrian/train/train_gpu.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $ROOT_DIR/LLM/llama3-llava-next-8b \
    --version llama_v3 \
    --data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl" \
    --image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/" \
    --pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin" \
    --vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]' \
    --vision_tower_aux_token_len_list '[576, 576, 576, 9216]' \
    --image_token_len 576 \
    --num_query_group 1 \
    --query_num_list '[576]' \
    --connector_depth 3 \
    --image_position 91 \
    --vision_hidden_size 1024 \
    --connector_only False \
    --num_of_vision_sampler_layers 10 \
    --start_of_vision_sampler_layers 0 \
    --stride_of_vision_sampler_layers 3 \
    --mm_projector_type sva \
    --unfreeze_mm_vision_tower False \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir $CKPT_DIR \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 5 \
    --learning_rate 4e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --run_name $CKPT_NAME \
    --report_to wandb

Behavior:
The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.

[screenshot: CUDA out-of-memory traceback]

@TideDra
Author

TideDra commented Nov 7, 2024

OOM when finetuning Cambrian on an 80GB A100 cluster, even with a batch size of 1
(full environment, package list, and reproduction command quoted above)

How many GPUs do you use? Besides, we only tested with Vicuna-7B.

@nku-zhichengzhang

nku-zhichengzhang commented Nov 7, 2024

OOM when finetuning Cambrian on an 80GB A100 cluster, even with a batch size of 1
(full report quoted above)

How many GPUs do you use? Besides, we only tested with Vicuna-7B.

2 A100s for now, for Llama3-8B.

But I only assign 1 sample per GPU. It really confuses me. Could you give me any suggestions?

@TideDra
Author

TideDra commented Nov 7, 2024

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.
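
Concretely, the two mitigations mentioned look roughly like this (a sketch, assuming PyTorch 2.x; not a guaranteed fix for the OOM):

# 1) Reduce allocator fragmentation, as the OOM message itself suggests,
#    by exporting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before launching.
# 2) Release cached-but-unallocated blocks at a known point, e.g. right after
#    model loading and before the first training step:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()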

@nku-zhichengzhang

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

@TideDra
Author

TideDra commented Nov 7, 2024

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

We use at least 8 GPUs for pretraining and 32 GPUs for finetuning. You may try zero3, which requires less memory.

@nku-zhichengzhang

@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error, or try clearing the CUDA cache.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

We use at least 8 GPUs for pretraining and 32 GPUs for finetuning. You may try zero3, which requires less memory.

Okay, thanks for the reply.

Labels: None yet
Projects: None yet
6 participants