Support GPU Training #64
base: main
Conversation
Co-authored-by: flyskywalkerlby <[email protected]>
@TideDra @flyskywalkerlby thanks so much for your contribution! I'll look through the code and add any comments/suggestions. We will also have to fire off some TPU training runs to verify that nothing impacts our TPU training before merging.
dependencies = [
-    "torch==2.2.0", "torchvision==0.17.0",
-    "transformers==4.37.0", "tokenizers==0.15.0", "sentencepiece==0.1.99", "shortuuid",
-    "accelerate==0.23.0", "peft==0.4.0",
-    "pydantic", "markdown2[all]", "numpy==1.26.4", "scikit-learn==1.2.2",
+    "torch==2.3.1", "torchvision==0.18.1",
+    "transformers==4.42.4", "tokenizers==0.19.1", "sentencepiece==0.2.0", "shortuuid",
+    "accelerate==0.32.1", "peft==0.11.1",
+    "pydantic", "markdown2[all]", "numpy==1.26.4", "scikit-learn==1.5.1",
we need to check if the most recent version of accelerate still has the TPU bugs we encountered
No problem. We changed this path to verify the experimental results.
@ellisbrown commented on this pull request.
In inference.py:
-model_path = os.path.expanduser("nyu-visionx/cambrian-8b")
+model_path = os.path.expanduser("./checkpoints/cambrian-8b-finetune")
let's not change this to preserve the default behavior?
Hi @TideDra, thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!
VLMEvalKit supports evaluating Cambrian.
Got it, thanks!
Is there a reason why the checkpoint saving uses torch.save()? It seems that the full model weights are stored per rank instead of the sharded model weights, so the overall size of the checkpoints is huge.
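For reference, a minimal sketch (illustrative names, not this PR's actual code) contrasting the two saving patterns being discussed, assuming model_engine is the object returned by deepspeed.initialize():

```python
import torch
import torch.distributed as dist

def save_full_copy_per_rank(model_engine, path):
    # Under ZeRO-2 the parameters are replicated on every rank, so saving the
    # full state dict from each rank writes one complete copy per GPU,
    # which is why the checkpoints end up so large.
    torch.save(model_engine.module.state_dict(),
               f"{path}/pytorch_model_rank{dist.get_rank()}.bin")

def save_sharded_or_rank0_only(model_engine, path):
    # DeepSpeed's own API writes a sharded ZeRO checkpoint (one shard per rank)
    # that can later be consolidated with zero_to_fp32.py.
    model_engine.save_checkpoint(path, tag="latest")
    # Alternatively, write a single full copy only on rank 0.
    if dist.get_rank() == 0:
        torch.save(model_engine.module.state_dict(), f"{path}/pytorch_model.bin")
```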
It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA or models trained with this cambrian+gpu code, I couldn't reproduce the results reported in the LLaVA v1.5 paper. I'm not sure what the difference between the evaluations is, but we probably need to modify the evaluation code from LLaVA to reproduce the exact results?
@wufeim note that we have released our eval code here: https://github.com/cambrian-mllm/cambrian/tree/main/eval
Oh I see it now. Thanks so much! I will check it out. I was looking at the documentation here and thought it was not out yet. Maybe update the link in the README?
Hi @ellisbrown, quick questions on the evaluation code:
+1, the eval/requirements.txt is missing. It'd be nice to know if a specific version of datasets is needed.
@wufeim @dfan sorry, the requirements file was masked by .gitignore. Added in #82.
@wufeim have a read through run_benchmark.sh: the questions are chunked and each GPU handles one chunk. Let's please move further discussion unrelated to this GPU training PR #64 to separate issues.
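For anyone skimming, the chunking idea is roughly the following (a sketch with made-up names, not the script's exact code):

```python
import math

def get_chunk(items, num_chunks, chunk_idx):
    # Split `items` into `num_chunks` roughly equal slices and return slice `chunk_idx`.
    chunk_size = math.ceil(len(items) / num_chunks)
    return items[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]

# With 8 GPUs, process k evaluates only its own chunk of the benchmark questions,
# and the per-chunk answer files are merged afterwards.
questions = list(range(1000))  # placeholder for the loaded benchmark questions
per_gpu = [get_chunk(questions, 8, k) for k in range(8)]
assert sum(len(c) for c in per_gpu) == len(questions)
```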
Hi @TideDra, I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning, whereas LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed, but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero2/3? Thanks!
In general, zero3 reduces GPU memory usage while increasing training time compared with zero2, but theoretically it does not affect model performance. So zero2 is preferred if memory is sufficient. I didn't try zero3, so I didn't encounter your issue :). But in practice, zero3 does have more bugs than zero2.
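To make the trade-off concrete, here is a minimal sketch of the relevant DeepSpeed config difference (assumed settings, not necessarily the configs shipped in this repo):

```python
# ZeRO-2: shards optimizer states and gradients; parameters stay replicated.
zero2_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# ZeRO-3: additionally shards the parameters themselves, saving memory at the
# cost of extra communication (hence the slower training mentioned above).
zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```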
@nku-zhichengzhang it seems that 15.58 GB of memory is reserved by PyTorch but unallocated. You may follow the instructions given in the error message, or try clearing the CUDA cache.
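A small sketch of what "clear the CUDA cache" could look like in practice (a suggestion under assumptions, not a verified fix for this particular OOM):

```python
import torch

# Reserved-but-unallocated memory lives in PyTorch's caching allocator.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Release cached blocks back to the driver.
torch.cuda.empty_cache()

# The OOM message may also suggest allocator tuning, e.g. setting the environment
# variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before launching.
```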
Loading the Llama3 model occupies 15 GB of memory. BTW, how many GPUs did you use to train the Vicuna model?
We used at least 8 GPUs for pretraining and 32 GPUs for finetuning. You may try zero3, which requires less memory.
Okay, thanks for the reply.
This PR supports training Cambrian-8b on GPU with deepspeed zero2. The main modifications include removing the .float() casts that were added to satisfy TPU precision and using bf16 uniformly.
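As a rough illustration of the precision change (an assumed pattern, not this PR's exact diff): the TPU-oriented path upcasts tensors to float32 for precision-sensitive ops, while the GPU path keeps everything in bf16.

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 4096, dtype=torch.bfloat16)

# TPU-oriented path: upcast to float32 for the sensitive op, then cast back.
y_tpu_style = F.layer_norm(x.float(), (4096,)).to(x.dtype)

# bf16-unified GPU path: stay in bf16 end to end.
y_gpu_style = F.layer_norm(x, (4096,))
```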