How to correctly set bf16 for training, and does the DeepSpeed config's additional offload/bf16 conflict with the qlora-pipe settings? #16
I also have been experimenting with full fine tuning of an 8b model. I just pushed some commits which add the adamw_kahan optimizer type. There's a section in the README that now discusses floating point precision as well. For FFT I recommend setting everything to bf16 and using the adamw_kahan optimizer.

It is expected that Deepspeed's bf16 mode uses more memory. I think it wraps the optimizer, and does master weights + gradient accumulation + optimizer states all in fp32. This will use much more memory than full bf16 + Kahan summation in the optimizer. I would not use Deepspeed's bf16 mode unless you have a very large amount of VRAM to spare.

If you are setting model_weight_dtype to bf16, it should not be loading the model in fp32. Can you call the print_model_info() function on the pipeline_model after it is loaded? It will show you the dtype of the model weights. If they really are in fp32 despite setting bf16 in the config, there is some bug or edge case somewhere. One more thing to check: are the model weights on disk fp32? Perhaps it is somehow ignoring the config and just loading it as the dtype the model is stored in.
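To illustrate why full bf16 plus Kahan summation can get away without an fp32 master copy of the weights, here is a conceptual sketch of a Kahan-compensated bf16 weight update. This is not the repo's actual adamw_kahan implementation; the function name and the way `update` is produced are made up for illustration. The point is that only one extra bf16 buffer per parameter is needed.

```python
import torch

def kahan_step(param: torch.Tensor, update: torch.Tensor, comp: torch.Tensor) -> None:
    """One bf16 parameter update with Kahan compensation (conceptual sketch).

    param  - bf16 weight tensor, updated in place
    update - the step the optimizer wants to apply, e.g. -lr * adam_direction
    comp   - bf16 buffer carrying the rounding error from previous steps
    """
    y = update - comp              # re-inject the low-order bits lost on earlier steps
    t = param + y                  # bf16 add; a small update may partly round away here
    comp.copy_((t - param) - y)    # record what was lost (or over-added) this step
    param.copy_(t)                 # commit the new weight
```

The compensation buffer carries the bits that a plain bf16 `param += update` would round away, so tiny optimizer steps still accumulate over time. That buffer costs 2 bytes per parameter, versus the 4-byte fp32 master weights (plus fp32 gradients and optimizer states) that Deepspeed's bf16 mode keeps around.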
Hi @tdrussell, Really appreciate the helpful clarifications and explanations you provide! The updated repo also helps a lot to make things clearer. I used the latest repo, called print_model_info(), and I can confirm the model is loaded in bf16 now :)
I think it is expected to OOM on 8x80 GB VRAM. In my understanding, this is how much memory we need per parameter:

- model weights
- gradients
- optimizer first moment (AdamW exp_avg)
- optimizer second moment (AdamW exp_avg_sq)
- Kahan summation compensation buffer

Each of these is bf16, so we have 10 bytes per parameter of fixed state to do FFT using this setup. The 700 GB required already OOMs, and you still need a bit for activations. 960 GB should be enough though. Note that traditional mixed precision would use even more, because it keeps an fp32 master copy of the weights as well. There are some changes to the optimizer you could make to try to lower VRAM a bit more.
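A quick sanity check of that arithmetic, as a sketch assuming a 70b-parameter model and 8x80 GB of GPU memory as discussed in this thread:

```python
# Rough fixed-state memory estimate for full fine-tuning with everything in bf16
# and a Kahan-compensated AdamW (5 bf16 quantities per parameter, see the list above).
params = 70e9                     # 70b-parameter model (assumed)
bytes_per_param = 5 * 2           # 5 bf16 tensors x 2 bytes each = 10 bytes
fixed_state_gb = params * bytes_per_param / 1e9

available_gb = 8 * 80             # e.g. 8 GPUs with 80 GB each
print(f"fixed state: {fixed_state_gb:.0f} GB vs {available_gb} GB available")
# -> fixed state: 700 GB vs 640 GB available: OOM before activations are even counted
```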
Regarding multi-node training, I have developed everything with the assumption that it is only 1 node, and it is untested on more than 1. But in principle, I think it should work, assuming Deepspeed pipeline parallelism supports it. I would not be surprised if there are some places in the code that would break and need fixes.
Hi @tdrussell, thank you so much for your detailed answer!
@iamhappytoo Still having this issue? A stack trace and/or logs would be helpful. |
Hello @tdrussell,
First of all, thank you very much for your great repo! It is absolutely great work, pulling all these optimization solutions together.
When I use the repo, I try to enable bf16 in the DeepSpeed config to support full-parameter finetuning of a 70b model. However, it seems to give me OOM with even larger memory usage than without it.
With the default setting (not setting bf16: true in the DeepSpeed config, and setting every possible option in the .toml config to bfloat16), the evaluation phase still seems to be using float32, with the 70b model consuming 33 GB * 8 of memory during evaluation, and OOMing when training starts.
I'm wondering if the DeepSpeed config is effectively deprecated in qlora-pipe, so one should not use it?
And how should I correctly set bf16 in the code?
Thank you so much in advance!
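For context on the two places bf16 can be configured here: the first is DeepSpeed's own bf16 section (a standard DeepSpeed config option), which the reply above advises against because it keeps fp32 master state; the second is the qlora-pipe TOML config, where model_weight_dtype is mentioned above. The sketch below uses Python dicts purely for illustration; apart from the DeepSpeed bf16 section and model_weight_dtype, the key names and nesting are assumptions, not the repo's actual schema.

```python
# DeepSpeed's own bf16 mode (the "bf16: true in deepspeed config" being asked about).
# Per the reply above, it wraps the optimizer with fp32 master weights, gradient
# accumulation, and optimizer states, so it costs much more memory:
ds_config = {
    "bf16": {"enabled": True},
    # ... other DeepSpeed settings ...
}

# The path recommended in this thread is instead to leave DeepSpeed's bf16 mode off
# and set the dtypes in the qlora-pipe TOML config plus the adamw_kahan optimizer.
# Shown as a dict only for illustration; keys besides model_weight_dtype are hypothetical:
qlora_pipe_config = {
    "model_weight_dtype": "bfloat16",
    "optimizer": {"type": "adamw_kahan"},  # hypothetical nesting
}
```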