Gradient clipping doesn't work with FSDP CPU offloading #1977
Comments
@ebsmothers, do you think it would make sense to ping someone from FSDP?
Could you try modifying the process group initialization here? https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py#L903
I don’t think we want to modify init_process_group here. To me that error indicates that we are trying to call some comms primitive on a tensor that’s already on CPU, which we shouldn’t be doing. Initializing process group on CPU would only be helpful if we actually want distributed training on CPU, which we don’t. Let’s debug a bit more and then we can loop in distributed folks if needed.
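For reference, the error is what any collective issued on a CPU tensor produces when the process group only has a CUDA backend registered. A minimal sketch, assuming a torchrun launch on a machine with GPUs:

```python
# Minimal sketch of the failure mode (not torchtune code); launch with torchrun.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # CUDA-only backend; nothing registered for CPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

t = torch.ones(1)   # a CPU tensor
dist.all_reduce(t)  # raises: RuntimeError: No backend type associated with device type cpu
```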
I believe when CPU offload is used in FSDP, gradients will be transferred to CPU during the backward pass (to free up gradient memory, similar to optim-in-backward) so the optimizer step can be performed on CPU. That's probably why you see this error. It's probably faster to check with the distributed folks whether FSDP with CPU offload supports gradient clipping in general. Even if it is technically possible (e.g. doing the clipping on CPU), I think it would be too slow and possibly require changes to internal FSDP code.
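To make that concrete, here is a rough sketch of how the two settings interact (illustrative only, not the actual torchtune recipe; it assumes the FSDP2 fully_shard API with CPUOffloadPolicy, which is roughly what fsdp_cpu_offload enables):

```python
# Illustrative sketch only, not the torchtune recipe; launch with torchrun.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard, CPUOffloadPolicy

dist.init_process_group(backend="nccl")  # NCCL only, as in the failing setup
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Linear(1024, 1024)
fully_shard(model, offload_policy=CPUOffloadPolicy())  # params/grads kept on CPU between steps
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()  # with CPU offload, the sharded gradients end up on CPU here

# clip_grad_norm_ reduces the per-rank norms across ranks; the gradients are CPU
# tensors, so that collective is dispatched to the (missing) CPU backend and fails
# with the same error as in the snippet above.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()
```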
Looks like the torchtitan repo ran into the same issue and someone created a quick workaround in a special branch:
I'm hitting this same issue when doing a full fine-tune of a 70B Llama on a single node. Any "proper" way of solving this? I'll check out the torchtitan solution.

Edit: the torchtitan solution is basically just "don't use grad clipping"?
@gordicaleksa yeah you're right... it seems to me like the error in that PR is the same as what's being described in this issue. cc @weifengpy @mori360 what is the status of pytorch/torchtitan#622?
@ebsmothers To deal with the backend issue across device types during offloading, we used a gloo process group for CPU in addition to the NCCL process group for CUDA.
I thought we landed a PR to support cpu offloading in torchtitan?
@mori360 the error is RuntimeError: No backend type associated with device type cpu.
We used gloo as the backend for collectives on CPU tensors when initializing the process group.
@mori360 can you share more info on why adding gloo backend for CPU solves the issue? Is there some 1:1 mapping between FSDP's process group on CUDA and a CPU process group when CPU offloading is enabled? My assumption was that any CPU offloading would offload to a single CPU process, but maybe that was incorrect?
Yeah, CPU offloading would offload to a single CPU process, however gradient clipping needs communication across ranks to compute the total gradient norm, and since the offloaded gradients live on CPU that collective has to go through a CPU backend like gloo.
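In other words, the all-reduce for the total grad norm needs a CPU-capable backend to dispatch to when the gradients are offloaded. A minimal sketch of what adding a gloo backend for CPU at process-group init could look like (the helper name is illustrative, and the exact wiring in torchtune/torchtitan may differ; see #2108 below):

```python
# Sketch of initializing per-device-type backends; an assumption based on the
# gloo-for-CPU approach described above, not necessarily the exact fix that landed.
import torch.distributed as dist

def init_distributed(fsdp_cpu_offload: bool) -> None:
    if fsdp_cpu_offload:
        # NCCL handles collectives on CUDA tensors; gloo handles collectives on
        # CPU tensors (e.g. the grad-norm all-reduce when grads are offloaded).
        dist.init_process_group(backend="cuda:nccl,cpu:gloo")
    else:
        dist.init_process_group(backend="nccl")
```

With a per-device-type mapping like this, collectives on CUDA tensors still go through NCCL, while the CPU grad-norm reduction is handled by gloo.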
Thanks @mori360 and @weifengpy for the explanation here. I guess @RdoubleA was right from the outset (sorry for derailing things). I just opened #2108 for this.
@ebsmothers Can this issue be closed now?
Yeah we can close this.
Original issue description:
I am running the full finetune distributed recipe. When setting clip_grad_norm: 1.0 and fsdp_cpu_offload: True, it raises RuntimeError: No backend type associated with device type cpu.
Full error stack trace:
Wondering how we should fix this error?