Hi, I understand that we need to scale the learning rate in DDP so that the update stays consistent once gradients are averaged across workers. But I'm confused about the choice of `256.` in the ddp_apex Python script and, e.g., the use of `512.` in this DeiT GitHub repo.

I don't think this value can be arbitrary; it seems bound such that `lr_scaled = lr * X` with `X > 1`. If that's correct, why not just do `lr_scaled = lr * world_size`?
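To make the question concrete, here is a minimal sketch contrasting the two schemes (the variable names and values are mine; the DeiT line paraphrases the `linear_scaled_lr` computation in its `main.py`, if I read it correctly):

```python
base_lr = 5e-4       # LR tuned for some reference total batch size
per_gpu_batch = 128  # batch size per DDP process
world_size = 4       # number of DDP processes

# DeiT-style linear scaling: total batch size divided by a fixed constant.
lr_deit = base_lr * per_gpu_batch * world_size / 512.

# What I would have naively done: scale by world size alone.
lr_naive = base_lr * world_size
```

The two only agree when `per_gpu_batch` happens to equal the constant, which is why I suspect the constant encodes a reference batch size rather than anything about DDP itself.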