Hi, I understand that we need to scale the learning rate in DDP so that the update stays consistent once gradients are averaged across workers. But I'm confused about the choice of `256.` in the ddp_apex Python script and, e.g., the use of `512.` in this DeiT GitHub repo.

I don't think this value can be arbitrary; it seems bound such that `lr_scaled = lr * X` with `X > 1`. If that's correct, why not just do `lr_scaled = lr * world_size`?
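To make the question concrete, here is a minimal sketch contrasting the two schemes (the variable names and values are mine; the DeiT line paraphrases the `linear_scaled_lr` computation in its `main.py`, if I read it correctly):

```python
base_lr = 5e-4       # LR tuned for some reference total batch size
per_gpu_batch = 128  # batch size per DDP process
world_size = 4       # number of DDP processes

# DeiT-style linear scaling: total batch size divided by a fixed constant.
lr_deit = base_lr * per_gpu_batch * world_size / 512.

# What I would have naively done: scale by world size alone.
lr_naive = base_lr * world_size
```

The two only agree when `per_gpu_batch` happens to equal the constant, which is why I suspect the constant encodes a reference batch size rather than anything about DDP itself.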