-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix diloco 2 #13
fix diloco 2 #13
Conversation
e5fc6a0
to
294c36d
Compare
src/zeroband/diloco.py
Outdated
""" | ||
|
||
self._logger.debug("sync inner model") | ||
for param_offloaded, param in zip(self.param_list_cpu, model.parameters()): | ||
param.data = param_offloaded.data.to("cuda") # todo: use copy_ here | ||
param.data.copy_(param_offloaded.data.to(param.device)) # todo: use copy_ here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the .to(device) isnt needed here. You should be able to directly copy the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting so it will handle the move to device by itself ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my understanding is that a.copy_(b) means copy the content of b to a inplace so it works across devices and doesnt allocate a new intermediate storage.
meanwhile param.copy_(param_offloaded.data.to(param.device)) will first evaluate the .to
which creates a new intermediate storage then copies the content. might be wrong tho, not sure how smart compilers/interpreters are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
330bdd3
to
9d801d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LFGTM!
FIX DILOCO LFG
This pr fix the diloco intregration. The main bug was that model were not initialized the same across diloco rank. We used to fix this by pulling the weight init from hf. We don't do this anymore because we don't use transformer for the modeling.
This pr introduces a common seed. I checked, and the weights are correctly initialized now.
The loss is looking better now
Additionally the PR add testing for the diloco all reduce as well as some small fix and enhancement