Making a custom transformer architecture work with opacus #644
Comments
A quick question/guess: is there any model parameter that goes through several (X) forward passes but fewer than X backward passes? For those parameters, per_sample_grad will not be calculated/stored correctly, which might lead to this issue.
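A general case of that mismatch can be sketched in plain PyTorch, without opacus. This is only an illustration under assumptions: `count_passes`, `shared`, and `run` are hypothetical names, and the hooks here merely count calls rather than compute per-sample gradients. The same module is called twice, but one output is detached, so only one call takes part in the backward pass.

```python
import torch
import torch.nn as nn


def count_passes(module, run):
    """Count how often `module` runs forward and backward while `run()` executes."""
    counts = {"forward": 0, "backward": 0}

    def on_forward(mod, inputs, output):
        counts["forward"] += 1

    def on_backward(mod, grad_input, grad_output):
        counts["backward"] += 1

    handles = [
        module.register_forward_hook(on_forward),
        module.register_full_backward_hook(on_backward),
    ]
    try:
        run()
    finally:
        for handle in handles:
            handle.remove()
    return counts


shared = nn.Linear(4, 4)  # hypothetical module reused within one step
x = torch.randn(3, 4)


def run():
    a = shared(x)            # first forward pass, stays in the autograd graph
    b = shared(x).detach()   # second forward pass, cut out of the graph
    (a + b).sum().backward()


print(count_passes(shared, run))
```

If the forward count ends up higher than the backward count for a module, that module is a candidate for the situation described above.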
Thank you for responding. I am not sure what you mean by X forward passes and fewer than X backward passes. Could you give me a reproducible general case like that? Here is how the model behaves without opacus (it runs): https://colab.research.google.com/drive/1CjPdzUaThLKrY0vVUUM-__zrLgMsFxe7?usp=sharing I defined all the encoders and knowledge retrievers separately because I needed to eliminate problems with conditional statements/computation paths.
Thanks for your detailed response. By any chance, could we experiment with one single block at a time (for example, the Question encoder) to see whether the problem replicates? Specifically, given x as the output of the Question encoder, define some dummy label y and any loss function L, and do the backward pass of L(x, y). Then we can see whether per_sample_grad is empty or not.
I am trying to make an architecture work with opacus. It consists of two encoders that use self-attention and produce the context embeddings x_t and y_t. The “Knowledge Retriever” uses masked attention.
I suppose there are a few issues with this. The original model uses a modified multihead attention that applies an exponential decay function to the scaled dot product, together with a distance adjustment factor gamma that requires no gradient. It reuses model parameters that have already been calculated to obtain the distance adjustments. This causes conflicts with opacus, for which I will create a separate issue later.
For simplicity, I have used only multihead attention here to avoid those conflicts with opacus. Here is the notebook that can be used to reproduce this: https://colab.research.google.com/drive/1Sp3jILzB3HvizIAw3OTiQGnVq7LB5gee?usp=sharing
And this still produces the following error:
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1352: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
ValueError Traceback (most recent call last)
in <cell line: 1>()
----> 1 best_epoch = train_one_dataset(train_q_data, train_qa_data, train_pid, valid_q_data, valid_qa_data, valid_pid)
2
5 frames
in train_one_dataset(train_q_data, train_qa_data, train_pid, valid_q_data, valid_qa_data, valid_pid)
37 for idx in range(max_iter):
38 # Train Model
---> 39 train_loss, train_accuracy, train_auc = train(
40 dp_model, dp_optimizer, train_q_data, train_qa_data, train_pid, accountant, label='Train')
41 # Validation step
in train(net, optimizer, q_data, qa_data, pid_data, accountant, label)
89 net.parameters(), max_norm=maxgradnorm)
90
---> 91 optimizer.step()
92
93 # correct: 1.0; wrong 0.0; padding -1.0
/usr/local/lib/python3.10/dist-packages/opacus/optimizers/optimizer.py in step(self, closure)
516 closure()
517
--> 518 if self.pre_step():
519 return self.original_optimizer.step()
520 else:
/usr/local/lib/python3.10/dist-packages/opacus/optimizers/optimizer.py in pre_step(self, closure)
494 # The corner case when the optimizer has no trainable parameters.
495 # Essentially the DPOptimizer act as a normal optimizer
--> 496 if self.grad_samples is None or len(self.grad_samples) == 0:
497 return True
498
/usr/local/lib/python3.10/dist-packages/opacus/optimizers/optimizer.py in grad_samples(self)
343 ret = []
344 for p in self.params:
--> 345 ret.append(self._get_flat_grad_sample(p))
346 return ret
347
/usr/local/lib/python3.10/dist-packages/opacus/optimizers/optimizer.py in _get_flat_grad_sample(self, p)
280 )
281 if p.grad_sample is None:
--> 282 raise ValueError(
283 "Per sample gradient is not initialized. Not updated in backward pass?"
284 )
ValueError: Per sample gradient is not initialized. Not updated in backward pass?
There is also some behavior worth noting. The transformer layers are initialized in the architecture class. In the forward pass, the x and y embeddings are passed into the encoders. A flag controls when the knowledge retriever block (masked attention) is executed. This is clearer in the forward pass of the transformer layer, where the “if” block is for the masked attention (knowledge retriever) and the “else” block corresponds to the encoders on the left (see the picture in the notebook). All three components use the same forward pass (see the forward calls of the Architecture and Transformer Layer classes).
The training/optimizer step only seems to execute when I leave out the if/else conditions and have one forward pass for all three parts of the model: the two encoders and the knowledge retriever that uses masked attention.
Is there a way around this? Is there a way this could be reimplemented so that per-sample gradient computation works?
Notebook without opacus:
https://colab.research.google.com/drive/1jg-ygK7Vfou-IaJqNaujk-CHfdMruup3?usp=sharing
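One way to check whether the if/else branching described above leaves some parameters out of the backward pass is sketched below in plain PyTorch. This is a diagnostic under assumptions, not the notebook's code: `BranchyLayer`, `apply_ffn`, and `params_without_grad` are hypothetical names, and `p.grad is None` is used only as a proxy for a missing per-sample gradient.

```python
import torch
import torch.nn as nn


class BranchyLayer(nn.Module):
    """Hypothetical layer whose computation path depends on a flag."""

    def __init__(self, dim=8):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.ffn = nn.Linear(dim, dim)

    def forward(self, x, apply_ffn):
        out = torch.relu(self.shared(x))
        if apply_ffn:  # only this branch touches self.ffn
            out = self.ffn(out)
        return out


def params_without_grad(model, *args):
    """Run one forward/backward pass and list trainable parameters
    that received no gradient at all."""
    model.zero_grad(set_to_none=True)
    model(*args).sum().backward()
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


layer = BranchyLayer()
x = torch.randn(4, 8)
print(params_without_grad(layer, x, True))   # every parameter is on the path
print(params_without_grad(layer, x, False))  # parameters skipped by the branch
```

Parameters listed for a given flag value are trainable but never reached in that configuration, which is the kind of parameter the DP optimizer would then find without a per-sample gradient.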