When I tried to train T2T-ViT-14 with the released hyperparameters, I ran into a NaN loss problem. After turning AMP off, the loss became stable. Still, I wanted to find the cause so that I could compare the effect of AMP on training speed and keep using it later.
I suspected that some modules were the problem, so I disabled autocast for each module in turn using torch.cuda.amp.autocast(enabled=False). I started with the performer's prm_exp, which others had previously suspected, but that didn't fix it.
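For reference, this is roughly how I forced a single submodule to run in fp32 while the rest of the model stays under AMP (FP32Wrapper is just an illustrative name, not something in the repo):

```python
import torch
import torch.nn as nn

class FP32Wrapper(nn.Module):
    """Run a wrapped submodule in fp32 even inside an autocast region."""

    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        with torch.cuda.amp.autocast(enabled=False):
            # Upstream layers may have produced fp16 activations under
            # autocast, so cast the input back to fp32 before the call.
            return self.module(x.float())
```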
For ViT itself there seem to be few reports of AMP-related NaN losses, so I disabled autocast separately in the T2T module and in the ViT backbone and watched the loss. NaN loss did not occur when autocast was disabled in the T2T module, and by repeating the same process inside the T2T module, I found that the einsum inside def single_attn of Token_performer was the problem.
So I replaced it with @ and transpose like the other Attention modules, but NaN loss still occurred. Therefore, rather than an incompatibility between einsum and AMP, it seems the problem comes from loss of information when the computation is done in fp16.
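This is roughly the replacement I mean, assuming the einsum contractions take the 'bin,bim->bnm' and 'bti,bni->btn' forms used in Token_performer (the shapes here are only illustrative):

```python
import torch

B, T, n, m = 2, 196, 64, 256
v  = torch.randn(B, T, n)   # values
kp = torch.randn(B, T, m)   # kernelized keys
qp = torch.randn(B, T, m)   # kernelized queries

# einsum version
kptv_einsum = torch.einsum('bin,bim->bnm', v, kp)         # (B, n, m)
y_einsum = torch.einsum('bti,bni->btn', qp, kptv_einsum)  # (B, T, n)

# @ / transpose version
kptv_matmul = v.transpose(1, 2) @ kp                      # (B, n, m)
y_matmul = qp @ kptv_matmul.transpose(1, 2)               # (B, T, n)

print(torch.allclose(kptv_einsum, kptv_matmul, atol=1e-5),
      torch.allclose(y_einsum, y_matmul, atol=1e-5))
```

The two forms are numerically equivalent in fp32, which is consistent with the NaN not being specific to einsum.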
My devices are 4 RTX A6000 GPUs.
I hope it will help others.
Additionally, the problem may be that 1e-8 is rounded to 0 in fp16, so the epsilon added for numerical stability is never actually applied. If I have time, I plan to experiment with 1e-7.
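A quick check of this hypothesis: 1e-8 is below the smallest fp16 subnormal (about 6e-8), so it underflows to 0, while 1e-7 survives.

```python
import torch

print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-7, dtype=torch.float16))  # tensor(1.1921e-07, dtype=torch.float16)
```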