🧐 Problem Description
OLMoE disables normalization of the top-k routing probabilities, and there is no clear motivation or ablation for this choice. DeepSeekMoE also disables top-k normalization, while Mixtral-8x7B-v0.1 normalizes them.
💡 Proposed Solution
Apply softmax before `torch.topk` in `Fast-LLM/fast_llm/layers/transformer/mixture_of_experts.py` (line 167 at commit 51d5715), so that the top-k probabilities are taken from the softmax over all experts and left unnormalized, matching OLMoE.
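For illustration, here is a minimal sketch of the two routing variants (the function names, shapes, and toy example are mine, not the actual Fast-LLM code):

```python
import torch


def route_softmax_before_topk(logits: torch.Tensor, k: int):
    # OLMoE-style: softmax over all expert logits, then select top-k.
    # The selected probabilities are not renormalized, so they sum to < 1 per token.
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, k, dim=-1)
    return top_probs, top_idx


def route_topk_then_softmax(logits: torch.Tensor, k: int):
    # Current behavior: select top-k logits, then softmax over just those k,
    # so the routing weights of the selected experts sum to 1 per token.
    top_logits, top_idx = torch.topk(logits, k, dim=-1)
    top_probs = torch.softmax(top_logits, dim=-1)
    return top_probs, top_idx


if __name__ == "__main__":
    logits = torch.randn(4, 8)  # 4 tokens, 8 experts
    print(route_softmax_before_topk(logits, 2)[0].sum(dim=-1))  # sums < 1
    print(route_topk_then_softmax(logits, 2)[0].sum(dim=-1))    # sums == 1
```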
🔄 Alternatives Considered
Keep normalizing the top-k scores as usual, since there is no clear motivation for disabling it. Conveniently, this behavior is config-driven in the HF implementation of OLMoE.
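A sketch of what a config-driven variant could look like (the `normalize_topk` flag below is hypothetical; in HF's OLMoE implementation the analogous option is, to my knowledge, `norm_topk_prob`):

```python
import torch


def route(logits: torch.Tensor, k: int, normalize_topk: bool = True):
    # Hypothetical unified router: one flag switches between the two behaviors.
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, k, dim=-1)
    if normalize_topk:
        # Renormalize so the selected experts' weights sum to 1 (equivalent to
        # taking the softmax over only the top-k logits, as Mixtral does).
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)
    # With normalize_topk=False this reduces to the OLMoE-style routing above.
    return top_probs, top_idx
```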
📈 Potential Benefits
No clear benefits; if anything, it could slow down training slightly, since the softmax would now be applied over the logits of all experts rather than only the top-k.
📝 Additional Context
See the OLMoE implementation for reference.