🧐 Problem Description
OLMoE disables normalization of the top-k routing probabilities, and there is no clear motivation or ablation for this choice. DeepSeekMoE also disables top-k normalization, while Mixtral-8x7B-v0.1 normalizes them.
💡 Proposed Solution
Apply softmax before `torch.topk` in `Fast-LLM/fast_llm/layers/transformer/mixture_of_experts.py` (line 167 at commit 51d5715), so that the top-k probabilities are taken from the softmax over all experts and left unnormalized, matching OLMoE.
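For illustration, here is a minimal sketch of the two routing variants (the function names, shapes, and toy example are mine, not the actual Fast-LLM code):

```python
import torch


def route_softmax_before_topk(logits: torch.Tensor, k: int):
    # OLMoE-style: softmax over all expert logits, then select top-k.
    # The selected probabilities are not renormalized, so they sum to < 1 per token.
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, k, dim=-1)
    return top_probs, top_idx


def route_topk_then_softmax(logits: torch.Tensor, k: int):
    # Current behavior: select top-k logits, then softmax over just those k,
    # so the routing weights of the selected experts sum to 1 per token.
    top_logits, top_idx = torch.topk(logits, k, dim=-1)
    top_probs = torch.softmax(top_logits, dim=-1)
    return top_probs, top_idx


if __name__ == "__main__":
    logits = torch.randn(4, 8)  # 4 tokens, 8 experts
    print(route_softmax_before_topk(logits, 2)[0].sum(dim=-1))  # sums < 1
    print(route_topk_then_softmax(logits, 2)[0].sum(dim=-1))    # sums == 1
```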
🔄 Alternatives Considered
Keep normalizing the top-k scores as usual, since there is no clear motivation for disabling it. Conveniently, this behavior is config-driven in the HF implementation of OLMoE.
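A sketch of what a config-driven variant could look like (the `normalize_topk` flag below is hypothetical; in HF's OLMoE implementation the analogous option is, to my knowledge, `norm_topk_prob`):

```python
import torch


def route(logits: torch.Tensor, k: int, normalize_topk: bool = True):
    # Hypothetical unified router: one flag switches between the two behaviors.
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, k, dim=-1)
    if normalize_topk:
        # Renormalize so the selected experts' weights sum to 1 (equivalent to
        # taking the softmax over only the top-k logits, as Mixtral does).
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)
    # With normalize_topk=False this reduces to the OLMoE-style routing above.
    return top_probs, top_idx
```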
📈 Potential Benefits
No clear benefits; if anything, it could slow down training slightly, since the softmax would now be applied over the logits of all experts rather than only the top-k.
📝 Additional Context
See the OLMoE implementation for reference.