
Fix QKV weight sharding for gemma #222

Merged: 1 commit into octoml:batch-serving on Feb 27, 2024

Conversation

@masahi (Member) commented on Feb 26, 2024

This fixes multi-GPU inference for Gemma. The sharding function assumed head_dim = model_config.hidden_size // model_config.num_attention_heads, which is incorrect for Gemma, where head_dim is set explicitly in the model config rather than derived from the hidden size.

@sunggg @Lunderberg
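
For context, here is a minimal, hypothetical sketch of the sharding assumption being fixed. It is not the actual mlc-llm code; `ModelConfig` and `shard_qkv_weight` are illustrative names. It splits a fused QKV weight row-wise across GPUs and shows why deriving head_dim from hidden_size breaks for Gemma:

```python
# Hypothetical sketch, not the actual mlc-llm sharding code. The point is that
# head_dim must be read from the config when the model sets it explicitly (as
# Gemma does), instead of always computing hidden_size // num_attention_heads.

from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class ModelConfig:
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    head_dim: Optional[int] = None  # Gemma sets this explicitly in its config


def shard_qkv_weight(weight: np.ndarray, cfg: ModelConfig, num_shards: int) -> List[np.ndarray]:
    """Split a fused QKV projection weight of shape
    [(num_q_heads + 2 * num_kv_heads) * head_dim, hidden_size] row-wise across
    shards, keeping each shard's Q, K, and V rows together."""
    # Old assumption -- wrong for Gemma, where head_dim != hidden_size / num_heads:
    #   head_dim = cfg.hidden_size // cfg.num_attention_heads
    head_dim = cfg.head_dim or cfg.hidden_size // cfg.num_attention_heads

    q_rows = cfg.num_attention_heads * head_dim
    kv_rows = cfg.num_key_value_heads * head_dim
    q, k, v = np.split(weight, [q_rows, q_rows + kv_rows], axis=0)

    shards = []
    for q_part, k_part, v_part in zip(
        np.split(q, num_shards, axis=0),
        np.split(k, num_shards, axis=0),
        np.split(v, num_shards, axis=0),
    ):
        shards.append(np.concatenate([q_part, k_part, v_part], axis=0))
    return shards


# Gemma-7B-like dimensions: hidden_size=3072, 16 heads, head_dim=256. Deriving
# head_dim as 3072 // 16 = 192 would put every split point in the wrong place.
cfg = ModelConfig(hidden_size=3072, num_attention_heads=16, num_key_value_heads=16, head_dim=256)
w = np.zeros(((16 + 2 * 16) * 256, 3072), dtype=np.float32)
parts = shard_qkv_weight(w, cfg, num_shards=2)
assert parts[0].shape == ((8 + 2 * 8) * 256, 3072)
```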

@masahi merged commit 0dfb756 into octoml:batch-serving on Feb 27, 2024
1 check passed
@masahi deleted the gemma-followup branch on February 27, 2024 at 07:27
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Feb 27, 2024