nightflight-dk changed the title from "[Feature]: Support for Diff-Transformer to limit noise in attention calculation during inference" to "[Feature]: Support for Diff-Transformer to limit noise in attention calculation runtime" on Oct 18, 2024.
nightflight-dk changed the title from "[Feature]: Support for Diff-Transformer to limit noise in attention calculation runtime" to "[Feature]: Support for Diff-Transformer to limit noise in attention calculation @ runtime" on Oct 18, 2024.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
🚀 The feature, motivation and pitch
Researchers at Microsoft Research and Tsinghua University have introduced the Differential Transformer (Diff Transformer), a new LLM architecture that improves performance by amplifying attention to relevant context while filtering out noise. Their paper reports that Diff Transformer outperforms the classic Transformer architecture in various settings. Diff Transformer can be applied both during training and to pretrained models. When applied to pretrained models, it can improve their robustness and accuracy in practical applications such as in-context learning and text summarization. Sources are listed below. This feature request is to examine whether Diff Transformer can be applied at vLLM runtime.
paper: ArXiv
press coverage (October 16th): VentureBeat
Alternatives
N/A
Additional context
github: Diff-Transformer
"
multihead_diffattn.py contains naive implementation of multi-head differential attention.
multihead_flashdiff_1.py contains multi-head differential attention implemented with FlashAttention, for packages that support different qk/v dimensions (e.g., our customized-flash-attention and xformers).
multihead_flashdiff_2.py contains multi-head differential attention implemented with FlashAttention, for packages that do not support different qk/v dimensions (e.g., flash-attention).
Also refer to microsoft/unilm#1633 for another implementation.
"
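For readers unfamiliar with the mechanism, here is a minimal, dependency-free Python sketch of single-head differential attention, assuming the formulation described in the paper: the output is (softmax(Q1·K1ᵀ/√d) − λ·softmax(Q2·K2ᵀ/√d))·V. The function names and the fixed scalar `lam` are illustrative assumptions. In the paper λ is learned, each head's output is normalized, and a causal mask is applied; all of that is omitted here. This is not the code from the Diff-Transformer repository and not a vLLM API.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Row-wise softmax(q @ k^T / sqrt(d)) for lists of d-dimensional vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


def diff_attention(q1, k1, q2, k2, v, lam):
    """Differential attention: (A1 - lam * A2) @ v.

    lam is a fixed scalar here (an assumption made for illustration).
    """
    a1 = attention_weights(q1, k1)
    a2 = attention_weights(q2, k2)
    # Subtracting the second map cancels attention mass common to both maps.
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    n_cols = len(v[0])
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(n_cols)]
        for row in diff
    ]
```

With `lam = 0.0` this reduces to ordinary softmax attention, and with identical query/key pairs and `lam = 1.0` the two maps cancel exactly, which makes both cases convenient sanity checks.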
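Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.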