🧐 Problem Description
Fast-LLM currently allows self-attention across all tokens in a packed sequence, including tokens from different documents separated by EOS tokens. Document-level isolation in attention has been reported to be beneficial (see the Llama 3 reference below). While a model could in principle learn to ignore tokens across document boundaries, this behaviour is neither guaranteed nor efficient.
💡 Proposed Solution
Implement an optional attention mask that prevents self-attention between tokens belonging to different documents within a packed sequence (a rough sketch is included below).
For example, the Llama 3 paper found that such masking had limited impact during standard pre-training but proved crucial in continued pre-training on very long sequences.
🔄 Alternatives Considered
Continue to rely on the model to learn when to attend across document boundaries, which is the current behaviour.
📈 Potential Benefits
Improved Performance: Preventing unnecessary attention across unrelated documents could lead to more efficient learning during training and better inference for document-separated tasks.
Enhanced Flexibility: By making this masking optional, users can tailor attention behavior to specific tasks or datasets.
Support for Long-Sequence Training: As noted in the Llama 3 paper, this feature could become essential in scenarios involving very long sequences or continued pre-training.
📝 Additional Context
Reference Implementations:
Packed attention mask implementation from Megatron-DeepSpeed: GitHub Link.
This change should be optional and integrated in a way that supports ablation studies, as there are cases (e.g., dependency-ordered document sequences) where cross-document attention may be beneficial.