🧐 Problem Description
Fast-LLM currently allows self-attention across all tokens in a packed sequence, including tokens from different documents separated by EOS tokens. Document-level isolation in attention has been reported to be beneficial (see the Llama 3 reference below). While a model could in principle learn to ignore tokens across document boundaries, this behaviour is neither guaranteed nor efficient.
💡 Proposed Solution
Implement an optional attention mask that prevents self-attention between tokens belonging to different documents within a packed sequence (a rough sketch is included below).
For example, the Llama 3 paper found that such masking had limited impact during standard pre-training but proved crucial in continued pre-training on very long sequences.
🔄 Alternatives Considered
Continue to rely on the model to learn when to attend across document boundaries, which is the current behaviour.
📈 Potential Benefits
Improved Performance: Preventing unnecessary attention across unrelated documents could lead to more efficient learning during training and better inference for document-separated tasks.
Enhanced Flexibility: By making this masking optional, users can tailor attention behavior to specific tasks or datasets.
Support for Long-Sequence Training: As noted in the Llama 3 paper, this feature could become essential in scenarios involving very long sequences or continued pre-training.
📝 Additional Context
Reference Implementations:
Packed attention mask implementation from Megatron-DeepSpeed: GitHub Link.
This change should be optional and integrated in a way that supports ablation studies, as there are cases (e.g., dependency-ordered document sequences) where cross-document attention may be beneficial.