Gated Attention
Context: Gating Mechanisms
Gating mechanisms have a long history in neural network architecture as a way to regulate information flow and improve the propagation of gradients.
Early architectures like LSTMs, Highway Networks, and GRUs pioneered gating to control flow across layers / time steps.
More recent work, such as State-Space Models (SSMs) and linear attention variants, often also applies gating to modulate token-mixer outputs.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
The paper systematically investigates augmenting standard softmax attention with gating at various positions in the attention layer (e.g., after the query/key/value projections, after the SDPA output, and after the dense output projection).
Optimal Position
Across the variants studied, the most effective configuration is an elementwise, head-specific sigmoid gate applied to the SDPA output, before the output projection.
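To make the gate's position concrete, here is a minimal NumPy sketch of a single attention head with a sigmoid gate on the SDPA output. All shapes and weight names are illustrative, not from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_head(X, Wq, Wk, Wv, Wg, Wo):
    """Single attention head with a sigmoid gate on the SDPA output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # standard scaled dot-product attention
    Y = A @ V                             # SDPA output
    G = 1.0 / (1.0 + np.exp(-(X @ Wg)))   # input-dependent sigmoid gate
    return (G * Y) @ Wo                   # gate applied before the output projection

# Toy dimensions (hypothetical): seq_len=5, d_model=8
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv, Wg, Wo = (rng.standard_normal((8, 8)) for _ in range(5))
out = gated_attention_head(X, Wq, Wk, Wv, Wg, Wo)
```

The only change from a standard head is the `G * Y` line; removing it recovers vanilla softmax attention.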
Mathematics: Formalizing Gated Attention
The gating mechanism is formalized as a dynamic, input-dependent filter applied to a target input $Y$:

$$Y' = Y \odot \sigma(X W_\theta)$$

where $Y$ is the output being modulated (e.g., the SDPA output), $X$ is the input from which the gate is computed, $W_\theta$ is a learned projection, $\sigma$ is the sigmoid activation, and $\odot$ denotes elementwise multiplication. Two properties follow:

Non-linearity: the gate increases the expressiveness of the otherwise low-rank linear transformation formed by the value ($W_V$) and dense ($W_O$) projections.

Input-dependent sparsity: gating scores concentrated near zero introduce sparsity, filtering out irrelevant context.
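The sparsity property can be illustrated numerically: when a gate's pre-activation is strongly negative, the sigmoid saturates near zero and the corresponding rows of the target are suppressed. The gate logits below are hypothetical values chosen to saturate the sigmoid:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical gate pre-activations: alternating strongly negative / positive
logits = np.array([[-8.0], [8.0], [-8.0], [8.0]])
gate = sigmoid(logits)          # approximately [0, 1, 0, 1] per row

Y = np.ones((4, 3))             # stand-in for an SDPA output
Y_gated = gate * Y              # rows with gate ~ 0 are filtered out
```

Rows 0 and 2 are driven to near zero while rows 1 and 3 pass through almost unchanged, which is the input-dependent filtering behavior described above.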
Impact: Performance and Scaling
Across dense and MoE models, gating yields consistent perplexity reductions and downstream benchmark gains over ungated baselines.
Training Stability: The gated architecture nearly eliminates loss spikes, allowing for larger learning rates and better scaling. Gating allows dense models to converge at higher learning rates (e.g., doubled relative to the baseline) without divergence.
Attention-Sink-Free: Sparse gating scores effectively eliminate the "attention sink" phenomenon, where models over-allocate attention to the first token. In baseline models, the first token receives a disproportionately large share of total attention mass; gated models redistribute this mass across the context.
Context Extension: SDPA output gating eases adaptation to longer sequence lengths by replacing fixed attention-sink patterns with input-dependent filtering. Gated models demonstrate superior length generalization, significantly outperforming baselines at extended context lengths.
Questions
N/A