Gated Attention
Context: Gating Mechanisms
Gating mechanisms have a long history in neural network architecture as a way to regulate information flow and improve the propagation of gradients.
Early architectures like LSTMs, Highway Networks, and GRUs pioneered gating to control flow across layers / time steps.
More recent work, such as State-Space Models (SSMs) and linear attention variants, often also applies gating to modulate token-mixer outputs.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
The paper systematically investigates augmenting standard softmax attention with gating at various positions in the attention layer (e.g., after the query/key/value projections, after the SDPA output, and after the dense output projection).
Optimal Position
Across the variants studied, the most effective configuration is an elementwise, head-specific sigmoid gate applied to the SDPA output, before the output projection.
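To make the gate's position concrete, here is a minimal NumPy sketch of a single attention head with a sigmoid gate on the SDPA output. All shapes and weight names are illustrative, not from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_head(X, Wq, Wk, Wv, Wg, Wo):
    """Single attention head with a sigmoid gate on the SDPA output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # standard scaled dot-product attention
    Y = A @ V                             # SDPA output
    G = 1.0 / (1.0 + np.exp(-(X @ Wg)))   # input-dependent sigmoid gate
    return (G * Y) @ Wo                   # gate applied before the output projection

# Toy dimensions (hypothetical): seq_len=5, d_model=8
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv, Wg, Wo = (rng.standard_normal((8, 8)) for _ in range(5))
out = gated_attention_head(X, Wq, Wk, Wv, Wg, Wo)
```

The only change from a standard head is the `G * Y` line; removing it recovers vanilla softmax attention.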
Mathematics: Formalizing Gated Attention
The gating mechanism is formalized as a dynamic, input-dependent filter applied to a target input $Y$:

$$Y' = Y \odot \sigma(X W_\theta)$$

where $Y$ is the output being modulated (e.g., the SDPA output), $X$ is the input from which the gate is computed, $W_\theta$ is a learned projection, $\sigma$ is the sigmoid activation, and $\odot$ denotes elementwise multiplication. Two properties follow:

Non-linearity: the gate increases the expressiveness of the otherwise low-rank linear transformation formed by the value ($W_V$) and dense ($W_O$) projections.

Input-dependent sparsity: gating scores concentrated near zero introduce sparsity, filtering out irrelevant context.
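The sparsity property can be illustrated numerically: when a gate's pre-activation is strongly negative, the sigmoid saturates near zero and the corresponding rows of the target are suppressed. The gate logits below are hypothetical values chosen to saturate the sigmoid:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical gate pre-activations: alternating strongly negative / positive
logits = np.array([[-8.0], [8.0], [-8.0], [8.0]])
gate = sigmoid(logits)          # approximately [0, 1, 0, 1] per row

Y = np.ones((4, 3))             # stand-in for an SDPA output
Y_gated = gate * Y              # rows with gate ~ 0 are filtered out
```

Rows 0 and 2 are driven to near zero while rows 1 and 3 pass through almost unchanged, which is the input-dependent filtering behavior described above.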
Impact: Performance and Scaling
Across dense and MoE models, gating yields consistent perplexity reductions and downstream benchmark gains over ungated baselines.
Training Stability: The gated architecture nearly eliminates loss spikes, allowing for larger learning rates and better scaling. Gating allows dense models to converge at higher learning rates (e.g., doubled relative to the baseline) without divergence.
Attention-Sink-Free: Sparse gating scores effectively eliminate the "attention sink" phenomenon, where models over-allocate attention to the first token. In baseline models, the first token receives a disproportionately large share of total attention mass; gated models redistribute this mass across the context.
Context Extension: SDPA output gating eases adaptation to longer sequence lengths by replacing fixed attention-sink patterns with input-dependent filtering. Gated models demonstrate superior length generalization, significantly outperforming baselines at extended context lengths.
Questions
N/A