The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Published 7 May 2026 in cs.LG, cs.AI, and stat.ML | (2605.06611v1)

Abstract: Despite the prevalence of the attention sink phenomenon in LLMs, where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose head-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.

Authors (4)

Summary

The paper demonstrates that variance discrepancies in causal self-attention trigger high-variance outlier tokens, initiating attention sinks.
The paper reveals that FFN layers amplify these outliers through super neuron activations and output projections, leading to dimension disparity.
The paper validates that interventions like head-wise RMSNorm effectively mitigate attention sinks and improve training stability.

Structural Origins and Mechanistic Analysis of Attention Sink in Transformer Decoders

Background and Problem Formulation

Attention sinks are artifacts observed in transformer decoders, notably in LLMs, in which the initial token or a few select tokens receive disproportionately high attention scores regardless of semantic relevance. Their effects cut both ways: attention sinks can enable KV cache optimization and prevent over-smoothing, but they also bring pathological outcomes such as activation outliers, representation collapse, and gradient anomalies. Existing hypotheses attribute attention sinks to softmax normalization, positional biases, or spectral subspaces, yet a definitive structural and mechanistic origin has remained unresolved.

The paper traces the architectural roots of attention sinks to three linked effects: variance discrepancy during value aggregation in causal self-attention, super-neuron activation in FFN layers, and the resulting dimension disparity. Through targeted interventions and controlled experiments, the authors establish a structural causal chain from the attention mechanism to downstream representation anomalies.

Mechanistic Analysis: Causal Chain from Attention to Sink Formation

Variance Discrepancy from Causal Value Aggregation

Causal masking in decoder-only transformers inherently creates structural asymmetries in value aggregation. The first token attends only to itself, so its output is a single unaveraged value vector, while each subsequent token receives a weighted average over the values of all positions up to and including its own. Averaging shrinks variance, so the initial token ends up as a high-variance outlier, in contrast with the variance decay across later positions. Empirical evaluations (on Llama-2 and Llama-3) confirm that this effect persists even with random token sequences, indicating that the phenomenon does not depend on token content or semantics.
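The variance decay described above can be illustrated with a minimal NumPy sketch. It is not the paper's experiment: it assumes i.i.d. unit-variance value vectors and perfectly uniform causal attention weights, and the sizes `T` and `d` are arbitrary. Under those assumptions the aggregated output at 0-indexed position t has variance close to 1/(t+1), so only the first position keeps the full variance.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 64, 4096          # sequence length and value dimension (illustrative choices)

# Assumption: i.i.d. unit-variance value vectors, one row per position
V = rng.standard_normal((T, d))

# Assumption: uniform causal attention, row t puts weight 1/(t+1) on positions 0..t
A = np.tril(np.ones((T, T)))
A /= A.sum(axis=1, keepdims=True)

out = A @ V                      # aggregated values, shape (T, d)
var_per_pos = out.var(axis=1)    # empirical variance across dimensions, per position

for t in (0, 1, 3, 15, 63):
    print(f"position {t:2d}: variance {var_per_pos[t]:.3f}  (1/(t+1) = {1 / (t + 1):.3f})")
```

Real attention weights are not uniform, so trained models will deviate from the 1/(t+1) curve; the sketch only shows why a token that aggregates nothing stands out.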

Propagation and Amplification via Output Projections and FFN Super Neurons

The output projection (W_O) does not attenuate the variance discrepancy; structural alignment analyses show that W_O preferentially amplifies high-variance dimensions. The subsequent SwiGLU FFN layer contains specific super neurons, individual columns with markedly higher weight norms. Because the first token is a variance outlier, it aligns strongly with these super neurons and triggers extreme activations, which the down-projection matrix channels into a sparse set of output dimensions. The resulting representation is dominated by a handful of outlier dimensions, a condition termed "dimension disparity".
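A toy sketch of this amplification pathway follows. Everything in it is hand-constructed for illustration and is not taken from the paper: the weights are random, a single hypothetical super neuron `s` is planted with larger gate/up norms, its down-projection column writes into one hypothetical dimension `hot_dim`, and the outlier token is given an explicit component along the super neuron's input direction `u` to stand in for the alignment the authors report. The helper `top1_share` is likewise an assumed metric, not the paper's dominance ratio.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward: down( silu(gate(x)) * up(x) )
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

def top1_share(h):
    # Fraction of squared magnitude carried by the single largest dimension
    energy = h ** 2
    return energy.max() / energy.sum()

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
W_gate = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W_down = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_ff)

# Hypothetical planted super neuron: larger input norms, channel-sparse output column
s, hot_dim = 7, 3
u = rng.standard_normal(d_model)
u /= np.linalg.norm(u)
W_gate[s] = 2.0 * u
W_up[s] = 2.0 * u
W_down[:, s] = 0.0
W_down[hot_dim, s] = 4.0

x_typical = 0.1 * rng.standard_normal(d_model)            # low-variance token
x_outlier = 0.1 * rng.standard_normal(d_model) + 3.0 * u  # extra energy along u (assumed alignment)

# Residual connection plus FFN output
h_typical = x_typical + swiglu_ffn(x_typical, W_gate, W_up, W_down)
h_outlier = x_outlier + swiglu_ffn(x_outlier, W_gate, W_up, W_down)

print("outlier: largest dimension =", int(np.argmax(np.abs(h_outlier))),
      " top-1 share =", round(float(top1_share(h_outlier)), 4))
print("hot_dim magnitude, outlier vs typical:",
      abs(h_outlier[hot_dim]), abs(h_typical[hot_dim]))
```

Because SwiGLU multiplies two projections of the same input, the planted neuron's response grows roughly quadratically with the aligned component, which is why a modest input difference becomes an extreme value in a single output dimension in this toy setup.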

Structural Locking via RMSNorm and Query-Key Projections

After the FFN, RMSNorm acts as a directional filter: because the first token's representation is dominated by a few outlier dimensions, normalization maps it to a nearly fixed direction regardless of its original magnitude. The query-key projection in subsequent layers then structurally locks onto this direction, guaranteeing persistently large attention scores for the outlier token. Head-wise SVD analyses reveal that certain attention heads are predisposed to align queries with this collapsed direction, ensuring sink formation.
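The directional-filter effect can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's analysis: the learnable RMSNorm gain is omitted, `eps` is an arbitrary choice, and the outlier vectors are synthetic, with a spike planted at a hypothetical dimension `hot_dim`. Two outlier vectors with very different spike sizes end up pointing almost the same way after normalization, with the spiked coordinate saturating near sqrt(d).

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm without the learnable gain (omitted here for simplicity)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def cosine(p, q):
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

rng = np.random.default_rng(0)
d, hot_dim = 64, 3   # illustrative width and hypothetical outlier dimension

def outlier_vec(spike):
    # Ordinary Gaussian vector with one coordinate overwritten by a large spike
    x = rng.standard_normal(d)
    x[hot_dim] = spike
    return x

a = rmsnorm(outlier_vec(150.0))
b = rmsnorm(outlier_vec(900.0))        # six times larger spike
c = rmsnorm(rng.standard_normal(d))    # ordinary token, no spike

print("cosine(outlier, outlier):", round(cosine(a, b), 4))
print("cosine(outlier, ordinary):", round(cosine(a, c), 4))
print("normalized spike value:", round(float(a[hot_dim]), 3), " sqrt(d):", np.sqrt(d))
```

A key projection with weight concentrated on that one coordinate would therefore see nearly the same input at every layer, which is one way to read the "structural locking" described above.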

Controlled Interventions and Causal Validation

Two direct interventions validate the structural causal chain:

Mask Intervention: Blocking aggregation for any token induces an attention sink at that position, demonstrating the effect is not due to absolute position but to variance structure.
Variance Amplification: Artificially inflating variance for arbitrary tokens consistently induces sink behavior, while merely scaling the representation norm does not. This indicates that variance discrepancy, not norm magnitude, is the central factor.
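The first step of the mask intervention can be mimicked in a toy setting. The sketch below assumes i.i.d. unit-variance values and uniform causal attention, and it blocks aggregation at an arbitrary, hypothetical position `k` by letting that row attend only to itself. It shows only that the blocked position regains first-token-like variance; it does not reproduce the downstream sink formation reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 64, 4096, 20   # k is an arbitrary position chosen for the intervention

# Assumption: i.i.d. unit-variance value vectors
V = rng.standard_normal((T, d))

mask = np.tril(np.ones((T, T)))
mask[k, :k] = 0.0        # intervention: position k may attend only to itself

# Assumption: uniform attention over whatever the mask allows
A = mask / mask.sum(axis=1, keepdims=True)
var_per_pos = (A @ V).var(axis=1)

for t in (0, k - 1, k, k + 1):
    print(f"position {t:2d}: variance {var_per_pos[t]:.3f}")
```

Positions after `k` still average over `k` and all earlier tokens, so only the blocked row stands out, matching the claim that the effect follows the aggregation structure and not the absolute position.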

Empirical results across multiple open-source models support the generality and robustness of this chain.

Architectural Solutions and Mitigation Strategies

Sigmoid Attention and Head-wise RMSNorm

Replacing softmax with unnormalized sigmoid attention mitigates the variance discrepancy by removing the sum-to-one constraint, suppressing the formation of high-variance outliers and attention sinks. However, sigmoid attention sacrifices downstream training stability and introduces scaling issues with sequence length.
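The difference in normalization can be seen directly on the attention weights. The sketch below uses random scores and is only an illustration; the sigmoid variant shown is a plain elementwise sigmoid under a causal mask, which is an assumption about the form meant here. With softmax, the first row is forced to put weight exactly 1 on the first token; with sigmoid, that weight depends on the score, and row sums grow with the number of visible positions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 32
scores = rng.standard_normal((T, T))           # stand-in for scaled q.k scores
causal = np.tril(np.ones((T, T), dtype=bool))  # True where attention is allowed

# Softmax attention: each causal row is normalized to sum to one
masked = np.where(causal, scores, -np.inf)
softmax_w = np.exp(masked - masked.max(axis=1, keepdims=True))
softmax_w /= softmax_w.sum(axis=1, keepdims=True)

# Unnormalized sigmoid attention (assumed form): elementwise, no sum-to-one constraint
sigmoid_w = np.where(causal, 1.0 / (1.0 + np.exp(-scores)), 0.0)

print("softmax first-token self-weight:", softmax_w[0, 0])
print("sigmoid first-token self-weight:", round(float(sigmoid_w[0, 0]), 3))
print("sigmoid row sums (first, last):", sigmoid_w.sum(axis=1)[[0, -1]])
```

The growing row sums are a simple view of the sequence-length scaling issue mentioned above: without normalization, the aggregated output scales with context size.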

To preserve the stability and performance advantages of softmax while resolving variance disparity, the authors propose head-wise RMSNorm. This normalization stabilizes value aggregation outputs on a per-head basis, neutralizing variance outliers prior to projection. Empirical validation demonstrates:

Significant reduction in dimension disparity and manifold collapse, quantified by lower dominance ratios and preserved effective rank.
Accelerated pre-training convergence and consistently lower validation loss across multiple seeds compared to baseline and sigmoid variants.
Suppression of attention sinks without sacrificing optimization stability.
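A minimal sketch of what a head-wise RMSNorm could look like is given below. The source only states that value aggregation outputs are normalized per head before the output projection; the function name `headwise_rmsnorm`, the optional `gain` parameter, the `eps` value, and the `(T, H, d_head)` layout are all assumptions made for illustration. The input is the same toy as before (i.i.d. values, uniform causal averaging), so the per-position variance starts out decaying and is flattened by the normalization.

```python
import numpy as np

def headwise_rmsnorm(x, gain=None, eps=1e-6):
    """Normalize each head's output vector to unit RMS.

    x: array of shape (T, H, d_head), per-head attention outputs before the
    output projection. gain: optional array broadcastable to (H, d_head).
    """
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    y = x / rms
    return y if gain is None else y * gain

rng = np.random.default_rng(0)
T, H, d_head = 64, 8, 128   # illustrative sizes

# Toy value aggregation: i.i.d. values, uniform causal averaging within each head
V = rng.standard_normal((T, H, d_head))
A = np.tril(np.ones((T, T)))
A /= A.sum(axis=1, keepdims=True)
head_out = np.einsum("ts,shd->thd", A, V)

# Per-position variance across d_head, averaged over heads
before = head_out.var(axis=-1).mean(axis=1)
after = headwise_rmsnorm(head_out).var(axis=-1).mean(axis=1)

print("before: first vs last position:", round(float(before[0]), 3), round(float(before[-1]), 3))
print("after:  first vs last position:", round(float(after[0]), 3), round(float(after[-1]), 3))
```

Normalizing per head, as opposed to across the concatenated heads, keeps one head's outlier from setting the scale for the others; whether the paper's version includes a learnable gain is not stated in this summary.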

Implications for Transformer Architecture and Model Training

Practical and Theoretical Implications

The findings indicate that attention sinks are controllable architectural artifacts, not inevitable byproducts of scaling. Systematic interventions that stabilize variance at aggregation points can resolve downstream representation anomalies, including dimension disparity, manifold collapse, and optimization bottlenecks. Head-wise RMSNorm applies normalization at the level of individual attention heads to correct these structural imbalances, promoting representational richness and training efficiency.

Generalizability and Future Directions

Empirical tests on models with diverse architectures (e.g., GQA in Llama-3) highlight the structural invariance of the causal chain underlying sink formation. The approach provides a framework for architectural design targeting geometric stability, potentially beneficial for scaling models to higher parameter counts and for integration into mixtures-of-experts and other heterogeneous transformer variants. Future research should investigate the impact of variance stabilization on interpretability, robustness, and resource efficiency in both training and inference.

Conclusion

This paper provides a detailed mechanistic explanation for the structural origin of attention sinks in transformer decoders, tracing their emergence to variance discrepancy in causal attention, selective activation of super neurons, and ensuing dimension disparity. Controlled interventions and new normalization strategies demonstrate that attention sinks are not fundamental model constraints but can be systematically mitigated. These results inform both theoretical understanding and practical architectural design, offering pathways toward more stable, interpretable, and efficient transformer-based LLMs (2605.06611).
