- The paper demonstrates that variance discrepancies in causal self-attention trigger high-variance outlier tokens, initiating attention sinks.
- The paper reveals that FFN layers amplify these outliers through super neuron activations and output projections, leading to dimension disparity.
- The paper validates that interventions like head-wise RMSNorm effectively mitigate attention sinks and improve training stability.
Attention sinks are artifacts observed in transformer decoders, notably LLMs, in which the initial token or a few select tokens receive disproportionately high attention scores regardless of semantic relevance. Their effects are mixed: attention sinks can enable KV cache optimization and prevent over-smoothing, but they also produce pathological outcomes such as activation outliers, representation collapse, and gradient anomalies. Existing hypotheses attribute attention sinks to softmax normalization, positional biases, or spectral subspaces, yet a definitive structural and mechanistic origin has remained unresolved.
This paper traces the architectural roots of attention sinks, focusing on variance discrepancy during value aggregation in causal self-attention, super-neuron activation in FFN layers, and the resulting dimension disparity. Through targeted interventions and controlled experiments, the authors establish a structural causal chain from the attention mechanism to downstream representation anomalies.
Variance Discrepancy from Causal Value Aggregation
Causal masking in decoder-only transformers inherently creates structural asymmetries in value aggregation. The first token attends only to itself, so its output is its own value vector, while each later token receives a weighted average of the value vectors in its context. Because averaging reduces variance, the first token becomes a high-variance outlier while variance decays across later positions. Empirical evaluations (on Llama-2 and Llama-3) confirm that this effect persists even with random token sequences, indicating that it does not depend on token content.
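The variance decay can be illustrated with a toy model. The sketch below assumes i.i.d. unit-variance value vectors and uniform attention weights over the causal prefix. Both are simplifying assumptions for illustration, not the paper's experimental setup. Under them, the token at 0-indexed position t averages t + 1 vectors, so its output variance is roughly 1 / (t + 1).

```python
import numpy as np


def causal_uniform_aggregate(values: np.ndarray) -> np.ndarray:
    """Average value vectors over the causal prefix: out[t] = mean(values[:t+1])."""
    seq_len = values.shape[0]
    # Lower-triangular weights; row t puts weight 1/(t+1) on positions 0..t.
    weights = np.tril(np.ones((seq_len, seq_len)))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values


rng = np.random.default_rng(0)
seq_len, dim = 16, 4096
values = rng.standard_normal((seq_len, dim))  # assumed i.i.d., unit variance

out = causal_uniform_aggregate(values)
per_position_variance = out.var(axis=1)  # variance across feature dimensions
expected = 1.0 / np.arange(1, seq_len + 1)

for t in (0, 1, 3, 15):
    print(f"position {t}: variance {per_position_variance[t]:.3f}, 1/(t+1) = {expected[t]:.3f}")
```

Only the first position keeps the full variance of a single value vector; every later position is damped by averaging.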
Propagation and Amplification via Output Projections and FFN Super Neurons
The variance discrepancy is not attenuated by the output projection (W_O); structural alignment analyses show that W_O preferentially amplifies high-variance dimensions. The subsequent FFN layer, equipped with SwiGLU, activates specific super neurons: individual columns with markedly higher weight norms. The first token, being a variance outlier, aligns strongly with these super neurons, triggering extreme activations that the down-projection matrix channels into a sparse set of output dimensions. This pathway yields a representation dominated by a handful of outlier dimensions, a condition termed "dimension disparity".
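The amplification path can be sketched with a hand-built SwiGLU FFN. Everything in this toy is an assumption made for illustration: the weights are random, a single "super neuron" is planted by hand with large-norm gate and up columns, its down-projection row writes to one arbitrary output dimension (index 7), and `peak_to_median` is a hypothetical stand-in metric, not the paper's dominance ratio.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (silu(x W_gate) * (x W_up)) W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def peak_to_median(y: np.ndarray) -> float:
    """Hypothetical disparity measure: largest |entry| over median |entry|."""
    mags = np.abs(y)
    return float(mags.max() / np.median(mags))


rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256
w_gate = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
w_up = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
w_down = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

# Plant one super neuron (hidden unit 0): large-norm input columns sharing a
# direction, and a down-projection row that writes to a single output dim.
direction = rng.standard_normal(d_model)
direction /= np.linalg.norm(direction)
w_gate[:, 0] = 8.0 * direction
w_up[:, 0] = 8.0 * direction
w_down[0, :] = 0.0
w_down[0, 7] = 8.0

# A low-variance token with no component along the super-neuron direction,
# and an outlier token that adds a large component along it.
typical = 0.3 * rng.standard_normal(d_model)
typical -= (typical @ direction) * direction
outlier = typical + 4.0 * direction

typical_out = swiglu_ffn(typical, w_gate, w_up, w_down)
outlier_out = swiglu_ffn(outlier, w_gate, w_up, w_down)

print("typical peak/median:", round(peak_to_median(typical_out), 1))
print("outlier peak/median:", round(peak_to_median(outlier_out), 1))
print("outlier dominant dim:", int(np.argmax(np.abs(outlier_out))))
```

In this toy, the outlier token's FFN output is dominated by the single dimension the planted neuron writes to, while the typical token's output stays spread across dimensions.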
Structural Locking via RMSNorm and Query-Key Projections
After the FFN, RMSNorm acts as a directional filter: it rescales the first token’s outlier-dominated representation onto a fixed, high-magnitude direction. The query-key projection in subsequent layers then structurally locks onto this direction, guaranteeing persistently large attention scores for the outlier token. Head-wise SVD analyses reveal that certain attention heads are predisposed to align queries with this collapsed direction, ensuring sink formation.
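The "directional filter" behavior follows from RMSNorm discarding overall scale and keeping only direction. The sketch below uses a plain RMSNorm without a learned gain and two made-up vectors dominated by the same dimension (index 7, chosen arbitrarily). These are illustrative assumptions, not values from the paper.

```python
import numpy as np


def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm without a learned gain: divide by the root mean square."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms


rng = np.random.default_rng(0)
d_model = 64

# Two different representations that share one dominant outlier dimension.
sink_a = rng.standard_normal(d_model)
sink_a[7] += 500.0
sink_b = rng.standard_normal(d_model)
sink_b[7] += 900.0

norm_a, norm_b = rms_norm(sink_a), rms_norm(sink_b)
cosine = norm_a @ norm_b / (np.linalg.norm(norm_a) * np.linalg.norm(norm_b))

print("cosine similarity after RMSNorm:", round(float(cosine), 4))
print("outlier dim after RMSNorm:", round(float(norm_a[7]), 3), "vs sqrt(d) =", np.sqrt(d_model))
# Rescaling the input leaves the normalized output essentially unchanged.
print("scale invariant:", np.allclose(rms_norm(3.0 * sink_a), norm_a, atol=1e-4))
```

Both inputs collapse onto nearly the same vector, with the outlier dimension close to sqrt(d_model). A downstream query-key projection therefore sees an almost constant key direction for such tokens.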
Controlled Interventions and Causal Validation
Two direct interventions validate the structural causal chain:
- Mask Intervention: Blocking aggregation for any token induces an attention sink at that position, demonstrating the effect is not due to absolute position but to variance structure.
- Variance Amplification: Artificially inflating variance for arbitrary tokens consistently induces sink behavior, while merely scaling the representation norm does not. This indicates that variance discrepancy, not norm magnitude, is the central factor.
Empirical results across multiple open-source models confirm the universality and robustness of this chain.
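The mask intervention can be mimicked in a toy setting. The sketch below assumes i.i.d. unit-variance value vectors and uniform causal attention, and it models "blocking aggregation" by letting a chosen position attend only to itself. It covers only the variance side of the argument and does not reproduce the downstream sink formation measured in the paper. The `isolated` parameter is a hypothetical name introduced here.

```python
import numpy as np


def aggregate(values: np.ndarray, isolated: tuple = ()) -> np.ndarray:
    """Uniform causal averaging; positions in `isolated` see only themselves."""
    seq_len = values.shape[0]
    mask = np.tril(np.ones((seq_len, seq_len)))
    for pos in isolated:
        mask[pos, :] = 0.0
        mask[pos, pos] = 1.0  # block aggregation: self-attention only
    weights = mask / mask.sum(axis=1, keepdims=True)
    return weights @ values


rng = np.random.default_rng(0)
seq_len, dim = 16, 4096
values = rng.standard_normal((seq_len, dim))

baseline_var = aggregate(values).var(axis=1)
intervened_var = aggregate(values, isolated=(9,)).var(axis=1)

print("position 9 variance, baseline:  ", round(float(baseline_var[9]), 3))
print("position 9 variance, intervened:", round(float(intervened_var[9]), 3))
```

Isolating position 9 restores the full single-vector variance at that position while leaving every other position unchanged, consistent with the claim that the outlier follows the aggregation structure rather than the absolute position.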
Architectural Solutions and Mitigation Strategies
Sigmoid Attention and Head-wise RMSNorm
Replacing softmax with unnormalized sigmoid attention mitigates the variance discrepancy by removing the sum-to-one constraint, suppressing the formation of high-variance outliers and attention sinks. However, sigmoid attention sacrifices downstream training stability and introduces scaling issues with sequence length.
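The difference between the two weighting schemes can be seen on random scores. This is a minimal sketch with made-up Gaussian scores and no bias or length-dependent rescaling of the sigmoid, which are assumptions of this example rather than details from the paper.

```python
import numpy as np


def causal_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over the causal prefix; each row sums to one."""
    seq_len = scores.shape[0]
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


def causal_sigmoid(scores: np.ndarray) -> np.ndarray:
    """Element-wise sigmoid over the causal prefix; rows are not normalized."""
    seq_len = scores.shape[0]
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.where(mask, 1.0 / (1.0 + np.exp(-scores)), 0.0)


rng = np.random.default_rng(0)
seq_len = 32
scores = rng.standard_normal((seq_len, seq_len))

softmax_w = causal_softmax(scores)
sigmoid_w = causal_sigmoid(scores)

print("softmax weight of token 0 on itself:", softmax_w[0, 0])
print("sigmoid weight of token 0 on itself:", round(float(sigmoid_w[0, 0]), 3))
print("sigmoid row sums (first, last):",
      round(float(sigmoid_w[0].sum()), 2), round(float(sigmoid_w[-1].sum()), 2))
```

Under softmax the first token is forced to put weight exactly 1 on itself, whereas the sigmoid weight is free to be smaller. The growing sigmoid row sums illustrate the scaling issue with sequence length.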
To preserve the stability and performance advantages of softmax while resolving variance disparity, the authors propose head-wise RMSNorm. This normalization stabilizes value aggregation outputs on a per-head basis, neutralizing variance outliers prior to projection. Empirical validation demonstrates:
- Significant reduction in dimension disparity and manifold collapse, quantified by lower dominance ratios and preserved effective rank.
- Accelerated pre-training convergence and consistently lower validation loss across multiple seeds compared to baseline and sigmoid variants.
- Suppression of attention sinks without sacrificing optimization stability.
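A minimal sketch of head-wise RMSNorm, assuming it is applied to each head's aggregated output over the head dimension, before the heads are concatenated and passed to the output projection. The optional per-head gain, the tensor layout, and the uniform causal attention used to generate inputs are assumptions of this example, not confirmed details of the authors' implementation.

```python
import numpy as np


def headwise_rms_norm(head_out: np.ndarray, gain=None, eps: float = 1e-6) -> np.ndarray:
    """Normalize each head's output vector by its own RMS.

    head_out: array of shape (seq_len, n_heads, head_dim).
    gain: optional array of shape (n_heads, head_dim), assumed learnable.
    """
    rms = np.sqrt(np.mean(head_out ** 2, axis=-1, keepdims=True) + eps)
    normed = head_out / rms
    return normed if gain is None else normed * gain


def per_head_rms(x: np.ndarray) -> np.ndarray:
    return np.sqrt(np.mean(x ** 2, axis=-1))  # shape (seq_len, n_heads)


rng = np.random.default_rng(0)
seq_len, n_heads, head_dim = 16, 4, 128
values = rng.standard_normal((seq_len, n_heads, head_dim))

# Toy value aggregation: uniform attention over the causal prefix.
weights = np.tril(np.ones((seq_len, seq_len)))
weights /= weights.sum(axis=1, keepdims=True)
head_out = np.einsum("ts,shd->thd", weights, values)

before = per_head_rms(head_out)
after = per_head_rms(headwise_rms_norm(head_out))

print("RMS before, first vs last position:", before[0].mean().round(3), before[-1].mean().round(3))
print("RMS after,  first vs last position:", after[0].mean().round(3), after[-1].mean().round(3))
```

Before normalization the first position has a much larger per-head RMS than later positions. Afterwards every position and head has RMS close to 1, so the first token no longer enters the output projection as a variance outlier.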
Practical and Theoretical Implications
The findings reveal that attention sinks are controllable architectural artifacts, not inevitable byproducts of scaling. Interventions that stabilize variance at aggregation points can resolve downstream representation anomalies such as dimension disparity, manifold collapse, and optimization bottlenecks. Head-wise RMSNorm applies normalization at the attention-head level to correct structural imbalances, promoting representational richness and training efficiency.
Generalizability and Future Directions
Empirical tests on models with diverse architectures (e.g., GQA in Llama-3) highlight the structural invariance of the causal chain underlying sink formation. The approach provides a framework for architectural design targeting geometric stability, potentially beneficial for scaling models to higher parameter counts and for integration into mixture-of-experts models and other heterogeneous transformer variants. Future research should investigate the impact of variance stabilization on interpretability, robustness, and resource efficiency in both training and inference.
Conclusion
This paper provides a detailed mechanistic explanation for the structural origin of attention sinks in transformer decoders, tracing their emergence to variance discrepancy in causal attention, selective activation of super neurons, and ensuing dimension disparity. Controlled interventions and new normalization strategies demonstrate that attention sinks are not fundamental model constraints but can be systematically mitigated. These results inform both theoretical understanding and practical architectural design, offering pathways toward more stable, interpretable, and efficient transformer-based LLMs (2605.06611).