Mixture-of-Depths Attention: Unlocking Deep Transformers

This presentation explores Mixture-of-Depths Attention (MoDA), a mechanism that targets the information dilution problem in deep language models. By letting each attention head retrieve from both sequence positions and earlier-layer representations at once, MoDA achieves the expressivity of dense cross-layer connectivity at a fraction of the cost. The talk covers the technical mechanism, hardware-efficient implementation strategies, and empirical results showing consistent improvements in validation perplexity and on downstream benchmarks with minimal computational overhead.
Script
Deep Transformers promise greater representational power, but there's a hidden cost: as layers stack up, information from early layers progressively dissolves through repeated residual updates. Mixture-of-Depths Attention solves this dilution problem by letting each layer reach back and selectively retrieve representations from its entire history, without the prohibitive cost of dense connectivity.
Standard residual architectures force every intermediate representation through a single additive bottleneck. By layer 40, the subtle distinctions captured at layer 5 have been averaged away dozens of times. This is why simply stacking more Transformer blocks often yields diminishing returns: the architectural pipeline itself erases the very information depth was meant to preserve.
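To make the dilution argument concrete, here is a toy numerical sketch. It assumes each layer adds an independent random update of similar size to the residual stream, which is a simplification and not how a trained Transformer behaves; the names `dim`, `depth`, and `probe_layer` are illustrative only. Under that assumption, the alignment between one layer's contribution and the running stream shrinks as more updates pile on.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

random.seed(0)
dim, depth, probe_layer = 2048, 40, 5   # hypothetical sizes

stream = [0.0] * dim    # residual stream, updated additively
probe = None            # the update written by layer `probe_layer`
alignment = {}          # layer index -> cosine(probe, stream)

for layer in range(1, depth + 1):
    # Toy assumption: every layer writes an independent Gaussian update.
    update = [random.gauss(0.0, 1.0) for _ in range(dim)]
    if layer == probe_layer:
        probe = update
    stream = [s + u for s, u in zip(stream, update)]
    if probe is not None:
        alignment[layer] = cosine(probe, stream)

# With independent, equal-scale updates the alignment decays
# roughly like 1 / sqrt(layer).
print(f"layer {probe_layer}: {alignment[probe_layer]:.3f}")
print(f"layer {depth}: {alignment[depth]:.3f}")
```

The point of the sketch is only the trend: nothing in a purely additive stream protects an early contribution from being swamped by later ones.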
Mixture-of-Depths Attention addresses this by creating a unified retrieval space that spans both time and depth.
Here's how it works: at each layer, every query looks beyond the other tokens in the sequence. It simultaneously considers the representations that previous layers produced at the same positional index. A single softmax normalizes across both sets of keys, letting the model decide dynamically whether to pull information from the sequence context or from features computed at earlier layers. This data-dependent routing means the architecture adapts its connectivity on the fly.
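The joint normalization can be sketched for a single query and a single head. This is a minimal illustration, not the paper's implementation: the function name `joint_attention`, the sizes, and the random inputs are all assumptions, and details such as causal masking, multiple heads, and how depth keys and values are actually produced are left out.

```python
import math
import random

def joint_attention(q, seq_kv, depth_kv):
    """Attend over sequence and depth entries with one softmax.

    q:        list[float], query for one token at the current layer
    seq_kv:   list of (key, value) pairs for visible tokens in this layer
    depth_kv: list of (key, value) pairs from earlier layers, same position
    """
    entries = seq_kv + depth_kv                 # one unified retrieval set
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k, _ in entries]
    peak = max(scores)                          # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]         # sums to 1 across BOTH sources
    dim = len(entries[0][1])
    out = [sum(w * v[i] for w, (_, v) in zip(weights, entries))
           for i in range(dim)]
    return out, weights

random.seed(0)
d, T, L = 8, 6, 3   # hypothetical: head dim, visible tokens, earlier layers

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(d)]

q = rand_vec()
seq_kv = [(rand_vec(), rand_vec()) for _ in range(T)]
depth_kv = [(rand_vec(), rand_vec()) for _ in range(L)]

out, weights = joint_attention(q, seq_kv, depth_kv)
seq_mass = sum(weights[:T])      # attention spent on the sequence
depth_mass = sum(weights[T:])    # attention spent on earlier layers
print(f"sequence mass={seq_mass:.3f}, depth mass={depth_mass:.3f}")
```

Because the two sources share one softmax, attention spent on depth entries is attention not spent on sequence entries. That competition is what makes the routing data-dependent rather than a fixed mix.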
Dense connectivity would concatenate every intermediate state, so parameter and compute costs balloon quadratically with depth. MoDA achieves comparable expressivity at a cost that grows linearly, by reusing the grouped-query attention infrastructure already present in modern language models. Custom CUDA kernels organize depth key-value pairs into contiguous, chunk-aware layouts, recovering nearly all the throughput of standard sequence attention. The result: cross-layer retrieval that scales.
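A back-of-the-envelope count shows why concatenation scales badly. The cost model below is an assumption for illustration, not the paper's accounting: the dense variant projects the concatenation of all earlier states back to the model width at every layer, while the alternative adds a fixed number of projections per layer. Both function names and the `extra_projections` parameter are hypothetical.

```python
def dense_projection_params(num_layers, width):
    # Layer l reads the concatenation of l earlier states (l * width wide)
    # and projects it back to `width`, needing l * width * width weights.
    return sum(l * width * width for l in range(1, num_layers + 1))

def per_layer_constant_params(num_layers, width, extra_projections=2):
    # Hypothetical alternative: each layer adds a fixed number of
    # width x width projections, independent of how deep it sits.
    return num_layers * extra_projections * width * width

width = 1024  # illustrative model width
for num_layers in (12, 24, 48):
    dense = dense_projection_params(num_layers, width)
    const = per_layer_constant_params(num_layers, width)
    print(f"layers={num_layers}: dense/constant ratio = {dense / const:.2f}")
```

Under these assumptions, doubling the depth doubles the constant-per-layer count but roughly quadruples the dense count, so the gap widens as the model gets deeper.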
Trained on 400 billion tokens, MoDA models consistently outperform strong open-source baselines. The improvements aren't marginal: downstream task accuracy rises by over 2 percent on average, and validation loss drops across the board. Crucially, the gains hold even as model depth grows, demonstrating that MoDA genuinely converts extra layers into usable signal rather than compounding the dilution problem.
Mixture-of-Depths Attention proves that deep Transformers can be both expressive and efficient when given the right connectivity. By unifying sequence and depth retrieval, it turns architectural depth into a practical advantage, not a theoretical luxury.