Linear-Time Global Visual Modeling without Explicit Attention

Published 3 May 2026 in cs.CV | (2605.01711v1)

Abstract: Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level sequence global modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.


Authors (3)

Summary

The paper demonstrates that dynamic parameterization can replace explicit attention for global visual modeling.
It details the WeightFormer architecture that achieves competitive performance on ImageNet, COCO, and ADE20K while maintaining linear complexity.
Empirical results show significant computational savings and improved efficiency in image classification, detection, segmentation, and generation tasks.

Linear-Time Global Visual Modeling without Explicit Attention: An Expert Analysis

Motivation and Perspective Shift

The prevailing approach to global modeling with Transformers, in vision and other domains, is to compute and apply attention weights explicitly. This paradigm incurs quadratic computation and memory cost because of the $N \times N$ token interaction matrix. Prior efficiency efforts (sparse attention, low-rank approximations, kernel methods, and others) stay within the explicit attention paradigm: they still aggregate values through similarity-weighted matrices.

This paper reframes attention from explicit token-wise aggregation to implicit global modeling via dynamic parameterization. Mathematically, attention is recast as a two-layer, input-conditioned MLP applied to the queries: $K^\top$ and $V$, both predicted from the input, act as the dynamic weights of the first and second layers, and Softmax serves as the activation. From this viewpoint, global sequence modeling amounts to compressing context into dynamic weights, which suggests that dependencies can be integrated implicitly, without forming the $N \times N$ matrix.
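The equivalence described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the projection matrices `Wq`, `Wk`, `Wv`, the sizes `N` and `d`, and the function names are all illustrative assumptions. It compares standard single-head scaled dot-product attention with a two-layer MLP whose weights are $K^\top$ (scaled) and $V$.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard single-head attention with explicit N x N weights.
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    return A @ V

def dynamic_mlp(Q, W1, W2):
    # Two-layer MLP: hidden = softmax(Q W1), output = hidden W2.
    return softmax(Q @ W1) @ W2

rng = np.random.default_rng(0)
N, d = 6, 4                                   # illustrative sizes
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

W1 = K.T / np.sqrt(d)   # first-layer weights, shape (d, N), predicted from X
W2 = V                  # second-layer weights, shape (N, d), predicted from X

out_attn = attention(Q, K, V)
out_mlp = dynamic_mlp(Q, W1, W2)
print(np.allclose(out_attn, out_mlp))
```

Note that the hidden width of this MLP is `N`, the sequence length, which is where the quadratic cost of explicit attention comes from in this view.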

Dynamic Parameterization as Attention Replacement

The central question is whether dynamic parameterization suffices for Transformer-level global modeling with strictly linear complexity. To test this, the paper designs several dynamic parameter prediction strategies: linear and correlation-based predictors, deep and bilateral activations for linear layers, and spatially adaptive, decoupled amplitude-direction, and convolutional predictors for depthwise convolution. All are conditioned on compression paradigms that decouple parameter generation from input length.

The dynamic weights modulate static parameters ($W_0 + \Delta W(X)$ for both linear layers and depthwise convolution), integrating global information through an implicit transformation rather than explicit token routing. Spatial compression through pooling and correlation statistics keeps the cost of parameter prediction linear in the number of tokens, which provides a principled foundation for efficient global modeling architectures.
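To make the $W_0 + \Delta W(X)$ idea concrete, here is a minimal NumPy sketch of a dynamic linear layer. It is an illustration under stated assumptions, not the paper's predictor: the class name `DynamicLinear`, the use of mean pooling as the compression step, and the single linear map `P` that produces $\Delta W$ are all hypothetical choices made for brevity.

```python
import numpy as np

class DynamicLinear:
    """Linear layer whose weight is W0 + delta_W(X) (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rng, scale=0.02):
        self.d_in, self.d_out = d_in, d_out
        self.W0 = rng.standard_normal((d_in, d_out)) * scale        # static weight
        # Assumed predictor: one linear map from pooled context to a weight offset.
        self.P = rng.standard_normal((d_in, d_in * d_out)) * scale

    def delta_w(self, X):
        c = X.mean(axis=0)                       # pooling over tokens: O(N * d_in)
        # The prediction step costs O(d_in * d_in * d_out), independent of N.
        return (c @ self.P).reshape(self.d_in, self.d_out)

    def __call__(self, X):
        W = self.W0 + self.delta_w(X)            # input-conditioned weight
        return X @ W                             # O(N * d_in * d_out)

rng = np.random.default_rng(0)
N, d_in, d_out = 16, 8, 8
layer = DynamicLinear(d_in, d_out, rng)
X = rng.standard_normal((N, d_in))
Y = layer(X)

# Perturb only the first token and observe the output of the last token.
X_perturbed = X.copy()
X_perturbed[0] += 1.0
Y_perturbed = layer(X_perturbed)
print(np.allclose(Y[-1], Y_perturbed[-1]))       # differs: context reaches every token
```

Every step is linear in `N`, yet changing one token alters the output at all other positions, because the pooled context enters through the shared weight rather than through token-to-token routing.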

WeightFormer Architecture and Empirical Evaluation

WeightFormer is instantiated by integrating the aforementioned dynamic parameterization techniques sparsely across blocks, striking a balance between model capacity and efficiency. Dynamic blocks replace standard layers with spatially adaptive depthwise convolutions and bilateral-activated MLPs. An ablation on block frequency reveals that sparse insertion (every third block) maximizes accuracy and throughput, avoiding optimization difficulties attendant to overly dense dynamic parameterization.
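The sparse insertion pattern can be written as a simple layout rule. The sketch below is an assumption-laden illustration: the function name `block_schedule`, the depth of 12, and the choice to place the dynamic block last in each group of three are hypothetical, since the summary only states that dynamic blocks appear every third block.

```python
def block_schedule(depth, period=3):
    # Mark every `period`-th block as dynamic; all others stay static.
    # Placing the dynamic block at the end of each group is an assumption.
    return ["dynamic" if (i + 1) % period == 0 else "static" for i in range(depth)]

layout = block_schedule(12)
print(layout)
print(layout.count("dynamic"), "dynamic blocks out of", len(layout))
```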

Image Classification

On ImageNet-1K, WeightFormer-S achieves 81.3% top-1 accuracy, exceeding DeiT-S and ConvNeXt-S with similar parameter/FLOPs budgets. Importantly, WeightFormer maintains linear complexity and scales efficiently to high-resolution inputs, with throughput and memory advantages (up to $7.7\times$ at $1248 \times 1248$ compared to DeiT).
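A rough back-of-the-envelope calculation shows why the gap widens at high resolution. The sketch below assumes a patch size of 16 and a channel width of 384 (typical of a DeiT-S-style backbone, assumed here) and counts only multiply-adds for the token-mixing step. It does not attempt to reproduce the reported $7.7\times$ figure, which depends on the full model and hardware.

```python
def num_tokens(res, patch=16):
    # Number of non-overlapping patches in a square image.
    return (res // patch) ** 2

def attn_mixing_madds(n, d):
    # Q K^T and A V each cost about n * n * d multiply-adds.
    return 2 * n * n * d

def linear_mixing_madds(n, d):
    # A token-wise d x d transform costs about n * d * d multiply-adds.
    return n * d * d

d = 384                      # assumed channel width
n_low = num_tokens(224)      # 196 tokens
n_high = num_tokens(1248)    # 6084 tokens

token_ratio = n_high / n_low
attn_ratio = attn_mixing_madds(n_high, d) / attn_mixing_madds(n_low, d)
linear_ratio = linear_mixing_madds(n_high, d) / linear_mixing_madds(n_low, d)

print(n_low, n_high)
print(f"tokens grow by about {token_ratio:.1f}x")
print(f"attention mixing cost grows by about {attn_ratio:.0f}x")
print(f"linear mixing cost grows by about {linear_ratio:.1f}x")
```

Going from $224 \times 224$ to $1248 \times 1248$ multiplies the token count by roughly 31, so a term quadratic in $N$ grows by roughly the square of that factor, while a linear term grows only with the token count.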

Object Detection and Segmentation

On COCO, WeightFormer-T provides consistent improvements in box (45.0 vs 44.4) and mask (38.3 vs 38.1) AP over DeiT-T, with substantial FLOPs reduction (566G vs 594G total, 77G vs 106G backbone). In semantic segmentation on ADE20K, WeightFormer-S outperforms DeiT-S by a 1.6-point mIoU margin with lower backbone FLOPs.

Image Generation

On class-conditional ImageNet-1K generation, WeightFormer configurations achieve comparable or better FID than the DiT and DiG baselines at reduced computational cost.

Effective Receptive Fields

Effective receptive field (ERF) analysis shows that dynamically weighted models have expansive, global receptive fields, indicating that implicit parameterization achieves global reasoning akin to explicit attention without quadratic overhead.
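The effect can be reproduced on a toy 1D example. The sketch below is not the paper's ERF protocol: it measures the sensitivity of one output position to every input position with central finite differences, and the dynamic kernel predictor (an offset `p` scaled by the global input mean) is a hypothetical stand-in for the paper's predictors.

```python
import numpy as np

def conv1d_same(x, w):
    # Zero-padded 1D convolution (cross-correlation) with an odd-length kernel.
    k = len(w)
    xp = np.pad(x, k // 2)
    return np.array([xp[i:i + k] @ w for i in range(len(x))])

def static_model(x, w0):
    return conv1d_same(x, w0)

def dynamic_model(x, w0, p):
    # Kernel = static part + offset predicted from the global mean (assumed predictor).
    return conv1d_same(x, w0 + p * x.mean())

def erf(model, x, center, eps=1e-4):
    # |d y[center] / d x[j]| for every j, via central finite differences.
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (model(x + e)[center] - model(x - e)[center]) / (2 * eps)
    return np.abs(g)

N, center = 16, 8
x = np.arange(1, N + 1, dtype=float) / N
w0 = np.array([0.25, 0.5, 0.25])
p = np.array([1.0, 1.0, 1.0])

erf_static = erf(lambda z: static_model(z, w0), x, center)
erf_dynamic = erf(lambda z: dynamic_model(z, w0, p), x, center)

print("static  nonzero positions:", int((erf_static > 1e-6).sum()))
print("dynamic nonzero positions:", int((erf_dynamic > 1e-6).sum()))
```

With a size-3 kernel, the static model responds only to the three inputs under the kernel, whereas the dynamic model responds to every input position, because each input influences the predicted kernel through the pooled statistic.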

Theoretical Implications and Directions

Dynamic parameterization changes how global sequence modeling can be conceptualized. The dynamic MLP interpretation accounts for the quadratic scaling of classic attention (the hidden width of the implied MLP equals the sequence length $N$, so applying it to $N$ tokens costs $O(N^2)$) and indicates that global context aggregation does not require explicit token-to-token routing. This opens a principled design space for efficient architectures in vision and potentially in broader domains.

The expressivity and inductive biases of dynamic parameterization—particularly how spatial and channel compression interact with semantic representation—remain undercharacterized. Training stability and optimization are non-trivial due to input-conditioned gradient flow. The paper restricts its evaluation to vision tasks; generalization to language or multimodal domains is an open avenue. Extensions could leverage more advanced parameter generation, adaptive distribution of dynamic blocks, and deeper theoretical analysis of global modeling power and limitations.

Conclusion

This work demonstrates that explicit attention is not strictly necessary for global sequence modeling. By recasting attention as a dynamically parameterized MLP and systematically validating dynamic parameterization within the WeightFormer architecture, the authors show that linear-time, efficient global modeling is feasible and competitive in vision tasks. This shift motivates future development of architectures free from quadratic attention, with potential gains in computational cost, scalability, and representational power across AI domains.
