
Linear-Time Global Visual Modeling without Explicit Attention

Published 3 May 2026 in cs.CV | (2605.01711v1)

Abstract: Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level sequence global modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.

Authors (3)

Summary

  • The paper demonstrates that dynamic parameterization can replace explicit attention for global visual modeling.
  • It details the WeightFormer architecture that achieves competitive performance on ImageNet, COCO, and ADE20K while maintaining linear complexity.
  • Empirical results show significant computational savings and improved efficiency in image classification, detection, segmentation, and generation tasks.

Linear-Time Global Visual Modeling without Explicit Attention: An Expert Analysis

Motivation and Perspective Shift

The prevailing approach to global modeling with Transformers, in vision and other domains, is to compute and apply attention weights explicitly. This paradigm incurs quadratic cost in computation and memory because of the $N \times N$ token interaction matrix. Prior efficiency efforts (sparse attention, low-rank approximations, kernel methods, and others) remain within the explicit attention paradigm: they still aggregate values via similarity-weighted matrices.

This paper reframes attention from explicit token-wise aggregation to implicit global modeling via dynamic parameterization. Attention is recast mathematically as a two-layer input-conditioned MLP, with the key $K^\top$ and value $V$ matrices functioning as dynamic parameters and Softmax as the activation, all predicted from the input. This perspective dispenses with explicit attention weights and treats global sequence modeling as compressing context into dynamic weights, facilitating implicit dependency integration and obviating the $N \times N$ matrix.
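
The reframing above can be written out as a short derivation. This is a sketch under the standard scaled dot-product formulation; the symbols $d$ (channel width), $h$ (hidden width), $\sigma$, $W_1$, and $W_2$ are notation introduced here for illustration, and folding the $1/\sqrt{d}$ scale into $W_1$ is an assumption rather than something the summary specifies.

```latex
\begin{aligned}
\text{Attention:}\quad & \mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V,
  \qquad Q, K, V \in \mathbb{R}^{N \times d} \\
\text{Two-layer MLP:}\quad & \mathrm{MLP}(X) = \sigma(X W_1)\, W_2,
  \qquad W_1 \in \mathbb{R}^{d \times h},\; W_2 \in \mathbb{R}^{h \times d} \\
\text{Identification:}\quad & X = Q,\quad W_1 = \frac{K^\top}{\sqrt{d}},\quad W_2 = V,\quad
  \sigma = \mathrm{Softmax},\quad h = N
\end{aligned}
```

Read this way, the hidden width of the MLP equals the sequence length $N$, so applying it to all $N$ queries costs $O(N^2 d)$. A predictor whose parameter shapes do not grow with $N$ would avoid that cost.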

Dynamic Parameterization as Attention Replacement

The central question is whether dynamic parameterization suffices for Transformer-level global modeling with strictly linear complexity. To test this, the paper designs several dynamic parameter prediction strategies: linear and correlation-based predictors and deep and bilateral activations for linear layers; and spatially adaptive, decoupled amplitude-direction, and convolutional predictors for depthwise convolution. All of them are conditioned on compression paradigms that decouple parameter generation from input length.
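
To make "decoupling parameter generation from input length" concrete, here is a minimal Python sketch of two compression statistics whose size depends only on the channel width, not on the number of tokens. The function names and the exact form of the correlation statistic (a second-moment matrix $X^\top X / N$) are assumptions chosen for illustration; the paper's predictors may compress differently.

```python
def mean_pool(X):
    """Compress N tokens of width d into a length-d vector; cost O(N * d)."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def channel_correlation(X):
    """Hypothetical correlation statistic: the d x d matrix X^T X / N.

    Cost is O(N * d^2), which is linear in the number of tokens N.
    """
    n = len(X)
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in cols]
            for ci in cols]

short_seq = [[1.0, 2.0], [3.0, 4.0]]   # 2 tokens, width 2
long_seq = short_seq * 50              # 100 tokens, same width

for seq in (short_seq, long_seq):
    pooled = mean_pool(seq)
    corr = channel_correlation(seq)
    # The compressed context has the same shape for both sequence lengths.
    print(len(seq), len(pooled), (len(corr), len(corr[0])))
```

Any predictor that consumes only these fixed-size statistics has a cost independent of sequence length, so the total cost stays linear in $N$.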

The dynamic weights modulate static parameters ($W_0 + \Delta W(X)$ for both linear and depthwise convolution layers), integrating global information via implicit transformation rather than explicit token routing. Spatial compression through pooling and correlation statistics ensures that parameter prediction scales linearly, establishing a principled foundation for efficient, global modeling architectures.
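
The following Python sketch illustrates the $W_0 + \Delta W(X)$ idea for a linear layer. It is not the paper's predictor: the rank-1 update built from mean-pooled features, and the names `dynamic_linear`, `P_in`, and `P_out`, are assumptions made to keep the example small.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mean_pool(X):
    """Compress N tokens of width d into one length-d context vector."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def dynamic_linear(X, W0, P_in, P_out):
    """Apply Y = X (W0 + dW(X)) with a hypothetical rank-1 predictor for dW."""
    ctx = mean_pool(X)                 # global context, cost O(N * d_in)
    u = matvec(P_in, ctx)              # length d_in, independent of N
    v = matvec(P_out, ctx)             # length d_out, independent of N
    W = [[W0[i][j] + u[i] * v[j] for j in range(len(v))]
         for i in range(len(u))]       # static weight plus dynamic update
    # One shared weight matrix is applied to every token: O(N * d_in * d_out).
    return [[sum(x[i] * W[i][j] for i in range(len(u)))
             for j in range(len(v))] for x in X]

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = dynamic_linear(X, identity, identity, identity)

# Change only the last token. The first token's output still changes,
# because the shared weights depend on the pooled global context.
X2 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
Y2 = dynamic_linear(X2, identity, identity, identity)
print(Y[0], Y2[0])
```

No token-to-token matrix is ever formed here, yet every output depends on every input through the predicted weights. This is the sense in which global information is integrated implicitly.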

WeightFormer Architecture and Empirical Evaluation

WeightFormer is instantiated by integrating the aforementioned dynamic parameterization techniques sparsely across blocks, striking a balance between model capacity and efficiency. Dynamic blocks replace standard layers with spatially adaptive depthwise convolutions and bilateral-activated MLPs. An ablation on block frequency reveals that sparse insertion (every third block) maximizes accuracy and throughput, avoiding optimization difficulties attendant to overly dense dynamic parameterization.
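
As a rough illustration of sparse insertion, the Python sketch below lays out a block schedule with one dynamic block in every group of three. The depth of 12, the position of the dynamic block within each group, and the labels `"static"` and `"dynamic"` are all assumptions; the summary only states that dynamic blocks appear every third block.

```python
def block_schedule(depth, period=3):
    """Label each block, making the last block of every `period` dynamic.

    Which block within each group is dynamic is an assumption for illustration.
    """
    return ["dynamic" if (i + 1) % period == 0 else "static"
            for i in range(depth)]

schedule = block_schedule(12)
print(schedule)
print(schedule.count("dynamic"), "of", len(schedule), "blocks are dynamic")
```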

Image Classification

On ImageNet-1K, WeightFormer-S achieves 81.3% top-1 accuracy, exceeding DeiT-S and ConvNeXt-S with similar parameter/FLOPs budgets. Importantly, WeightFormer maintains linear complexity and scales efficiently to high-resolution inputs, with throughput and memory advantages (up to $7.7\times$ at $1248 \times 1248$ resolution compared to DeiT).
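
A quick piece of arithmetic shows why the gap widens at high resolution. The Python sketch below assumes a patch size of 16 (the usual DeiT setting, which the summary does not state) and only counts tokens and attention-matrix entries. It does not model real throughput or memory, so it should not be read as reproducing the reported $7.7\times$ figure.

```python
def num_tokens(side, patch=16):
    """Number of patch tokens for a square image (patch size is an assumption)."""
    return (side // patch) ** 2

for side in (224, 1248):
    n = num_tokens(side)
    # A linear-cost layer grows with n; an explicit attention matrix has n * n entries.
    print(f"{side}x{side}: tokens={n}, attention entries={n * n}")

growth_linear = num_tokens(1248) / num_tokens(224)
growth_quadratic = growth_linear ** 2
print(f"token growth: {growth_linear:.1f}x, attention-matrix growth: {growth_quadratic:.1f}x")
```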

Object Detection and Segmentation

On COCO, WeightFormer-T provides consistent improvements in box (45.0 vs 44.4) and mask (38.3 vs 38.1) AP over DeiT-T, with substantial FLOPs reduction (566G vs 594G total, 77G vs 106G backbone). In semantic segmentation on ADE20K, WeightFormer-S outperforms DeiT-S by a 1.6-point mIoU margin with lower backbone FLOPs.

Image Generation

FID scores on class-conditional ImageNet-1K generation indicate competitive sample quality across WeightFormer configurations, which match or improve on the FID of DiT and DiG baselines at reduced computational cost.

Effective Receptive Fields

Analysis of effective receptive fields (ERFs) shows that dynamic-weighted models have expansive, global receptive fields, supporting the claim that implicit parameterization achieves global reasoning akin to explicit attention without quadratic overhead.

Theoretical Implications and Directions

Dynamic parameterization fundamentally alters the conceptual landscape for global sequence modeling. The dynamic MLP interpretation explains the quadratic scaling of classic attention and establishes that global context aggregation does not require explicit token-to-token routing. This opens a principled design space for efficient architectures across vision and potentially broader domains.

The expressivity and inductive biases of dynamic parameterization, particularly how spatial and channel compression interact with semantic representation, remain undercharacterized. Training stability and optimization are non-trivial due to input-conditioned gradient flow. The paper restricts its evaluation to vision tasks; generalization to language or multimodal domains is an open avenue. Extensions could leverage more advanced parameter generation, adaptive distribution of dynamic blocks, and deeper theoretical analysis of global modeling power and limitations.

Conclusion

This work demonstrates that explicit attention is not strictly necessary for global sequence modeling. By recasting attention as a dynamic parameterized MLP and systematically validating dynamic parameterization within the WeightFormer architecture, the authors show that linear-time, global, efficient modeling is feasible and competitive in vision tasks. This paradigm shift motivates future development of architectures free from quadratic attention, promising both computational savings and potential advances in scalability and representational power across AI domains.
