Your Embedding Model is SMARTer Than You Think

Published 24 May 2026 in cs.IR, cs.AI, and cs.CV | (2605.24938v1)

Abstract: Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. We also reveal SMART's superior performance, as simple lightweight post-training not only saves time and compute, but also brings forth further improvement on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.

Authors (6)

Summary

The paper demonstrates that single-vector embedding models implicitly encode token-level information through contrastive training, enabling effective fine-grained retrieval.
It validates SMART’s inference-only adaptation using controlled benchmarks and multi-modal tasks, achieving up to +2.54% improvement in retrieval accuracy.
The proposed hybrid scoring scheme combines pooled and token-level scores, offering a plug-and-play upgrade with minimal post-training to boost performance.

SMART: Enabling Efficient Multi-Vector Retrieval from Single-Vector Embedding Models

Introduction and Motivation

Single-vector embedding models dominate multimodal retrieval by compressing token sequences into a global pooled representation, optimizing for efficient nearest-neighbor search across text, image, video, and document modalities. However, this paradigm fundamentally limits retrieval capacity, as global pooling erases localized evidence crucial for dense retrieval tasks where queries depend on fine-grained regional or token-level semantics. Empirical and theoretical analyses confirm that single-vector retrievers are strictly bounded by embedding dimensionality, often failing when relevance hinges on local features.

Multi-vector architectures (e.g., ColBERT, ColPali, jina-embeddings-v4) mitigate these limits via late-interaction mechanisms, retaining token-level representations and computing MaxSim-style relevance. The trade-off is considerable: such models require full-scale task-specific retraining, adapter construction, or additional learnable tokens, and they incur computational and memory costs that scale with sequence length. Critically, many ignore the need for a globally summarizing representation for effective retrieval.

SMART (Single-to-Multi Adaptation for Retrieval Transformers) addresses these challenges by exploiting a key observation: standard contrastive training on the pooled token implicitly structures preceding hidden states in a manner compatible with token-level retrieval. Through gradient flow and the network’s architecture, local semantic information is maintained within hidden states though not explicitly optimized for retrieval. SMART unlocks these latent capabilities via direct late-interaction, transforming any single-vector embedder into a highly effective multi-vector retriever at zero training cost, and further enhancing performance with minimal lightweight post-training.

Figure 1: Single-vector models discard local token information via pooling; SMART reveals geometric alignment of hidden states and enables effective multi-vector retrieval with both inference-only adaptation and lightweight post-training.

Framework: Leveraging Implicit Local Evidence

Single-vector embedding models are trained with a contrastive InfoNCE loss applied only to the pooled token (e.g., the final-layer <eot> representation). Even so, every non-pooling hidden state lies on the gradient path, because in a transformer the pooled embedding aggregates signal from every non-pooling token. This indirect supervision, rooted in the cosine-similarity objective, encourages the non-pooled hidden states to organize in a way that is compatible with local similarity computation, even without explicit token-level retrieval supervision.
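For reference, the pooled-only objective described above can be written in the standard InfoNCE form. This is a sketch of the usual formulation, not an equation taken from the paper: the temperature $\tau$, the candidate set indexed by $j$ (the positive $c^{+}$ plus negatives), and the symbol $h^{\mathrm{pool}}$ for the pooled embedding are notational assumptions.

$$
\mathcal{L}_\mathrm{InfoNCE} = -\log \frac{\exp\big(s_\mathrm{single}(q, c^{+})/\tau\big)}{\sum_{j} \exp\big(s_\mathrm{single}(q, c_j)/\tau\big)}, \qquad s_\mathrm{single}(q, c) = \cos\big(h^{\mathrm{pool}}_{q},\, h^{\mathrm{pool}}_{c}\big)
$$

Only $h^{\mathrm{pool}}$ appears in the loss, but since it is computed from the other tokens' hidden states through attention, gradients of $\mathcal{L}_\mathrm{InfoNCE}$ still reach those states.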

SMART operates by extracting these hidden states at inference. For each input, the model produces token-level hidden states, which can be directly matched via MaxSim late-interaction: for every query token, find the most similar candidate token using normalized cosine similarity, and aggregate over all query tokens. This score ( $s_\mathrm{late}$ ) is combined with the original pooled score ( $s_\mathrm{single}$ ) in a hybrid scoring scheme ( $s_\mathrm{hybrid}$ ), retaining global compatibility while exposing local evidence.
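The scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-query-token maxima are averaged (a sum is an equally common choice), and $s_\mathrm{hybrid}$ is taken to be the unweighted sum of $s_\mathrm{single}$ and $s_\mathrm{late}$, since the text says only that the two scores are combined.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def maxsim_score(query_tokens, cand_tokens):
    """s_late: best cosine match per query token, aggregated over query tokens.

    query_tokens: (n_q, d) hidden states, cand_tokens: (n_c, d) hidden states.
    ASSUMPTION: aggregation is a mean; the source only says "aggregate".
    """
    q = l2_normalize(query_tokens)
    c = l2_normalize(cand_tokens)
    sim = q @ c.T                      # (n_q, n_c) cosine similarities
    return float(sim.max(axis=1).mean())


def single_score(q_pooled, c_pooled):
    """s_single: cosine similarity between the two pooled embeddings."""
    return float(l2_normalize(q_pooled) @ l2_normalize(c_pooled))


def hybrid_score(q_pooled, c_pooled, query_tokens, cand_tokens):
    """s_hybrid. ASSUMPTION: an unweighted sum of s_single and s_late."""
    return single_score(q_pooled, c_pooled) + maxsim_score(query_tokens, cand_tokens)


# Tiny hand-made example: one candidate keeps two distinct token directions,
# the other has a single "blurred" token pointing between them.
query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
sharp_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
blurred_tokens = np.array([[1.0, 1.0]])
pooled = np.array([1.0, 1.0])          # identical pooled vectors for both candidates

sharp = hybrid_score(pooled, pooled, query_tokens, sharp_tokens)
blurred = hybrid_score(pooled, pooled, query_tokens, blurred_tokens)
print(sharp > blurred)
```

In this example the pooled score alone cannot separate the two candidates, while the late-interaction term prefers the one whose tokens match the query tokens individually.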

This adaptation is hyperparameter-free and requires no additional training, acting as a plug-and-play upgrade for any single-vector backbone.

Empirical Validation: Controlled Benchmark Analysis

To explicitly test the pooling bottleneck, the authors constructed a controlled visual report benchmark in which each query targets a local code–marker binding. Hard negatives share the positive's layout, codes, colors, and shapes, but their bindings are permuted so that no local binding is correct. Pooled single-vector scoring selects the positive only 31.9% of the time, while late-interaction over hidden states raises this to 56.8%, confirming that the hidden states carry useful local evidence that pooling cannot access.
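The intuition behind this benchmark can be reproduced with a toy example. The sketch below is an illustration only, not the authors' benchmark: it uses random vectors as stand-ins for hidden states, represents each code–marker binding as a concatenation of a code vector and a marker vector, and uses mean pooling as a stand-in for the model's pooled token. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items = 16, 4

# Hypothetical stand-ins for the embeddings of codes and markers.
codes = rng.standard_normal((n_items, dim))
markers = rng.standard_normal((n_items, dim))


def bind(code_idx, marker_idx):
    """One token per binding: concatenate a code vector with a marker vector."""
    return np.concatenate([codes[code_idx], markers[marker_idx]], axis=1)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def maxsim(query_tokens, cand_tokens):
    """Mean over query tokens of the best cosine match among candidate tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    c = cand_tokens / np.linalg.norm(cand_tokens, axis=1, keepdims=True)
    return float((q @ c.T).max(axis=1).mean())


# Positive: code i is bound to marker i.
positive = bind([0, 1, 2, 3], [0, 1, 2, 3])
# Hard negative: same codes and markers, cyclically shifted so no binding is correct.
negative = bind([0, 1, 2, 3], [1, 2, 3, 0])
# Query asks about a single binding: code 0 with marker 0.
query = bind([0], [0])

# Mean pooling sees the same set of codes and markers in both candidates.
pooled_pos = cosine(query.mean(axis=0), positive.mean(axis=0))
pooled_neg = cosine(query.mean(axis=0), negative.mean(axis=0))

# Token-level matching can find the exact binding only in the positive.
late_pos = maxsim(query, positive)
late_neg = maxsim(query, negative)

print("pooled scores tie:", np.isclose(pooled_pos, pooled_neg))
print("late interaction prefers positive:", late_pos > late_neg)
```

Because the negative contains exactly the same codes and markers, its mean-pooled vector matches the positive's, so the pooled score cannot rank them. MaxSim finds an exact token match only in the positive.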

Figure 2: Controlled benchmark isolates local binding retrieval—hard negatives share global content; late-interaction reveals local evidence missed by pooled score.

Broad-Scale Evaluation: MMEB-V2 Retrieval Tasks

SMART was applied inference-only to a range of state-of-the-art models (VLM2Vec-V2.0, GME, Qwen3-VL-Embedding) on MMEB-V2, covering image, visual document, and video retrieval. Across all backbones and domains, SMART consistently improves retrieval metrics, by up to +2.54%, and the gains hold even on the SoTA Qwen3-VL-Embedding-8B. These gains are achieved entirely at inference, with no additional parameter updates.

Lightweight Post-Training and Efficient Conversion

SMART’s efficacy is further amplified by minimal post-training. Freezing the embedder and training only a token-wise adapter boosts performance by an additional point or more, converting a single-vector model into a SoTA multi-vector retriever, often in under 2 hours (for Qwen3-VL-Embedding-2B). Compared with full-scale multi-vector training (e.g., LamRA-Multi), SMART conversion achieves competitive performance while saving $\sim20\%$ of the training time.

Qualitative Visualizations

SMART corrects single-vector failures where global similarity leads to plausible but incorrect matches, recovering localized details through token-level evidence. Visualization of retrieval tasks shows SMART retrieving correct instances missed by pooling, with MaxSim mapping query patches to meaningful candidate regions.

Figure 3: SMART rectifies single-vector retrieval errors by leveraging token-level late-interaction; colored boxes indicate evidence-driven localized matches.

Figure 4: Additional examples—SMART retrieves candidates missed by pooling through token-level matching, critical for fine-grained visual information.

Figure 5: Token-level visualization—SMART maps selected query regions to top candidate-image tokens, highlighting retrieval of localized evidence.

Layer-wise Analysis

Late-interaction performance increases monotonically with layer depth in pooling-based readouts, confirming that deep layers encode progressively discriminative representations. Importantly, the final-layer pooled token remains robust even when paired with earlier hidden states for late interaction, supporting flexibility in hybrid scoring decisions.

Implications and Future Directions

The practical implications of SMART are substantial: inference-only adaptation improves retrieval accuracy and is compatible with all tested backbones, providing a way to unlock latent performance from off-the-shelf models with no additional training. On the theoretical side, the results challenge the notion that single-vector models are inherently unable to leverage local evidence, demonstrating that contrastive training does structure hidden states favorably.

Future directions involve expanding SMART to more complex temporal and holistic tasks (e.g., video moment retrieval, grounding), exploring integration with new aggregation modules that can bridge fine-grained alignment and high-level abstraction. The approach also invites investigation into amortized retrieval pipelines and the integration with continual learning for real-time multimodal search applications.

Conclusion

SMART establishes that single-vector embedding models retain latent multi-vector retrieval capabilities, recoverable via direct late-interaction scoring over hidden states. The framework acts as an efficient inference-time enhancement and enables conversion to powerful multi-vector retrievers with minimal additional training. SMART thus elevates dense retrieval performance, overcoming long-standing expressiveness bottlenecks and maximizing architectural efficiency in multimodal embedding models.
