Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

Published 29 May 2026 in cs.IR and cs.CV | (2605.30917v1)

Abstract: As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. Code will be released soon at https://github.com/naver/v-splade.


Authors (8)

Summary

The paper introduces V-SPLADE, a novel sparse lexical retriever that indexes visual documents without runtime neural query encoding through caption-gated token supervision.
It adapts a compact vision-language model to project image inputs into a vocabulary-indexed sparse space, improving average NDCG@5 by 13.8 points over a same-scale dense baseline.
On an 18.7M-document corpus it more than doubles the dense baseline's R@5, and its encoding is 20× faster than caption or OCR pipelines, which supports production-scale deployment.

Introduction and Motivation

The growth of large-scale visual-document corpora, such as arXiv papers and enterprise PDFs, has created a need for scalable and efficient visual-document retrieval. Existing pipelines center on dense or multi-vector representations from vision-language models (VLMs), which deliver strong retrieval quality but require neural query encoding at serving time. Alternatively, OCR- or caption-based pipelines enable query-encoding-free lexical retrieval (e.g., with BM25), at the cost of slow and expensive upstream text extraction or caption generation.

"Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search" (2605.30917) presents V-SPLADE, a sparse lexical retriever that indexes visual documents directly. It targets a serving regime that neither family above covers: production-scale retrieval with no neural query encoding and no costly upstream text pipeline. Central to the work is the diagnosis and resolution of a lexical grounding problem: existing sparse encoders struggle to reliably activate the relevant lexical dimensions from image inputs, which limits retrieval performance under high sparsity.

Lexical Grounding Problem in Visual-Document Sparse Retrieval

Text-based learned sparse retrieval models (e.g., SPLADE) achieve robust lexical matching because input tokens map directly to vocabulary indices. Images of documents have no such token anchors: lexical information must be inferred from visual cues, so the output sparse representation covers only part of the document's actual lexical content. The paper's diagnostic study shows a large gap in overlap with the source bag-of-words (BoW) between text and image representations (e.g., at top-30: 0.974 for text vs. 0.560 for the rendered image). This lexical grounding deficit constrains the effectiveness of sparse retrieval in visual domains.
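The overlap diagnostic can be illustrated with a small sketch. The exact definition used in the paper is not given in this summary, so the function below assumes one plausible reading: the fraction of the k highest-weighted vocabulary terms that also occur in the source text. The function name and the toy vectors are hypothetical.

```python
def topk_bow_overlap(sparse_vec, source_tokens, k=30):
    """Fraction of the k highest-weighted terms that occur in the source text.

    Assumed definition for illustration; the paper may compute overlap differently.
    """
    top = sorted(sparse_vec, key=sparse_vec.get, reverse=True)[:k]
    if not top:
        return 0.0
    return sum(term in source_tokens for term in top) / len(top)


# Toy data: the text view activates only source words, the image view drifts.
source = {"sparse", "retrieval", "visual", "document"}
text_vec = {"sparse": 2.1, "retrieval": 1.8, "visual": 1.2, "document": 0.9}
image_vec = {"sparse": 1.9, "figure": 1.5, "retrieval": 1.1, "page": 0.7}

print(topk_bow_overlap(text_vec, source, k=4))   # 1.0
print(topk_bow_overlap(image_vec, source, k=4))  # 0.5
```

A lower value for the image view corresponds to the grounding gap described above.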

Caption-Gated Token Supervision

To remediate this grounding deficit, the authors propose caption-gated token supervision as a training-only signal, leveraging VLM-generated captions as lexical cues. During training, both the rendered document image and its offline-generated caption are encoded into the same vocabulary-indexed sparse space. The caption sparse vector "gates" the image vector—only dimensions corroborated by both views are reinforced, while others are suppressed. This intersection expands recall of lexical dimensions actually present in the underlying document, even when the image encoder alone would miss them.

Figure 1: Caption-gated token supervision overview.

At inference, only the image encoder is used, ensuring efficient, inference-free deployment.
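As a rough illustration of the gating idea, the sketch below splits an image sparse vector into dimensions that the caption vector corroborates and dimensions it does not. This summary does not give the actual training objective, so the hard threshold, the function name, and the toy vectors are all assumptions; in training, the corroborated set would presumably be reinforced and the rest suppressed through the loss.

```python
def caption_gate(image_vec, caption_vec, threshold=0.0):
    """Split image activations into caption-corroborated and uncorroborated terms.

    Hypothetical hard gate; the paper's gating may be soft or loss-based.
    """
    kept, suppressed = {}, {}
    for term, weight in image_vec.items():
        if weight <= 0.0:
            continue  # inactive dimension, nothing to gate
        if caption_vec.get(term, 0.0) > threshold:
            kept[term] = weight
        else:
            suppressed[term] = weight
    return kept, suppressed


image_vec = {"transformer": 1.4, "page": 0.9, "2019": 0.3}
caption_vec = {"transformer": 1.1, "2019": 0.8, "attention": 0.6}

kept, suppressed = caption_gate(image_vec, caption_vec)
print(kept)        # {'transformer': 1.4, '2019': 0.3}
print(suppressed)  # {'page': 0.9}
```

The caption vector is needed only to compute this training signal, which is consistent with serving from the image encoder alone.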

V-SPLADE Architecture

V-SPLADE adapts SPLADE for visual-document retrieval by employing a compact vision-language backbone (ModernVBERT, ~250M parameters) with an LM head that projects visual hidden states into a vocabulary-indexed sparse representation. Queries are represented with a Li-LSR-style inference-free learned token lookup, and softplus activation is adopted instead of ReLU for stability in the visual sparse regime.
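A minimal sketch of what an inference-free query encoder can look like: each query token is mapped to a learned scalar weight by table lookup, with no neural forward pass at serving time. The table contents, the whitespace tokenization, and the point at which the activation is applied are assumptions for illustration, not details confirmed by this summary.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def relu(x):
    return max(x, 0.0)


# Hypothetical learned per-token parameters (one scalar per vocabulary entry).
RAW_TOKEN_PARAMS = {"sparse": 1.2, "retrieval": 0.8, "the": -3.0}


def encode_query(tokens, activation=softplus):
    """Build a sparse query vector by table lookup only."""
    return {t: activation(RAW_TOKEN_PARAMS[t]) for t in tokens if t in RAW_TOKEN_PARAMS}


query = "the sparse retrieval".split()
print(encode_query(query, relu))      # 'the' gets exactly 0.0
print(encode_query(query, softplus))  # 'the' keeps a small positive weight
```

With ReLU, a token whose parameter goes negative receives exactly zero weight and zero gradient, whereas softplus keeps both small but nonzero. That is one common motivation for preferring softplus when stability matters.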

Key scoring and ranking mechanisms:

Retrieval is based on sparse lexical dot product between query and document sparse vectors.
Training employs an InfoNCE batchwise loss, a separate ranking loss on captions, and a Flops regularizer to enforce sparsity.
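The loss terms listed above can be sketched in plain Python. The in-batch InfoNCE and the FLOPS regularizer (sum over vocabulary dimensions of the squared batch-mean activation) follow their standard definitions; how V-SPLADE weights and combines them, and the exact form of the caption ranking loss, are not stated in this summary, so the temperature default and the toy batch are assumptions.

```python
import math


def dot(q, d):
    """Sparse dot product between two {term: weight} vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def info_nce(queries, docs, temperature=1.0):
    """In-batch InfoNCE: docs[i] is the positive for queries[i], the rest are negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def flops_reg(vectors):
    """FLOPS regularizer: sum over terms of (mean activation across the batch) squared."""
    n = len(vectors)
    terms = set().union(*vectors)
    return sum((sum(v.get(t, 0.0) for v in vectors) / n) ** 2 for t in terms)


queries = [{"sparse": 1.0}, {"visual": 1.0}]
docs = [{"sparse": 2.0, "index": 0.5}, {"visual": 2.0, "index": 0.5}]

print(round(info_nce(queries, docs), 4))  # 0.1269
print(flops_reg(docs))                    # 2.25
```

A combined objective would plausibly add an InfoNCE term on image vectors, a ranking term on caption vectors, and a weighted `flops_reg` term, with the weights treated as hyperparameters.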

Empirical Results

Retrieval Quality:

On six public visual-document retrieval benchmarks, V-SPLADE (quality variant):

Outperforms BiModernVBERT (a state-of-the-art dense retriever at the same ~250M-parameter scale) by +13.8 average NDCG@5 points.
Surpasses OCR- and caption-based BM25 approaches by up to +6.3 NDCG@5 points.

Scalability:

On an 18.7M-document corpus:

V-SPLADE achieves R@5 of 0.228 (vs. 0.090 for dense baseline).
Encoding is 20× faster than state-of-the-art caption or OCR pipelines.
Sparse retrieval retains recall more robustly under corpus scaling relative to dense alternatives.

Latency:

Supports sub-10ms exact inverted-index search.
Two-stage pruned index pipelines yield latency comparable to HNSW-based dense search.

Figure 2: Online serving FLOPs per query.
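Exact inverted-index search over sparse vectors can be sketched as follows. This is a toy in-memory version for illustration only; the index layout, function names, and data are assumptions, and a production system would add posting-list compression and pruning.

```python
import heapq
from collections import defaultdict


def build_index(docs):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            if weight > 0.0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vec, k=5):
    """Exact top-k by sparse dot product, touching only postings of query terms."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


docs = {
    "page-1": {"sparse": 2.0, "retrieval": 1.0},
    "page-2": {"dense": 2.0, "retrieval": 1.5},
    "page-3": {"ocr": 1.0},
}
index = build_index(docs)
print(search(index, {"sparse": 1.0, "retrieval": 0.5}, k=2))
# [('page-1', 2.5), ('page-2', 0.75)]
```

Query cost depends on the length of the posting lists for the query's terms rather than on a neural forward pass, which is why high sparsity matters for latency.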

Corpus Scaling Robustness:

Dense R@5 drops to 35% of its 500K-corpus performance at 18.7M scale; V-SPLADE retains 54%, showing greater scalability.

Figure 3: Normalized recall degradation with corpus scaling (R@5, normalized to 500K = 1.0).
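The reported numbers can be cross-checked with simple arithmetic. The implied 500K-corpus values below are back-of-envelope estimates that assume the R@5 figures and the retention percentages refer to the same evaluation setting; they are not values reported in this summary.

```python
v_splade_r5, dense_r5 = 0.228, 0.090        # R@5 at 18.7M documents
v_splade_kept, dense_kept = 0.54, 0.35      # fraction of 500K-corpus R@5 retained

# Relative gain at 18.7M: consistent with "more than doubles R@5".
ratio = v_splade_r5 / dense_r5
print(round(ratio, 2))  # 2.53

# Implied R@5 at 500K documents (estimate, under the assumption stated above).
implied_v_splade_500k = v_splade_r5 / v_splade_kept
implied_dense_500k = dense_r5 / dense_kept
print(round(implied_v_splade_500k, 3))  # 0.422
print(round(implied_dense_500k, 3))     # 0.257
```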

Complementarity with Neural Retrieval:

Score fusion with neural retrievers (e.g., BiModernVBERT, Qwen3-VL-Embedding-2B) yields up to +2.4pp R@5 improvement; V-SPLADE also serves as a strong first-stage retriever for multi-vector reranking, improving downstream recall ceiling.
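Score fusion between a sparse and a dense retriever can be sketched as a weighted sum of min-max normalized scores. The fusion scheme actually used in the paper is not specified in this summary, so the normalization, the weight `alpha`, and the toy scores are assumptions.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(sparse_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * sparse + (1 - alpha) * dense on normalized scores."""
    s, d = min_max(sparse_scores), min_max(dense_scores)
    fused = {
        doc: alpha * s.get(doc, 0.0) + (1.0 - alpha) * d.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
    return sorted(fused, key=fused.get, reverse=True)


sparse_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
dense_scores = {"b": 0.9, "c": 0.8, "d": 0.1}
print(fuse(sparse_scores, dense_scores))  # ['b', 'a', 'c', 'd']
```

Document "b" rises to the top because both retrievers rank it reasonably well, which is the kind of complementarity that fusion is meant to exploit.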

Figure 4: Two-stage retrieval with ColModernVBERT reranking.
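A two-stage pipeline of this kind can be sketched as sparse candidate generation followed by multi-vector reranking. The sketch assumes ColBERT-style MaxSim scoring for the reranker; this summary does not state the exact scoring ColModernVBERT uses, and the embeddings below are toy values.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its best-matching doc vector."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )


def rerank(candidates, query_vecs, doc_multivecs, k=5):
    """Rescore first-stage candidates with MaxSim and keep the top k."""
    scored = [(doc_id, maxsim(query_vecs, doc_multivecs[doc_id])) for doc_id in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_multivecs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.5, 0.5]],
}
first_stage = ["b", "a"]  # candidate order from the sparse retriever (toy)
print(rerank(first_stage, query_vecs, doc_multivecs))  # [('a', 2.0), ('b', 1.0)]
```

The reranker can only reorder what the first stage returns, so first-stage recall sets the ceiling on final recall.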

Precision on Lexically Specific Queries:

Largest gains observed on queries involving numerals or capitalized tokens (proper nouns), with R@5 increases up to +0.228 absolute points relative to dense baselines.

Token-Level Analysis:

Caption-gated supervision increases contentful token activation, reduces reliance on stopwords, and improves ranking margins. The mechanism induces complementary token sets between passage and caption sparse representations, providing distinct and synergistic lexical evidence for retrieval.

Figure 5: Stopword fraction among query--image overlap.
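The stopword-fraction statistic can be approximated with a short sketch: among terms active in both the query and the image sparse vectors, count how many are stopwords. The stopword list and the toy vectors are illustrative assumptions, and the paper's exact measurement may differ.

```python
STOPWORDS = {"the", "of", "and", "in", "a", "to"}  # illustrative list only


def stopword_fraction(query_vec, image_vec):
    """Share of stopwords among terms active in both sparse vectors."""
    overlap = {
        term for term, weight in query_vec.items()
        if weight > 0.0 and image_vec.get(term, 0.0) > 0.0
    }
    if not overlap:
        return 0.0
    return sum(term in STOPWORDS for term in overlap) / len(overlap)


query_vec = {"the": 0.2, "table": 1.0, "of": 0.1, "2019": 1.5}
image_vec = {"the": 0.5, "table": 0.9, "2019": 1.1}
print(round(stopword_fraction(query_vec, image_vec), 3))  # 0.333
```

A lower fraction means the match is carried by contentful terms such as "table" and "2019" rather than by function words.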

Theoretical and Practical Implications

The work targets scalable, production-grade visual-document retrieval in environments where neural query encoding is cost-prohibitive. It identifies the lexical grounding bottleneck as a fundamental limitation of vision-based sparse models and shows that caption-gated supervision improves sparse retrieval over document images without runtime overhead or reliance on slow upstream text extraction.

Practically, V-SPLADE enables:

Deployment for massive-scale, low-latency retrieval scenarios (e.g., search over tens of millions of scientific papers or business documents).
Seamless integration as a complementary lexical layer to dense and multi-vector systems, facilitating hybrid pipelines.
Significant latency and throughput improvements over pipeline-based lexical alternatives (OCR, caption generation).

Theoretically, the findings highlight:

The inherent capacity limits of fixed-dimension dense retrieval under corpus scaling.
The benefit of expanding the support for lexical evidence via hybrid multimodal signals, and the pitfalls of relying on image-only supervision for sparse token recovery.

Future Directions

Potential future research avenues include:

Extension and evaluation on multilingual document corpora and other modalities (e.g., video, natural images).
Scaling to larger VLM backbones and exploring prompt-controlled, region-adaptive captioning strategies.
Investigating the transferability of lexical grounding techniques to broader multimodal tasks beyond retrieval.

Conclusion

V-SPLADE establishes inference-free, lexically grounded sparse retrieval as a competitive and scalable solution for visual-document search. By addressing the lexical grounding challenge via caption-gated token supervision, it achieves strong empirical results across diverse benchmarks and at production scale, while remaining highly complementary to dense and multi-vector retrieval architectures. This positions it as a practical foundation for next-generation information retrieval systems targeting large, heterogeneous visual-document corpora.
