A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

Published 15 Apr 2026 in cs.IR | (2604.14403v1)

Abstract: Traditional Retrieval-Augmented Generation (RAG) approaches generally assume that retrieval and generation occur on powerful servers removed from the end user. While this reduces local hardware constraints, it introduces significant drawbacks: privacy concerns regarding data access, recurring maintenance and storage costs, increased latency, and the necessity of an internet connection. On-device RAG addresses these challenges by executing the entire pipeline locally, making it ideal for querying sensitive personal information such as financial documents, contact details, and medical history. However, on-device deployment necessitates a delicate balance between limited memory and disk space. Specifically, the context size provided to the generative model must be restricted to manage KV cache and attention memory usage, while the size of stored embeddings must be minimized to preserve disk space. In this work, we propose a unified model that compresses the RAG context and utilizes the same representations for retrieval. This approach minimizes disk utilization compared to using separate representations, while significantly reducing the context size required for generation. With an average of 1/10 of the context, our model matches the performance of a traditional RAG reader without increasing storage requirements compared to a multi-vector retrieval model. This approach represents the first model to unify retrieval and context compression using a shared model and representation. We believe this work will inspire further consolidation of distinct models to optimize on-device performance.

Summary

  • The paper introduces ECG, a unified model that integrates retrieval embedding and generative context into a single document representation.
  • It employs a two-phase training pipeline with self-supervised pretraining and RAG-specific fine-tuning to balance retrieval quality and generation accuracy.
  • Empirical results on SmolLM and Gemma models show significant improvements in Exact Match metrics while halving disk usage compared to dual-representation systems.

Unified Model and Document Representation for On-Device RAG: Technical Summary and Implications

Motivation and Problem Setting

The adoption of Retrieval-Augmented Generation (RAG) paradigms has been pivotal in incorporating external knowledge into LLMs, enhancing their utility in knowledge-intensive tasks. However, standard RAG architectures assume server-side computation and storage, which raises privacy, latency, cost, and offline-availability problems, especially when handling sensitive personal data or running inference on-device. Existing on-device RAG research often focuses on either hardware-level optimizations or context compression techniques, but rarely addresses the compounded model, memory, and storage overhead that arises from keeping retrieval and generation in separate architectures.

This work addresses the on-device RAG bottleneck by proposing a unified approach that merges document retrieval embedding, context compression, and response generation within a single model and representation space. Central to this is the hypothesis that a single vector-based document representation can serve both high-recall retrieval and high-utility context input for conditional generation, ultimately enabling aggressive context compression and disk usage minimization. The architectural unification is non-trivial due to potential representational conflicts between retrieval and generative objectives.

Architectural Overview and Training Pipeline

The proposed architecture, termed ECG (Embed, Compress, Generate), leverages a pretrained decoder transformer and augments it with a lightweight, multi-layer projection module that produces multi-vector representations from input documents. These serve dual purposes:

  • For retrieval, a projection block θ_ret produces document embeddings suitable for late-interaction similarity search (specifically, a mean-pooled MaxSim variant).
  • For generative context, a second projection block θ_comp computes compressed context embeddings fed back into the generation transformer.

Vectors from both blocks are derived from the transformer's final hidden states associated with inserted special tokens, ensuring that only one set of vector representations is stored per document.
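
The retrieval scoring described above can be illustrated with a small sketch. The summary only says the similarity is "a mean-pooled MaxSim variant", so the exact pooling used here (average over query vectors of the best dot product against any document vector) is an assumption, and the names `mean_pooled_maxsim` and `rank_documents` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pooled_maxsim(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Assumed variant: for each query vector keep its best-matching
    document vector (MaxSim), then average over query vectors."""
    if not query_vecs or not doc_vecs:
        raise ValueError("both inputs need at least one vector")
    best_per_query = [max(dot(q, d) for d in doc_vecs) for q in query_vecs]
    return sum(best_per_query) / len(best_per_query)


def rank_documents(
    query_vecs: List[Vector], index: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every stored document and return (doc_id, score), best first."""
    scored = [(doc_id, mean_pooled_maxsim(query_vecs, vecs)) for doc_id, vecs in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Averaging instead of summing keeps scores comparable when the number of query vectors varies, which is one plausible reason to prefer a mean-pooled form.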

The training pipeline is two-phased:

  1. Self-Supervised Pretraining: Passage splitting, with alternating reconstruction and next-segment generation, is used to teach the model both to compress context and generate meaningful retrieval embeddings through a paired combination of contrastive loss (in-batch, bidirectional) and next-token prediction. The embedding count per document (n) is dynamically varied to regularize compressibility.
  2. RAG-Specific Fine-Tuning: A distillation-based protocol jointly optimizes for answer generation (via KL minimization to a teacher reader on uncompressed context) and retrieval quality (InfoNCE and Margin MSE to distill from a strong teacher scoring model), with dynamically-learned loss scaling to balance the competing objectives.
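
As a rough illustration of the two retrieval losses named in the pipeline above, the sketch below implements an in-batch InfoNCE loss applied in both directions and a Margin MSE distillation loss. It assumes "bidirectional" means averaging the query-to-document and document-to-query terms; the `temperature` parameter and all function names are hypothetical, and this is not the authors' code.

```python
import math
from typing import List


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def bidirectional_info_nce(sim: List[List[float]], temperature: float = 1.0) -> float:
    """sim[i][j] is the similarity of query i and document j in a batch;
    the diagonal holds the positive pairs, everything else is a negative."""
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Query -> document direction: softmax over each row.
    q2d = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Document -> query direction: softmax over each column.
    columns = [[scaled[i][j] for i in range(n)] for j in range(n)]
    d2q = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (q2d + d2q)


def margin_mse(
    student_pos: List[float],
    student_neg: List[float],
    teacher_pos: List[float],
    teacher_neg: List[float],
) -> float:
    """Match the student's (positive - negative) score margin to the teacher's."""
    sq_errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(sq_errors) / len(sq_errors)
```

With an uninformative similarity matrix the contrastive loss equals log(batch size), and it falls as the diagonal entries grow relative to the off-diagonal ones.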

Notably, the model maintains a variable-length embedding representation to flexibly manage trade-offs between retrieval/generation quality and disk/storage capacity—unlike prior works requiring fixed-length or frozen representations.

Empirical Evaluation

The experiments are conducted using two model sizes representative of real-world on-device constraints (SmolLM-v2 135M and Gemma 3 1B). Evaluation focuses on Exact Match (EM) metrics across the Natural Questions (NQ) and TriviaQA benchmarks, under two resource-constrained settings:

  • Fixed Context Budget: Rigorously restricts active memory (number of context vectors) during inference.
  • Fixed Performance: Measures the compression ratio (number of vectors needed) for baselines to match the ECG model’s single-document performance.
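
For readers unfamiliar with the metric, Exact Match can be sketched as follows. The summary does not state which answer normalization the paper applies, so the SQuAD-style convention used here (lowercasing, stripping punctuation and English articles, collapsing whitespace) is an assumption.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Assumed SQuAD-style normalization applied before comparison."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the prediction equals any acceptable gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: List[str], references: List[List[str]]) -> float:
    """Fraction of questions answered exactly right (0.0 to 1.0)."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)
```

Under this definition, an EM of 0.34 means that about one in three generated answers matches a reference answer exactly.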

Competitor baselines include:

  • Purely parametric generation (no retrieval)
  • Standard RAG with separate retrieval and reader models (BM25, dense, and ColBERT variants)
  • RAG with context compression (COCOM-based), where retrieval and compression are handled separately.

Key Numerical Findings

  • At context budgets of 32 (SmolLM) / 16 (Gemma), ECG outperforms all baselines by substantial margins (EM ≈ 0.34–0.36 for NQ vs. ColBERT RAG baseline at ≈ 0.10–0.11), surpassing context compression models by similar margins.
  • To match ECG’s EM, baselines like standard RAG with ColBERT require 5× to 14× larger context budgets, with corresponding increases in disk and memory overhead.
  • Even at maximal context, standard RAG baselines often fail to attain ECG’s single-context accuracy (e.g., SmolLM NQ: 0.343 for ECG vs. 0.328 for ColBERT RAG at 8× context).
  • Unifying representations halves disk usage compared to dual-representation RAG, while achieving both retrieval and generation efficacy.
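
The storage claim in the last bullet follows from simple arithmetic, sketched below. All concrete numbers (10,000 documents, 16 vectors per document, 512 dimensions, fp16 values) are hypothetical and chosen only for illustration; the sketch also assumes the separate retrieval and compression sets would be the same size, and it ignores index structures and metadata.

```python
def embedding_store_bytes(
    num_docs: int,
    vectors_per_doc: int,
    dim: int,
    bytes_per_value: int = 2,      # fp16, an assumption
    num_representations: int = 1,  # 1 = unified, 2 = separate retrieval + compression sets
) -> int:
    """Raw size of the stored vectors, excluding any index or metadata."""
    return num_docs * vectors_per_doc * dim * bytes_per_value * num_representations


# Hypothetical corpus and model dimensions.
unified = embedding_store_bytes(10_000, 16, 512)
dual = embedding_store_bytes(10_000, 16, 512, num_representations=2)
print(f"unified: {unified / 1e6:.1f} MB, dual: {dual / 1e6:.1f} MB")
```

With these illustrative numbers the unified store takes about 164 MB and the dual store about 328 MB; the factor of two holds for any corpus size as long as both representation sets have equal shape.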

Ablation and Analysis

  • Joint training with dynamic loss scaling is critical: Without careful balancing, multi-task objective conflicts degrade EM by nearly 50%.
  • Contrastive loss during pretraining and aggressive hard negative sampling are essential for robust retrieval.
  • Single-representation models not only compress storage but also act as regularizers, as evidenced by improved robustness and effectiveness beyond disk/memory savings.
  • Empirically, both the ECG retriever and reader must be evaluated in hybrid settings; retrieval quality is the dominant performance driver, and the model is robust to noise when the retriever is high-quality.
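
The summary does not say how the loss scales are learned. One common technique, swapped in here purely as an illustration, is uncertainty-style weighting, where each loss L_i gets a learnable log-scale s_i and the combined objective is the sum of exp(-s_i) * L_i + s_i. The class name `LearnedLossScaler` and the plain gradient step are hypothetical.

```python
import math
from typing import List


class LearnedLossScaler:
    """Illustrative learned loss balancing (not confirmed as the paper's method)."""

    def __init__(self, num_losses: int, lr: float = 0.1) -> None:
        self.log_scales = [0.0] * num_losses  # one learnable s_i per objective
        self.lr = lr

    def combine(self, losses: List[float]) -> float:
        """Total objective: sum_i exp(-s_i) * L_i + s_i."""
        return sum(math.exp(-s) * loss + s for s, loss in zip(self.log_scales, losses))

    def step(self, losses: List[float]) -> None:
        """Gradient descent on the log-scales only.
        d/ds [exp(-s) * L + s] = 1 - exp(-s) * L."""
        grads = [1.0 - math.exp(-s) * loss for s, loss in zip(self.log_scales, losses)]
        self.log_scales = [s - self.lr * g for s, g in zip(self.log_scales, grads)]
```

For fixed loss values, each s_i settles at log(L_i), so every weighted term exp(-s_i) * L_i approaches 1 and a large-magnitude objective cannot swamp a small one. In real training the losses change at every step and the scales are updated jointly with the model parameters.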

Theoretical and Practical Implications

This unified approach positions on-device RAG as practical for privacy-sensitive, cost-constrained, and offline-first applications, where full-scale server infrastructure is infeasible or undesirable. The elimination of dual-representation storage and the resulting reduction in both context-length and disk requirements suggest immediate utility in mobile, edge, or federated deployment settings.

On the theoretical side, the work demonstrates that end-to-end joint training can resolve what were previously considered "orthogonal" representation demands between retrieval and generative models. The results also support the emerging view that retrieval-augmented architectures can benefit from constraining representation spaces with adversarial or contrastive objectives.

Future Directions

Future developments may include scaling this architectural motif to even larger model backbones, integrating further compression layers, or extending the approach for multi-modal or event-driven RAG. The ECG paradigm provides a blueprint for "everything-models"—unifying diverse downstream utility within a compact resource envelope, and may drive a shift toward on-device intelligence as hardware constraints relax incrementally but privacy concerns remain.

Conclusion

This work establishes that unified retrieval and generative representations, co-trained in a single model, are feasible and effective for on-device RAG. In the reported experiments, the ECG framework improves the accuracy/storage-efficiency trade-off over disjoint or dual-representation approaches, paving the way for privacy-preserving, low-latency, and resource-aware RAG deployments. This represents a significant consolidation of model architecture that others in the field may adopt for next-generation on-device AI (2604.14403).

Explain it Like I'm 14

Explaining: “A Unified Model and Document Representation for On-Device Retrieval‑Augmented Generation”

What’s this paper about?

This paper is about making AI assistants smarter and more private on your own device (like a phone or laptop). It focuses on a technique called Retrieval‑Augmented Generation (RAG), where an AI answers questions by first “looking things up” in a collection of documents and then writing a response. The authors introduce a new model, called ECG (Embed, Compress, Generate), that uses the same compact “document fingerprints” both to find information and to feed it to the AI for answering—all on the device, without needing the internet.

What questions are the researchers trying to answer?

The paper focuses on simple but important goals:

  • Can we run RAG entirely on a device to improve privacy and reduce lag?
  • Can we store less information on the device and still find the right documents?
  • Can we shrink what the AI has to read (its “context”) without hurting answer quality?
  • Is it possible to use one shared representation for both finding documents and helping the AI write answers?

How did they do it? (Methods in everyday language)

Think of your phone as a backpack with limited space (memory and storage). Traditional RAG keeps separate sets of items: one set for searching (retrieval) and another for reading (the full text). That takes a lot of room. This paper combines them so the same compact items are used for both purposes.

Here are the main ideas, with simple analogies:

  • Embeddings as fingerprints: Instead of storing full sentences, the model turns text into short numeric “fingerprints” (called embeddings). These are like tiny codes that capture meaning.
  • Multi‑vector representations: Instead of one fingerprint per document, it can keep a few—like multiple snapshots—to capture more detail.
  • MaxSim matching: To find relevant documents, it compares a question’s fingerprints to each document’s fingerprints and takes the best match—like finding the most similar snapshot pair.
  • Context compression: The model turns documents into a small set of vectors that stand in for the original text. This is like compressing a whole chapter into a handful of super‑compact notes that the AI can still understand.
  • One set of vectors for both jobs: The same compact vectors are used first to search and then to guide the AI’s writing. This saves storage space and speeds things up.

How they trained it (two stages):

  1. Self‑supervised pretraining (learning without labels):
    • The model practices turning text into compact vectors and then reconstructing text from those vectors.
    • It also learns which pieces of text go together by making matching pairs more similar and random pairs less similar (a “contrastive” objective).
    • Sometimes it compresses and rebuilds the same text; other times it uses neighboring halves of a passage, so it learns to help the AI predict surrounding content.
  2. RAG fine‑tuning (learning to answer questions):
    • The model learns from teacher systems: one that’s good at writing answers and another that’s good at ranking documents.
    • It’s trained to produce answers similar to the teacher (distillation) and to rank documents similarly to a top retriever (contrastive + ranking losses).
    • The authors carefully balance these objectives so neither overwhelms the other, using learned scaling (think: automatic volume knobs) to keep training stable.

They tested the approach with two small, on‑device‑friendly LLMs (SmolLM‑v2 135M and Gemma 3 1B) on popular question‑answer datasets (Natural Questions and TriviaQA).

What did they find, and why does it matter?

Main results:

  • Much smaller context, same or better answers: With about one‑tenth the usual context size, the ECG model matches or beats standard RAG systems that read far more text. That means the AI needs to “read” much less while maintaining accuracy.
  • Less storage on your device: Because the same vectors are used for both retrieval and generation, storage is cut roughly in half compared to setups that keep separate vectors for each task.
  • Stronger than other compression methods: ECG consistently outperforms systems that compress documents and then feed them to a separate reader. Jointly training the shared vectors for both searching and answering seems to make them more meaningful and robust.
  • Retrieval quality is excellent: Even when paired with standard readers, ECG’s retrieval (the “search” part) improves results, showing that its document matching is very strong.
  • Careful training matters: Key training tricks—like contrastive learning, answer distillation from a teacher, and dynamic loss scaling—are crucial. Without them, performance drops significantly.

Why it matters:

  • Privacy: Everything can run on your device, so your personal files (like messages, medical notes, or bank statements) don’t have to leave it.
  • Speed and reliability: No internet required, and less data to load means faster responses and lower memory use.
  • Battery and storage savings: Smaller context and shared vectors reduce both active memory (what the model uses when thinking) and storage (what stays on disk).

What’s the bigger impact?

This work shows a practical path to private, fast, and efficient AI assistants that can search your local documents and answer questions—all on your device. By unifying searching, compressing, and answering in a single model with shared representations, it sets a blueprint for compact “everything‑models” that do more with less. This could make powerful, personal AI tools more accessible, safer for sensitive data, and kinder to your device’s memory, storage, and battery.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps and unresolved questions that future work could address:

  • Lack of realistic on-device evaluation: no benchmarks on personal corpora (e.g., messages, emails, offline documents); need datasets and protocols reflecting privacy, heterogeneity, and on-device constraints.
  • Missing on-device systems metrics: no measurements of end-to-end latency, energy/battery impact, thermal behavior, or peak memory/KV-cache footprint on real devices (phones/tablets/edge SoCs).
  • Corpus scale and retrieval efficiency: evaluation uses pooled subsets; scalability to 100k–10M+ documents, ANN indices, and end-to-end recall/latency on-device remains untested.
  • Incomplete disk/storage accounting: reported “Disk Space / 10k” excludes index structures (e.g., HNSW graphs, inverted lists); need full-system storage comparisons (embeddings + index + metadata) versus BM25 and late-interaction baselines.
  • Multi-document conditioning weakness: ECG accuracy declines with more than one document; methods for stable multi-doc aggregation, multi-hop reasoning, and cross-document evidence fusion are not developed.
  • Dynamic length control untested: although variable-length embeddings are claimed, RAG fine-tuning fixes t; open questions on adaptive selection per doc/query, budget allocation, and scheduler policies given storage/latency targets.
  • Compression–quality trade-off mapping: limited exploration of EM versus number of vectors; need comprehensive curves across broader budgets, with translation to actual bytes and KV-cache memory in MB.
  • Retrieval quality isolation: no standalone retrieval metrics (Recall@k, MRR, nDCG) on full corpora; difficult to disentangle retriever and reader contributions and failure modes.
  • Teacher dependence and robustness: sensitivity to choice/quality of teacher reader and teacher ranker is unknown; need ablations across stronger/weaker teachers and cross-encoders vs bi-encoders.
  • Training stability and sensitivity: heavy reliance on learned loss scaling; systematic study of convergence, robustness across seeds, hyperparameter sensitivity, and catastrophic forgetting is missing.
  • Domain transfer: only Wikipedia/NQ/TriviaQA; effectiveness on finance, medical, legal, code, and noisy OCR text corpora is untested.
  • Multilingual and cross-lingual generalization: English-only evaluation; need assessments for multilingual corpora and cross-lingual retrieval/generation under shared representations.
  • Privacy/security of stored vectors: no analysis of inversion/membership inference risks on unified embeddings; need threat modeling and mitigations (e.g., DP, encryption, secure enclaves).
  • Incremental updates and model upgrades: procedures for on-device re-indexing, backward compatibility of stored vectors across model updates, migration cost, and drift management are unspecified.
  • Vector compression and quantization: impact of 8/4-bit quantization, PQ/IVF/HNSW settings, and mixed-precision on both retrieval and generation quality is not studied.
  • Offline preprocessing costs: time/energy to embed/compress entire user corpora on-device, scheduling while charging, and thermal throttling implications are not measured.
  • Hardware diversity: no results across CPUs/NPUs/GPUs on common mobile chipsets; portability, accelerator utilization, and operator support (e.g., attention kernels for continuous vectors) are unclear.
  • Long/complex queries and multi-hop tasks: performance on multi-hop QA or compositional tasks that require multiple pieces of evidence is not evaluated; need training/inference mechanisms for evidence chaining.
  • Robustness to retrieval noise/adversaries: behavior with near-misses, adversarial passages, and domain noise is uncharacterized; need stress tests and robustness benchmarks.
  • Faithfulness and attribution: only Exact Match is reported; measures of faithfulness to retrieved sources, source attribution, calibration, and hallucination rates are missing.
  • Interpretability of compressed contexts: no mechanism to map continuous vectors back to salient spans/facts; need tools for explanation and evidence tracing.
  • Baseline coverage: head-to-head comparisons with recent state-of-the-art retrievers/compressors (beyond the chosen GTE/BM25/ColBERT and COCOM-like setups) are absent.
  • Similarity function choices: only mean-pooled MaxSim is used; alternatives (learned late-interaction, attention pooling, hybrid lexical–vector) and their on-device cost/accuracy trade-offs are unexplored.
  • Architectural hyperparameters: m = d and L = 4 are fixed; the impact of retrieval dimension m, projection depth/width, parameter sharing between θ_comp and θ_ret, and cross-attention variants is not examined.
  • Handling variable-length/long documents: chunking policies, vector allocation per chunk, and scheduling across heterogeneous document lengths are not analyzed.
  • Detailed failure analysis: no qualitative or categorical error breakdown (question types, entity categories, reasoning steps) to guide targeted improvements.
  • Reproducibility and release: code/models/datasets are not released (as stated), limiting verification and adoption; standardized training/eval scripts and seeds would aid reproducibility.
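
The standalone retrieval metrics mentioned in the list above (Recall@k and MRR) are straightforward to compute once ranked results and relevance labels are available. A minimal sketch, with hypothetical function names and document IDs:

```python
from typing import List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Sequence[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

Reporting these alongside EM would let retriever errors be separated from reader errors, which is the gap the list item points to.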

Practical Applications

Overview

The paper introduces ECG (Embed–Compress–Generate), a unified on-device RAG model that uses the same multi-vector representations for both retrieval and compressed context for generation. Practically, this yields:

  • Up to ~10× reduction in context (KV-cache) tokens while matching standard RAG reader accuracy, and up to 16× context compression in some settings.
  • Approximately half the storage versus separate retrieval and compression representations, with disk usage similar to late-interaction retrievers (e.g., ColBERT) but without a second set of vectors for the reader.
  • Full offline, privacy-preserving deployment with lower latency and no server costs.
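
To make the KV-cache benefit in the first bullet concrete, the sketch below estimates cache size from the number of context positions. The model configuration (30 layers, 3 key/value heads, head dimension 64, fp16) and the context lengths (320 vs. 32 positions) are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(
    context_positions: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # fp16, an assumption
) -> int:
    """Memory for cached keys and values (hence the factor 2) across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * context_positions * bytes_per_value


# Hypothetical small-model configuration and context sizes.
full = kv_cache_bytes(context_positions=320, num_layers=30, num_kv_heads=3, head_dim=64)
compressed = kv_cache_bytes(context_positions=32, num_layers=30, num_kv_heads=3, head_dim=64)
print(f"full: {full / 1e6:.2f} MB, compressed: {compressed / 1e6:.2f} MB")
```

Because the cache grows linearly with the number of positions, a 10× shorter context yields a 10× smaller cache for the document portion of the prompt, whatever the model configuration.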

Below are concrete applications and what they require in practice.

Immediate Applications

These can be deployed now using small, quantized LMs and mobile/edge inference runtimes, with preprocessing scheduled opportunistically (e.g., while charging).

  • On-device personal document assistant
    • Use cases: Q&A and search over emails, PDFs, notes, scanned documents, calendars, and contacts on phones/laptops without sending data to cloud.
    • Sectors: Consumer software, productivity.
    • Tools/products/workflows:
      • A mobile SDK that builds a unified ECG index of user documents.
      • Background indexer that compresses and stores multi-vector embeddings while the device is charging.
      • In-app “Ask my documents” features with strict context budgets.
    • Dependencies/assumptions:
      • Device supports efficient on-device inference (e.g., NNAPI/Metal/Vulkan) and quantization (4–8-bit).
      • Local storage is encrypted; permissions to index user files; multilingual support may vary.
  • Enterprise endpoint RAG for sensitive data
    • Use cases: Local Q&A over confidential docs on managed laptops/desktops (e.g., finance reports, legal briefs, strategy docs) with zero data exfiltration.
    • Sectors: Enterprise software, legal, finance.
    • Tools/products/workflows:
      • MDM-managed ECG agent that indexes designated folders/repositories locally.
      • Policy-controlled context budgets to bound memory use on endpoints.
    • Dependencies/assumptions:
      • IT policy allows local embeddings at rest; audit/logging in place.
      • Model and index encrypted; periodic re-index scheduling to handle updates.
  • Clinical note and policy lookup on clinician devices
    • Use cases: Offline, private search over local subsets of EMR extracts, clinical protocols, and drug formularies on tablets/workstations at the edge.
    • Sectors: Healthcare.
    • Tools/products/workflows:
      • On-device ECG index built from locally cached, de-identified notes or policy documents.
      • Quick-answer UI embedded in EHR viewers or mobile companion apps with strict context budgets.
    • Dependencies/assumptions:
      • Compliance with HIPAA/GDPR; de-identification where needed.
      • Hospital IT approval for on-device indices; regular data refresh procedures.
  • Field and frontline operations assistant (offline)
    • Use cases: Field technicians querying manuals, SOPs, and maintenance logs on rugged devices in low-connectivity environments.
    • Sectors: Manufacturing, energy, utilities, logistics.
    • Tools/products/workflows:
      • Local ECG index packaged with device; incremental updates shipped over secure channels when connectivity returns.
      • Top-1 or Top-2 retrieval to minimize latency and battery use.
    • Dependencies/assumptions:
      • Device battery and thermals compatible with short bursts of inference.
      • Controlled corpus sizes; scheduled compression during off-hours.
  • Privacy-first customer support copilot (desktop)
    • Use cases: Agents querying local caches of knowledge bases, past tickets, and SOPs without sending customer data to external services.
    • Sectors: Customer support, BPO.
    • Tools/products/workflows:
      • Desktop app that maintains unified embeddings for support materials and private ticket archives.
      • Context budget enforcement to maintain consistent latency on commodity CPUs.
    • Dependencies/assumptions:
      • Data governance to ensure local caching is permitted.
      • Periodic index rebuilds orchestrated by IT.
  • Browser/reader-side “client-only” webpage and PDF assistant
    • Use cases: Users chat with open PDFs/webpages locally; embeddings stored per-document in a local vault.
    • Sectors: Consumer productivity, education.
    • Tools/products/workflows:
      • Browser extension/desktop reader that builds ECG vectors for the open document.
      • Single-document Q&A optimized for ECG’s strong top-1 behavior.
    • Dependencies/assumptions:
      • Memory budgets tuned for embedded devices; ephemeral indices for temporary docs.
  • Secure messenger/email “smart search”
    • Use cases: On-device RAG over message/email history while preserving privacy.
    • Sectors: Communications.
    • Tools/products/workflows:
      • Local unified embedding store per-account.
      • Scheduled, incremental indexing of new threads; user-controlled privacy settings.
    • Dependencies/assumptions:
      • Background processing allowances; device encryption.
  • Developer-facing on-device RAG toolkit
    • Use cases: App developers integrate ECG-based retrieval+generation with a single vector store per document.
    • Sectors: Software/SDKs.
    • Tools/products/workflows:
      • Unified multi-vector index format; MaxSim search with mean-pooled query scoring.
      • Context budget controller to keep KV-cache within device limits.
    • Dependencies/assumptions:
      • Availability of optimized kernels for MaxSim and multi-vector search.
      • Integration with mobile inference runtimes and schedulers.
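
Several of the applications above rely on enforcing a context budget. A minimal sketch of such a controller is shown below; it assumes the budget is counted in stored vectors (following the definition of context budget as the number of document representations consumed by the generator) and that documents arrive already ranked. The greedy, stop-at-first-overflow policy and the name `select_within_budget` are hypothetical design choices, not something the paper prescribes.

```python
from typing import List, Tuple


def select_within_budget(
    ranked_docs: List[Tuple[str, int]], budget: int
) -> Tuple[List[str], int]:
    """ranked_docs holds (doc_id, num_vectors) pairs, best-ranked first.
    Take documents in rank order until the next one would exceed the budget."""
    chosen: List[str] = []
    used = 0
    for doc_id, num_vectors in ranked_docs:
        if used + num_vectors > budget:
            break  # stop here so a lower-ranked doc never displaces a higher-ranked one
        chosen.append(doc_id)
        used += num_vectors
    return chosen, used
```

For example, with three retrieved documents of 16 vectors each and a budget of 32, the controller keeps the top two and reports 32 vectors used.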

Long-Term Applications

These require further research, scaling, or integration work (e.g., multi-modal extensions, standardized index formats, federation, regulatory frameworks).

  • OS-level personal knowledge vault
    • Vision: System service that maintains a unified, encrypted embedding store for all user content (docs, messages, images), shared across apps via permissions.
    • Sectors: Consumer OS, platform ecosystems.
    • Potential tools/products/workflows:
      • OS APIs exposing query-to-answer pipelines with dynamic context budgets and strict privacy controls.
    • Dependencies/assumptions:
      • Platform support for cross-app permissions, background indexing, and secure enclaves.
      • Standardized multi-vector index format and on-device search APIs.
  • Federated and privacy-preserving personalization
    • Vision: Federated fine-tuning of ECG on-device to adapt to user domains, with differential privacy guarantees.
    • Sectors: Consumer software, enterprise.
    • Potential tools/products/workflows:
      • Federated training pipelines that update retrieval/compression jointly without sharing raw data.
    • Dependencies/assumptions:
      • Efficient on-device training or adapter tuning; DP accounting; robust aggregation.
  • Multi-modal on-device RAG with unified representations
    • Vision: Extend ECG to images, audio, and structured data where the same vectors drive retrieval and condensed context for generation (e.g., maintenance diagrams, medical images).
    • Sectors: Healthcare, manufacturing, robotics, education.
    • Potential tools/products/workflows:
      • Multi-modal encoders that produce ECG-compatible vectors; unified MaxSim search across modalities.
    • Dependencies/assumptions:
      • Multi-modal pretraining data; device acceleration for vision/audio models.
  • Edge robotics and autonomous systems diagnostics
    • Vision: Robots and vehicles locally query logs, manuals, and sensor summaries for self-diagnostics and procedure guidance when offline.
    • Sectors: Robotics, automotive, aerospace, energy.
    • Potential tools/products/workflows:
      • Onboard ECG index over maintenance logs and SOPs; minimal-context guidance generation.
    • Dependencies/assumptions:
      • Real-time constraints; ruggedized hardware; safety and certification requirements.
  • Regulated data governance “on-device by default”
    • Vision: Policies and compliance frameworks that mandate on-device RAG for sensitive workflows (e.g., PHI/PII, insider materials), reducing cloud exposure.
    • Sectors: Government, healthcare, finance, legal.
    • Potential tools/products/workflows:
      • Compliance toolkits and audits centered on local ECG indices and encrypted storage.
    • Dependencies/assumptions:
      • Clear regulatory guidance; standardized certifications for on-device AI.
  • Cross-device encrypted index sync
    • Vision: Synchronize unified embeddings across user devices via end-to-end encryption with selective, minimal re-indexing.
    • Sectors: Consumer and enterprise productivity.
    • Potential tools/products/workflows:
      • Incremental, secure sync protocols; conflict resolution for index updates.
    • Dependencies/assumptions:
      • Key management; bandwidth-efficient multi-vector syncing; privacy-preserving deduplication.
  • Verticalized ECG models (domain-specialized)
    • Vision: Unified on-device RAG models fine-tuned for legal, finance, medical, or code domains, retaining shared representation efficiency.
    • Sectors: Legal, finance, healthcare, software engineering.
    • Potential tools/products/workflows:
      • Distillation pipelines from strong teacher readers/rankers per vertical; curated domain corpora.
    • Dependencies/assumptions:
      • Access to compliant training data; evaluation sets beyond open-domain QA.
  • Client-side web and app indexing at scale
    • Vision: Browsers or enterprise portals locally index visited pages/apps for offline semantic search with minimal storage overhead.
    • Sectors: Enterprise portals, knowledge management, education.
    • Potential tools/products/workflows:
      • Per-user ECG indexers integrated into browsers/IDEs; purge/retention policies.
    • Dependencies/assumptions:
      • Storage quotas; user consent and data minimization; multilingual coverage.
  • Co-design with hardware and runtimes
    • Vision: Hardware features and runtimes optimized for multi-vector MaxSim, KV-cache control, and unified vector storage.
    • Sectors: Semiconductors, mobile platforms.
    • Potential tools/products/workflows:
      • NPU kernels for MaxSim; KV cache-aware schedulers; power-aware context budgeting.
    • Dependencies/assumptions:
      • Vendor adoption; standardization of multi-vector APIs and formats.
  • Standardized multi-vector index ecosystem
    • Vision: Open formats and libraries for ECG-style unified indices across vendors and apps.
    • Sectors: Developer tooling, open-source.
    • Potential tools/products/workflows:
      • Interoperable indices; pluggable MaxSim search backends; benchmarking suites for on-device RAG.
    • Dependencies/assumptions:
      • Community consensus; compatibility with existing vector DBs.

Notes on feasibility across applications:

  • Strengths leveraged: 10×–16× context compression, single representation for retrieval and generation, strong top-1 retrieval performance, offline privacy.
  • Key dependencies: Efficient on-device runtimes, quantization, background indexing windows, encrypted storage, policy/IT approval, domain and language coverage, and availability of device NPUs/GPUs.
  • Risks/assumptions: Benchmark gains on open-domain QA may not directly translate to all domains; multilingual performance may lag; frequent corpus churn increases re-index cost; training method relies on teacher models (for development, not inference).
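To make the memory side of the compression figure concrete, the following back-of-the-envelope sketch estimates KV cache size before and after a 10× reduction in context length. The model dimensions and the 2,000-token context are hypothetical values chosen for illustration, not taken from the paper, and the sketch assumes each compressed vector occupies one cache position like an ordinary token.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `num_tokens` positions.

    The leading factor 2 accounts for storing both a key tensor and a
    value tensor in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical small decoder (NOT the paper's model): 24 layers,
# 8 key/value heads, head dimension 64, 16-bit cache entries.
config = dict(num_layers=24, num_kv_heads=8, head_dim=64, bytes_per_value=2)

full_context_tokens = 2000   # assumed size of the uncompressed retrieved context
compression = 10             # lower end of the 10x-16x range quoted above
compressed_vectors = full_context_tokens // compression

full = kv_cache_bytes(full_context_tokens, **config)
compressed = kv_cache_bytes(compressed_vectors, **config)
print(f"full: {full / 2**20:.1f} MiB, compressed: {compressed / 2**20:.1f} MiB")
```

Because the cache grows linearly with the number of positions, the saving tracks the compression ratio directly; the query and generated answer tokens still need their own cache entries, which this estimate leaves out.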

Glossary

  • ACC-RAG: A context compression method that adapts compressed representations for RAG tasks. "Our approach resembles COCOM and ACC-RAG but crucially the compressed context representations are also meaningful as retrieval embeddings."
  • auto-encoding: Training approach where a model learns to compress and reconstruct inputs, often used for representation learning. "use various approaches such as auto-encoding or cross-attention to compress longer contexts for LLMs."
  • BM25: A classic probabilistic IR ranking function based on term frequency and document length normalization. "BM25 \citep{bm25}"
  • ColBERT: A late-interaction dense retrieval model that compares token-level embeddings using MaxSim. "we adopt a mean-pooled variant of the MaxSim similarity introduced by ColBERT"
  • compartmentalized indexes: Index design that partitions or structures indices to better fit on-device constraints. "compartmentalized indexes \citep{mobile_rag}"
  • context budget: The fixed number of document representations the generator is allowed to consume. "We define a context budget as the number of document representations, either representing tokens or compressed vectors, that are used by the generator."
  • cross-attention: An attention mechanism that conditions one sequence on another (e.g., query attending to context). "auto-encoding or cross-attention to compress longer contexts for LLMs."
  • decoder transformer: An autoregressive transformer that generates text token-by-token, here adapted to also produce embeddings. "To adapt a pretrained decoder transformer $\theta_{LM}$ to also act as an encoder"
  • dense retrieval: Retrieval using learned continuous vector embeddings rather than lexical term matching. "A dense retrieval model based on ModernBERT"
  • dynamic loss scaling: Adjusting loss components (e.g., temperatures or scales) during training to balance multi-task objectives. "Removing dynamic loss scaling—the learned temperature and teacher score scale for the retrieval losses—"
  • Exact Match (EM): An evaluation metric that checks whether the generated answer exactly matches a reference. "Exact Match (EM) accuracy metric."
  • gated residual connection: A network connection that combines residual pathways with a learned gate to stabilize training. "using a gated residual connection and linear layers."
  • gist tokens: Learnable tokens that summarize or compress salient information for conditioning a model. "'gist' tokens"
  • hard compression: Techniques that filter or rewrite text to remove irrelevant tokens before generation. "Hard compression filters or rewrites text to remove irrelevant tokens."
  • hard negatives: Non-relevant documents that are difficult to distinguish from positives, used to strengthen retrieval training. "a positive document, hard negatives, and teacher relevance scores"
  • InfoNCE contrastive loss: A contrastive objective that treats the positive as the correct class in a temperature-scaled softmax over the positive and the negatives, and minimizes the resulting cross-entropy. "we apply an InfoNCE contrastive loss"
  • KL divergence: A measure of how one probability distribution diverges from another, used here for distillation. "we minimize the KL divergence between our model (conditioned on compressed contexts) and the teacher reader"
  • knowledge distillation: Training a student model to match a teacher model’s output distribution. "we replace standard next-token prediction with knowledge distillation."
  • KV cache: Cached key/value tensors that let a transformer avoid recomputing attention keys and values for earlier positions during generation; its memory footprint grows with context length. "manage KV cache and attention memory usage"
  • late-interaction retrieval model: A retrieval approach comparing token-level representations at query time (instead of a single vector). "Compared to standard RAG with a late-interaction retrieval model,"
  • Layer Normalization: A normalization technique applied across features to stabilize and accelerate training. "and $\text{LN}$ represents Layer Normalization."
  • lexical filtering: Using lexical cues (e.g., term overlap) to prune candidate documents before more expensive steps. "lexical filtering \citep{pocket_rag}"
  • Margin MSE loss: A loss that aligns model scores with teacher margins using mean squared error. "alongside a Margin MSE loss"
  • MaxSim similarity: A similarity function that, for each query token, takes the maximum dot product over document token embeddings and then aggregates these maxima across query tokens (a sum in ColBERT). "we adopt a mean-pooled variant of the MaxSim similarity introduced by ColBERT"
  • multi-vector representations: Representing a query or document with multiple embeddings to preserve fine-grained information. "shared multi-vector representations."
  • on-device RAG: Performing retrieval and generation locally on a user’s device to reduce latency and protect privacy. "On-device RAG addresses these issues"
  • parametric knowledge: Information stored implicitly in the model’s parameters rather than in external documents. "This baseline leverages the model's parametric knowledge"
  • pooling approach: Reducing a corpus or candidate set by pooling likely relevant documents for efficient evaluation. "we use a pooling approach to reduce the large corpus size"
  • projection blocks: Additional neural layers mapping hidden states into task-specific embedding spaces. "two projection blocks"
  • quantization: Reducing numerical precision of model weights/activations to lower memory and compute costs. "speculative decoding and quantization"
  • reranking: Reordering candidate documents using a more accurate (often more expensive) scoring model. "limited to reranking settings"
  • self-supervised pretraining: Training without manual labels by constructing predictive tasks from raw text. "a self-supervised pretraining phase"
  • soft compression: Encoding context into continuous vectors rather than editing the surface text. "Soft compression encodes context into continuous vectors."
  • soft prompts: Learnable continuous prompt vectors prepended to inputs to steer model behavior. "learned soft prompts"
  • speculative decoding: A generation technique that uses draft proposals to accelerate decoding. "speculative decoding and quantization"
  • teacher reader: The teacher generative model whose output distribution guides the student via distillation. "the teacher reader (conditioned on uncompressed documents)"
  • teacher scoring model: A teacher model that provides query-document relevance scores for distillation. "a teacher scoring model to provide query-document relevance scores."
  • variable-length multi-vector embeddings: Embeddings with a flexible number of vectors per item to trade off accuracy and storage. "our model uses variable-length multi-vector embeddings to flexibly control retrieval and storage costs."
  • xRAG: A method that uses fixed retrieval embeddings as context for generation but separates retrieval and generation models. "whereas xRAG \citep{xrag_extreme_context_compression} uses fixed retrieval embeddings as context for generation"
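The two retrieval losses defined in the glossary can be written down compactly. The sketch below gives their textbook forms for a single query with one positive and several negatives; it is not the paper's training code. In particular, the temperature here is a fixed argument, whereas the paper describes a learned temperature and teacher score scale, and all function and parameter names are illustrative.

```python
import math
from typing import Sequence


def info_nce(pos_score: float, neg_scores: Sequence[float],
             temperature: float = 1.0) -> float:
    """Negative log-softmax probability of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


def margin_mse(student_pos: float, student_negs: Sequence[float],
               teacher_pos: float, teacher_negs: Sequence[float]) -> float:
    """Mean squared error between student and teacher score margins.

    A margin is (positive score - negative score); the student is trained
    to reproduce the teacher's margin for each negative.
    """
    if not student_negs or len(student_negs) != len(teacher_negs):
        raise ValueError("need matching, non-empty lists of negative scores")
    errors = [
        ((student_pos - s_neg) - (teacher_pos - t_neg)) ** 2
        for s_neg, t_neg in zip(student_negs, teacher_negs)
    ]
    return sum(errors) / len(errors)


# Hypothetical scores for one query: one positive and two hard negatives.
contrastive = info_nce(pos_score=2.0, neg_scores=[0.5, 0.1], temperature=0.5)
distill = margin_mse(student_pos=2.0, student_negs=[0.5, 0.1],
                     teacher_pos=3.0, teacher_negs=[0.5, 0.0])
```

InfoNCE only needs relevance labels, while Margin MSE needs teacher scores for the same positive and negatives, which is why the training data pairs hard negatives with teacher relevance scores.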
