jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition
Abstract: In this work, we introduce frozen-encoder model composition, a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for an LLM, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. Our method is to extend the two Jina Embeddings v5 Text models to support additional media by adding encoders for images and audio. The backbone text embedding models and the added non-text media encoders remain frozen. We trained only the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the LLM remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that this approach produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.
Explain it Like I'm 14
What is this paper about?
This paper shows a simple way to make one computer model understand many kinds of data (text, images, sounds, and videos) and describe all of them in the same “language” of numbers. Think of it like giving different types of files (a sentence, a photo, a song, a clip) a shared set of coordinates on a map so they can be compared and searched together. The authors call their models jina-embeddings-v5-omni, and they add images, audio, and video to an already-strong text model without changing how the text part behaves.
What questions were the researchers trying to answer?
- Can we extend a strong text-only model to handle images, audio, and video without rewriting or retraining the whole thing?
- Can we keep the text performance exactly the same so existing text search systems don’t break?
- Can we do this with much less training time and compute than usual, yet still get results close to larger, state-of-the-art multimodal models?
How did they approach the problem?
The big idea: frozen towers and tiny adapters
Imagine you already have a great “text brain” that turns words into numbers (embeddings). You add “eyes” (an image encoder) and “ears” (an audio encoder), but you keep all these parts “frozen,” meaning you don’t change them. Instead, you train only tiny connector pieces—like plug adapters—that convert the image and audio outputs into the kind of input the text brain expects. Then the text brain turns everything (text, pictures, sounds, video frames) into embeddings in the same space.
- “Frozen” means the big parts aren’t changed.
- “Projectors” are small layers that act like plug adapters, aligning image/audio features to the text model’s input.
- Only about 0.35% of all the model’s weights are trained—much less than retraining the whole model.
They built two sizes of the model (a smaller “Nano” and a bigger “Small”), both based on Jina’s strong text embedding models. For images and audio, they reused existing encoders (from Qwen3.5 and Qwen2.5-Omni) and trained just the projectors and a few special tokens to mark where images/audio go in the input.
How they trained it (in simple terms)
They trained the adapters by showing the model pairs that belong together—like a picture and its matching caption, or an audio clip and its description—and asking it to pull the right pairs close together and push wrong pairs apart in the embedding space. This is like a classroom game where each student has to find their matching partner among many others. The method is efficient and uses standard techniques, but the key point is: only the tiny adapters were trained, not the big “frozen” parts.
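The matching game described above corresponds to what the paper calls bidirectional in-batch InfoNCE. The following is a minimal pure-Python sketch of that kind of loss, not the authors' training code: the function names, the toy vectors, and the temperature value of 0.05 are all assumptions made for illustration (the source does not preserve the temperature the paper used).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(sim, temperature):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def bidirectional_info_nce(queries, targets, temperature=0.05):
    """Average of the query->target and target->query in-batch losses.

    temperature=0.05 is an illustrative assumption, not the paper's value.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    # Cosine similarity between every query and every target in the batch.
    sim = [[sum(a * b for a, b in zip(qi, tj)) for tj in t] for qi in q]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the reverse direction
    return 0.5 * (cross_entropy_rows(sim, temperature)
                  + cross_entropy_rows(sim_t, temperature))

# Toy batch: caption i belongs with image i (hypothetical 3-d embeddings).
captions = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
images_aligned = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.9]]
images_shuffled = [images_aligned[1], images_aligned[2], images_aligned[0]]

loss_aligned = bidirectional_info_nce(captions, images_aligned)
loss_shuffled = bidirectional_info_nce(captions, images_shuffled)
print(loss_aligned < loss_shuffled)
```

When the partners are correctly matched, the loss is low; when they are shuffled, it is high. Training nudges the adapters in whatever direction lowers this number.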
They also used a “nesting-doll” trick (called Matryoshka) so the embeddings can be chopped to shorter lengths while keeping most of their usefulness. That helps when you need faster or smaller storage.
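Chopping an embedding is simple on the consumer side: keep the first few numbers and rescale the result to unit length. The sketch below illustrates this under stated assumptions; `truncate_embedding` is a hypothetical helper and the 8-dimensional vector is made up, so neither comes from the paper or the released models.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values of an embedding and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]

# Hypothetical 8-d embedding; real models use far more dimensions.
full = [0.6, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1, 0.1]
short = truncate_embedding(full, 4)

print(len(short))                 # number of kept dimensions
print(sum(x * x for x in short))  # squared length after rescaling
```

Rescaling matters because similarity search usually assumes unit-length vectors; a raw prefix is shorter than length 1.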
What did they find?
- Strong overall performance for the size: The larger “Small” model performs about as well as or better than several bigger open models across text, image, audio, and video tests (especially good given its size).
- Text stays the same: Because the text brain wasn’t changed, text-only performance is identical to the original Jina text model. That means search systems built on those embeddings don’t need to be rebuilt.
- Very efficient training: Training just the tiny adapters was about 1.8 to 3.9 times faster and used less memory than training everything.
- Especially good at document images: The model did very well on tasks like searching scanned pages, forms, charts, and diagrams, which matter a lot in real-world search and RAG (retrieval-augmented generation).
- Video is decent but weaker: Video results improved, but not as much as images and audio. There’s room to grow here.
- Embeddings shrink gracefully: Thanks to the nesting-doll trick, image and audio embeddings still work pretty well even when made shorter (video embeddings lose more accuracy when shortened).
Why this is important: The model can place text, pictures, audio, and video into the same shared space of meaning. That makes it easier to build apps where you can, for example, search with a sentence and find matching images, or search with a picture and find matching paragraphs.
Why does this matter?
- Saves time and money: You can turn a great text model into a multimodal one by training only tiny adapters, instead of retraining everything from scratch.
- Keeps your systems stable: Because the text part doesn’t change, companies don’t need to redo their huge text indexes or disrupt existing search quality.
- Better multimodal search and RAG: Many real documents mix text, pictures, and tables. This approach helps systems find and use the right information across all these formats.
- A flexible recipe: The “frozen-encoder + small projector” idea can be reused to add more modalities (or swap encoders) later. It’s like a modular design that can grow without breaking what already works.
In short, the paper shows a practical, efficient way to give a strong text model “eyes and ears,” keeping text performance untouched while reaching competitive multimodal results—all with minimal extra training.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper introduces a frozen-encoder composition recipe and reports competitive results, but it also leaves several aspects unaddressed. The following list summarizes concrete gaps and open questions to guide future work:
- Video pathway and training:
- The model lacks a dedicated video encoder and explicit temporal modeling; video results lag markedly (MMEB-Video). What gains arise from adding temporal adapters (e.g., 1D/3D conv, TimeSformer), a video-specific encoder, or temporal pooling strategies?
- Training mixture details show image and audio emphasis; no evidence of video-paired training. How much does adding paired video–text data (and AV tasks) improve video retrieval/QA and moment retrieval beyond the current “per-frame” serialization?
- Frame sampling policy (including the number of per-frame slots) and temporal stride are unspecified and unablated; how do these choices impact accuracy/latency trade-offs?
- Projector design and scope:
- Only shallow fully connected projectors are explored. Do deeper connectors, cross-attention adapters, low-rank adapters, or token-wise alignment layers increase performance without large parameter cost?
- The study does not examine scaling laws for projector size (width/depth) or parameter budgets versus performance and training efficiency.
- Encoder choice and swap-ability:
- Vision/audio encoders are fixed to Qwen3.5 and Qwen2.5-Omni. There is no systematic comparison of alternative towers (e.g., SigLIP2 variants, CLAP/PaSST for audio, dedicated video encoders) matched by scale and tokenization.
- The paper does not demonstrate that encoders can be swapped while keeping the text backbone frozen without substantial regression, nor provide guidelines for choosing “scale-matched” encoders.
- Training recipe and optimization:
- Only in-batch InfoNCE with a fixed temperature and Matryoshka prefixes is used; sensitivity to the temperature, loss variants (e.g., multi-positive, hard negatives, cross-batch memory), or multi-granularity negatives is untested.
- The effect of curriculum, longer training, larger batch sizes, and gradual unfreezing schedules (beyond a small 2-stage audio variant) is not explored.
- It is unknown whether jointly training projectors for multiple modalities (co-training vs. per-modality/per-task runs) improves cross-modal alignment or harms geometry stability.
- Task adapters and cross-modal sharing:
- Projectors and delimiter embeddings are trained separately per task (retrieval, classification, clustering, text-matching). The trade-offs of shared vs. task-specific projectors/adapters and potential multi-task synergies or inter-task interference are unstudied.
- How to select the “right” task variant in production for heterogeneous corpora remains unclear; the impact of mis-specification on retrieval quality is unreported.
- Mixed-modality inputs and fusion:
- Although the serialization supports sequences mixing text, images, audio, and video, there is no evaluation on genuinely fused inputs (e.g., text+image queries/documents, audio+video fusion tasks) or analysis of cross-modal interaction effects in a single embedding.
- The ordering of modality segments (e.g., audio-before-video) and delimiter-token design are not ablated; their influence on fusion quality and stability is unknown.
- Geometry preservation beyond text-only:
- The paper guarantees identical text-only embeddings, but it does not assess whether adding non-text tokens alongside text perturbs text semantics in mixed inputs or introduces modality-dependent drift/hubness.
- No quantitative “modality gap” diagnostics (e.g., inter-modality angle distributions, anisotropy, hubness) are reported for the composed space.
- Matryoshka behavior:
- Video embeddings degrade sharply under Matryoshka truncation compared to text/image; strategies to improve temporal information retention at small prefix dimensions (e.g., prefix-aware video losses or temporal compression modules) are untested.
- Multilingual coverage:
- Multilingual training/evaluation for non-text modalities is limited (multilingual speech constitutes a small share of audio tokens). Robustness across low-resource languages and scripts for image–text and audio–text retrieval is underexplored.
- Data mixture transparency and reproducibility:
- The training mixture is summarized by token share but lacks a full dataset list, licensing, and preprocessing details (e.g., OCR standards, frame extraction), which impedes replication and bias assessment.
- While some MIEB tasks were removed due to contamination concerns, a broader decontamination strategy for all modalities and benchmarks is not described.
- Robustness and safety:
- No robustness tests to common corruptions or domain shifts (e.g., OCR noise, image blur, audio background noise), or adversarial examples, are presented.
- There is no analysis of fairness/bias across demographic attributes or content types, despite substantial use of medical imagery and multilingual audio.
- Inference efficiency and deployment:
- Training efficiency is measured, but inference latency, memory, and throughput across modalities (especially for multi-frame video and long audio) are not reported.
- Guidance on Matryoshka truncation at inference time (accuracy–latency trade-offs per modality) is absent.
- Document understanding scope:
- Strong page-level ViDoRe results are reported, but multi-page document embedding, page aggregation strategies, and text+image fusion in documents are not evaluated; the contribution of the visual vs. text path in doc tasks remains unclear.
- Token budget and slot allocation:
- The number of vision/audio slots and their allocation policy are fixed but unablated; the impact of slot count and spatial merge choices (2×2 unshuffle) on dense/localized tasks and efficiency is unknown.
- Audio clustering weakness:
- Audio clustering performance is notably poor; the causes (e.g., lack of clustering-oriented objectives, insufficient coverage of self-similar audio taxonomies) and remedies (contrastive-to-clustering losses, centroid supervision) are not investigated.
- Generalization to new modalities:
- The approach is claimed to be extensible, but no evidence is provided for adding other modalities (e.g., depth, 3D point clouds, sensor streams) that may have different token structures; connector designs and training recipes for such modalities remain open.
- Evaluation breadth:
- Dense/localized vision tasks (region retrieval, grounding), fine-grained audio tasks (speaker verification, event segmentation), and complex AV tasks (cross-modal grounding in video) are not covered, leaving capability boundaries unclear.
- Indexing stability at scale:
- While text geometry is preserved, the impact of introducing multimodal vectors into existing text-only indexes (e.g., distance scale calibration across modalities, ANN recall effects) is not measured.
- Hyperparameters and implementation details:
- Important choices (tokenization details, frame rate, audio windowing, normalization of tower outputs, delimiter initialization) are not fully specified or ablated, limiting actionable replication and optimization.
Practical Applications
Immediate Applications
The following applications can be implemented now with the paper’s released models and recipe. Each item summarizes what it enables, sectors, likely tools/workflows, and key assumptions or dependencies that affect feasibility.
- Drop-in multimodal upgrade for existing text search/RAG stacks
- What: Add image/audio (and basic video) search to production systems that already use Jina Embeddings v5 Text—without reindexing text, because text embeddings are provably unchanged.
- Sectors: Software, enterprise search, legal, finance, govtech.
- Tools/workflows: Reuse current vector DB and indices; add modality-specific ingestion (page images, screenshots, audio tracks); route queries through task-specific adapters (retrieval, classification, clustering, text-matching); dynamic loading to omit unused towers for lower latency/memory.
- Assumptions/dependencies: The deployment already uses Jina v5 Text; vector DB supports the same embedding dimension; new multimodal items must be ingested and indexed; adapter selection must match use case.
- Visual document retrieval for scans, forms, PPT/PDF pages, infographics
- What: Page-level retrieval that is robust to layout, small fonts, stamps, tables—shown strong on ViDoRe-like tasks.
- Sectors: Enterprise content management, e-discovery, healthcare admin, public sector records, insurance claims.
- Tools/workflows: Page tiling if needed, high-resolution rendering, image embedding and indexing, optional OCR for downstream generation but not required for retrieval, Matryoshka prefixes for storage/latency tradeoffs.
- Assumptions/dependencies: Image resolution/tiling quality; vector DB handles large collections; domain drift (e.g., forms/templates) may benefit from light projector finetuning.
- Multimodal customer support and bug triage
- What: Retrieve similar cases across tickets with screenshots, screen recordings (sampled frames), logs, and recorded calls; match knowledge base articles to user-provided media.
- Sectors: Software/SaaS, telecom, consumer electronics.
- Tools/workflows: Attachment ingestion; frame sampling for videos; audio projector path for call indexing; retrieval adapter; rerank with business heuristics; lightweight redaction for PII.
- Assumptions/dependencies: Video retrieval is weaker overall—prefer frame sampling + moment retrieval slices; audio clustering is weak—use classification/retrieval instead.
- E-commerce multimodal search and product discovery
- What: Text-to-image product search, reverse image search, multilingual queries; voice queries for product findability; cross-modal recommendations.
- Sectors: Retail, marketplaces, classifieds.
- Tools/workflows: Product catalog ingestion (images + text); attribute-aware reranking; Matryoshka two-stage search (small prefix for recall, full vector for rerank); A/B testing for CTR/CVR.
- Assumptions/dependencies: Domain-specific visuals (e.g., fashion details) may need projector/domain adapters; ensure licensing/compliance on user-uploaded media.
- Media asset management and newsroom search
- What: Retrieve B-roll, stock shots, and audio clips using text queries; index shot sequences via sampled frames; retrieve segments (moment retrieval relatively strong).
- Sectors: Media/entertainment, marketing, broadcast.
- Tools/workflows: Shot/scene detection; frame sampling; per-shot embedding; timeline store for segment offsets; editorial tooling for preview/trim/export.
- Assumptions/dependencies: General video retrieval beyond moments is weaker—combine with metadata/ASR; storage costs for dense frame sampling; cadence tuning per content type.
- Safety/compliance and content moderation (image/audio)
- What: Zero-shot or few-shot classification for policy categories (NSFW, violence, copyright, hate symbols; audio profanity, music detection); retrieval for policy exemplars.
- Sectors: Trust & Safety, ad platforms, UGC, enterprise DLP.
- Tools/workflows: Label taxonomies mapped to retrieval prompts or classifier heads; adapter selection (classification); calibration via thresholds; human-in-the-loop review.
- Assumptions/dependencies: Audio clustering is weak—favor classification/retrieval; domain-specific edge cases require curated exemplars; continuous monitoring for drift.
- Multilingual multimodal search
- What: Cross-language queries for images and audio perform competitively; useful for global catalogs and multilingual knowledge bases.
- Sectors: Global tech, publishing, education.
- Tools/workflows: Normalize multilingual queries; per-locale evaluation; language-aware rerankers; monitoring by language distribution.
- Assumptions/dependencies: Coverage varies by language; evaluate on target locales; ensure Unicode/segmentation consistency in pipelines.
- Privacy-first, on-prem, and edge indexing
- What: Small open-weight models (≈0.95B–1.57B) enable on-prem or edge deployment for sensitive content across text/images/audio.
- Sectors: Healthcare, finance, defense, mobile/edge.
- Tools/workflows: Containerized inference with dynamic tower loading; CPU/GPU selection per modality; Matryoshka truncation to shrink storage/latency; offline batch indexing windows.
- Assumptions/dependencies: Throughput targets vs. hardware; careful memory planning; internal MLOps for adapter versioning.
- Academic baselines and teaching labs for multimodal IR
- What: A reproducible recipe (projector-only training; frozen towers) and open checkpoints for research, coursework, and ablations.
- Sectors: Academia, ML education, open-source IR.
- Tools/workflows: Minimal GPU training for projector adapters; benchmark suites (MIEB/MMEB/MAEB/MMTEB); publish task-specific adapters.
- Assumptions/dependencies: Domain generalization may require curated mixtures; institutional compute constraints dictate batch/steps.
- Public records and policy workflows (FOIA, hearings, evidence)
- What: Unified search over scanned government documents, photos, and hearing/audio transcripts with minimal text reindexing.
- Sectors: Govtech, NGOs, investigative journalism.
- Tools/workflows: Page image embedding; audio ingestion; case file linking; audit logs; explainable retrieval (exemplar matches).
- Assumptions/dependencies: Auditable pipelines; redaction for sensitive fields; retention policies; legal constraints on media handling.
Long-Term Applications
These are promising directions that need further research, scaling, or productization—often highlighted by the paper’s limitations (e.g., weaker general video and audio clustering) or by extensions of the frozen-encoder composition method.
- High-fidelity video understanding at scale
- What: Stronger generic video retrieval, VQA, and event/temporal reasoning by unfreezing encoders and/or training video-specific projectors with larger mixtures.
- Sectors: Sports analytics, surveillance, education platforms, compliance review.
- Tools/workflows: Dense frame/clip encoders; improved moment localization; specialized video adapters; long-context memory.
- Assumptions/dependencies: Larger training budgets and datasets; careful prevention of geometry drift; evaluation on domain benchmarks.
- Robust audio analytics beyond retrieval/classification
- What: Improve audio clustering and fine-grained tagging (speaker/state, sound scenes) with continued projector+encoder training and domain mixtures.
- Sectors: Contact centers, smart devices, media platforms.
- Tools/workflows: Semi-supervised labeling; active learning; per-language calibration.
- Assumptions/dependencies: Multilingual coverage; privacy-preserving pipelines for voice.
- Domain-specific multimodal verticals via plug-in encoders
- What: Use frozen-encoder composition to add medical imaging, geospatial, industrial sensor/video, or scientific diagrams by training only small projectors.
- Sectors: Healthcare, energy, manufacturing, geospatial.
- Tools/workflows: Swap in domain encoders (e.g., DICOM, satellite); projector training on curated data; regulatory validation plans.
- Assumptions/dependencies: Licensing for encoders/data; clinical/mission-critical validation; governance for updates.
- Multimodal RAG agents grounded in frames and audio
- What: Retrieval pipelines that feed LLMs with relevant frames/audio segments for answering complex queries or producing captions/summaries.
- Sectors: Assistive tech, enterprise copilots, education.
- Tools/workflows: Frame/audio selection via embeddings; tool-augmented LLMs; latency-aware chunking; feedback loops.
- Assumptions/dependencies: Stronger video/audio recall; LLM alignment and safety; compute budgets for multi-hop retrieval.
- Progressive, cost-aware vector search with Matryoshka prefixes
- What: Tiered retrieval (coarse with small prefixes, refine with full vectors) to cut latency and storage at web scale; adaptive per-modality truncation.
- Sectors: Large-scale search engines, marketplaces, social platforms.
- Tools/workflows: Vector DB support for prefix-length search; scheduler for progressive refinement; telemetry-driven truncation policies.
- Assumptions/dependencies: Product metrics tolerant to small recall loss; DB feature support; careful monitoring.
- Adapter/projector marketplaces and continuous delivery
- What: Share and version task- or domain-specific projector/LoRA packs; hot-swap via dynamic loading without retraining the backbone.
- Sectors: MLOps platforms, SaaS providers, OSS ecosystems.
- Tools/workflows: Registry for adapters; compatibility testing; semantic versioning; rollback tools.
- Assumptions/dependencies: Governance for security and IP; standardized evaluation badges.
- Federated and cross-organization multimodal search
- What: Preserve existing text indices while enabling shared cross-modal spaces across organizations; federated retrieval over images/audio.
- Sectors: Enterprise alliances, research consortia, supply chains.
- Tools/workflows: Federated retrieval protocols; embedding normalization; privacy-preserving aggregation.
- Assumptions/dependencies: Legal agreements for media sharing; alignment on embedding versions; secure gateways.
- New modalities via the same composition recipe (3D, time series, tabular)
- What: Extend the backbone to 3D point clouds, CAD, biosignals, tabular charts by adding language-aligned encoders and small projectors.
- Sectors: Robotics, AEC, biotech, finance analytics.
- Tools/workflows: Choose encoders with language supervision; minimal projector training; task adapters for retrieval/classification.
- Assumptions/dependencies: Availability of language-aligned encoders; evaluation datasets; geometry preservation monitoring.
- Real-time multimodal monitoring and alerting
- What: Stream embeddings for live audio/video to trigger alerts (quality, safety, compliance) with retrieval-backed playbooks.
- Sectors: Operations, contact centers, manufacturing, safety.
- Tools/workflows: Sliding-window embeddings; streaming vector stores; incident correlation; on-call tooling.
- Assumptions/dependencies: Throughput/latency tuning; improved video performance; reliable edge inference.
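The tiered retrieval mentioned under "Progressive, cost-aware vector search with Matryoshka prefixes" can be sketched as a coarse pass over short prefixes followed by a rerank with full vectors. This is an illustrative brute-force sketch only: `two_stage_search`, its parameters, and the toy 4-dimensional corpus are assumptions, and a production system would precompute normalized prefixes and use an approximate nearest-neighbor index rather than sorting the whole corpus.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, corpus, prefix_dim, shortlist_size, top_k):
    """Coarse ranking on truncated vectors, then rerank the shortlist on full vectors."""
    # Stage 1: cheap similarity using only the first prefix_dim dimensions.
    q_short = normalize(query[:prefix_dim])
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: dot(q_short, normalize(corpus[i][:prefix_dim])),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: exact cosine similarity on the full vectors, shortlist only.
    q_full = normalize(query)
    reranked = sorted(
        coarse,
        key=lambda i: dot(q_full, normalize(corpus[i])),
        reverse=True,
    )
    return reranked[:top_k]

# Hypothetical 4-d corpus; document 3 looks close on the 2-d prefix only.
corpus = [
    [0.9, 0.1, 0.0, 0.4],
    [0.8, 0.2, 0.5, 0.0],
    [0.0, 1.0, 0.1, 0.1],
    [0.1, 0.0, 0.9, 0.3],
]
query = [1.0, 0.1, 0.0, 0.5]

result = two_stage_search(query, corpus, prefix_dim=2, shortlist_size=3, top_k=2)
print(result)
```

In this toy case the prefix pass ranks document 3 second, and the full-vector rerank demotes it. That is the kind of recall-versus-cost trade-off the shortlist size controls.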
Notes on cross-cutting feasibility
- Performance profile: Strong on text, visual documents, image classification/clustering, multilingual image/audio retrieval; weaker on general video and audio clustering.
- Efficiency: Training only projectors yields 1.8×–3.9× faster training and lower memory; enables economical domain adaptation.
- Stability: Text geometry is preserved exactly for the released models, avoiding text reindexing and downstream regression risk.
- Engineering dependencies: Modality delimiters and dynamic tower loading must be integrated; vector DBs should support cosine/L2 on normalized vectors and, ideally, Matryoshka prefix search.
- Data/licensing: Verify licenses for Qwen-derived encoders and any domain encoders/data; establish governance for safety, bias, and privacy.
- Evaluation: Always benchmark on target domains/languages and monitor drift; calibrate thresholds for classification/moderation tasks.
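The note that vector DBs should support cosine or L2 on normalized vectors rests on a standard identity: for unit-length vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, so both metrics produce the same ranking. The short check below illustrates this with made-up vectors; the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_unit(a, b):
    """Cosine similarity; valid as a plain dot product because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

cos_uv = cosine_unit(u, v)
dist_sq = squared_l2(u, v)

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(abs(dist_sq - (2.0 - 2.0 * cos_uv)) < 1e-12)
```

One practical consequence: after Matryoshka truncation, vectors need to be renormalized before this equivalence holds again.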
Glossary
- AdamW optimizer: A variant of Adam that decouples weight decay from the gradient update to improve generalization. "We use the AdamW optimizer~\cite{adamw} with , , weight decay $0.01$, and global gradient clipping at ."
- affine map: A linear transformation followed by a bias term. "We write each fully connected layer as the same affine map"
- bf16 mixed precision: Training with bfloat16 numerical format for some tensors to reduce memory/compute while maintaining stability. "Training uses bf16 mixed precision and distributed data parallelism across $4$ NVIDIA H100 GPUs, with global batch size $256$ paired examples."
- bidirectional in-batch InfoNCE: A contrastive learning objective computed both left-to-right and right-to-left within a batch. "Projector training uses bidirectional in-batch InfoNCE with Matryoshka representation learning."
- compositional reasoning: Evaluations that require understanding combinations of attributes/relations to infer meaning. "covers classification, clustering, visual semantic textual similarity (STS), retrieval, document retrieval, compositional reasoning, and vision-centric tasks."
- contrastive image--text embedding: Learning to align images and text by pulling matched pairs together and pushing mismatched pairs apart in embedding space. "CLIP~\cite{clip} established contrastive image--text embedding with separately encoded image and text towers"
- distributed data parallelism: Training across multiple devices/processes that each handle different data shards and synchronize gradients. "Training uses bf16 mixed precision and distributed data parallelism across $4$ NVIDIA H100 GPUs"
- dynamic adapter selection: Automatically choosing task-specific adapters at runtime based on the input/task. "Jina Embeddings v5 Text already uses dynamic adapter selection to route retrieval, classification, clustering, and text-matching inputs through the corresponding task adapter."
- dynamic weight loading: On-demand loading/activation of parameter subsets conditioned on the task or modality. "Dynamic Weight Loading"
- embedding geometry: The spatial arrangement/structure of vectors in an embedding space that affects similarity and retrieval behavior. "Text embedding models anchor retrieval, retrieval-augmented generation (RAG)~\cite{rag}, and classification pipelines whose vector indexes depend on a stable embedding geometry."
- frozen towers: Encoder components (e.g., vision/audio backbones) kept fixed (not updated) during training. "Frozen towers feed trainable modality projectors into the frozen text backbone; task-specific exports select one projector/delimiter set and the matching LoRA adapter."
- frozen-encoder model composition: Building a multimodal model by connecting pretrained, frozen modality encoders to a frozen text model via small trained connectors. "In this work, we introduce frozen-encoder model composition, a novel approach to multimodal embedding models."
- fully connected layer: A dense neural layer where every input unit connects to every output unit. "We write each fully connected layer as the same affine map"
- GELU: Gaussian Error Linear Unit, a smooth nonlinearity used in transformers. "applying LayerNorm, a spatial merge, fc_vision_1, GELU, and fc_vision_2."
- global gradient clipping: Limiting the overall gradient norm to stabilize training. "with , , weight decay $0.01$, and global gradient clipping at ."
- language-space alignment: Aligning non-text encoder outputs with a language model’s representation space to enable text-conditioned tasks. "visual and audio features need explicit language-space alignment or natural-language supervision before they transfer reliably to text-conditioned multimodal tasks"
- last-token pooling: Using the representation of the final token in a sequence as the pooled embedding. "the final embedding is produced by last-token pooling and L2 normalization."
- LayerNorm: Layer normalization, which normalizes features across channels for each token. "applying LayerNorm, a spatial merge, fc_vision_1, GELU, and fc_vision_2."
- L2 normalization: Scaling a vector to unit length using the Euclidean norm. "the final embedding is produced by last-token pooling and L2 normalization."
- LoRA adapters: Low-Rank Adaptation modules that add trainable low-rank updates to large models for efficient task specialization. "using LoRA adapters to optimize them for multiple tasks"
- Matryoshka representation learning: Training embeddings so that prefixes (first k dimensions) remain useful, enabling truncation with graceful degradation. "Projector training uses bidirectional in-batch InfoNCE with Matryoshka representation learning."
- Matryoshka truncation: Reducing embedding dimensionality by keeping only the first k dimensions while retaining performance. "ablations on projector training, encoder choice, and Matryoshka truncation, and separately quantify training efficiency."
- modality delimiters: Special tokens marking the start and end of non-text modality segments within a unified token sequence. "non-text modalities are represented by placeholder runs inside modality delimiters."
- modality gap: A discrepancy between different modalities’ regions in a shared embedding space that can harm cross-modal alignment. "However, contrastively-trained multimodal embedders suffer from a gap between modality-specific regions of the shared representation space"
- modality projectors: Trainable layers that map modality-specific encoder features into the text model’s hidden space. "Frozen towers feed trainable modality projectors into the frozen text backbone"
- modality towers: Separate encoder branches per modality (e.g., vision, audio) within a multi-tower architecture. "the model exposes a modality attribute that controls which frozen modality towers are instantiated"
- moment retrieval: Locating the temporal segment in a video corresponding to a query. "MMEB-Video, covering classification, VQA, retrieval, and moment-retrieval sub-tasks."
- nDCG@10: Normalized Discounted Cumulative Gain at rank 10, a ranking quality metric emphasizing higher ranks. "Curves show mean nDCG@10; line style indicates modality and color shade indicates model size."
- pixel shuffle/sub-pixel rearrangement: A reorganization that increases spatial resolution by redistributing channel information into space. "it is the inverse direction of pixel shuffle/sub-pixel rearrangement~\cite{pixel-shuffle}"
- RAG (Retrieval-Augmented Generation): Augmenting generation by retrieving relevant documents to condition the model. "Text embedding models anchor retrieval, retrieval-augmented generation (RAG)~\cite{rag}, and classification pipelines"
- reranking: Reordering retrieved candidates using a stronger or different model/criteria for improved relevance. "evaluates text-only embedding quality across retrieval, classification, clustering, semantic textual similarity, reranking, and pair-classification tasks."
- semantic textual similarity (STS): Measuring how similar in meaning two texts (or text–image pairs in visual STS) are. "visual semantic textual similarity (STS)"
- space-to-depth (pixel-unshuffle): A fixed rearrangement that reduces spatial resolution by concatenating neighboring patches along the channel dimension. "The spatial merge is a fixed space-to-depth (pixel-unshuffle) rearrangement that concatenates four neighboring patch embeddings into one"
- temperature (contrastive loss): A scaling factor controlling the sharpness of similarity distributions in contrastive objectives. "With temperature , "
- Vision-Language Model (VLM): A model architecture that processes visual inputs through a connector into an LLM. "We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for an LLM"
- ViT patch tokens: Token embeddings produced by a Vision Transformer for image patches. "the Qwen3.5 visual projector converts ViT patch tokens into text-token features by applying LayerNorm, a spatial merge, fc_vision_1, GELU, and fc_vision_2."
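The space-to-depth (pixel-unshuffle) entry above describes a fixed rearrangement that concatenates four neighboring patch embeddings into one. A minimal sketch of a 2×2 merge is shown below. The function name, the nested-list representation, and the concatenation order (top-left, top-right, bottom-left, bottom-right) are assumptions for illustration; the source does not specify the order used by the actual projector.

```python
def space_to_depth_2x2(grid):
    """Merge each 2x2 block of patch embeddings into one longer embedding.

    grid: H x W nested list, each cell a list of C floats.
    Returns an (H/2) x (W/2) grid whose cells have 4*C floats.
    """
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # Assumed order: top-left, top-right, bottom-left, bottom-right.
            row.append(grid[r][c] + grid[r][c + 1]
                       + grid[r + 1][c] + grid[r + 1][c + 1])
        merged.append(row)
    return merged

# Toy 4x4 grid of 2-d patch embeddings; the values encode the patch position.
grid = [[[float(10 * r + c)] * 2 for c in range(4)] for r in range(4)]
merged = space_to_depth_2x2(grid)

print(len(merged), len(merged[0]), len(merged[0][0]))
print(merged[0][0])
```

The merge cuts the number of visual tokens by a factor of four while keeping every input value, which is why it has no trainable parameters.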