VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Published 11 Dec 2025 in cs.CV | (2512.10942v1)

Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.

Summary

  • The paper introduces a joint embedding predictive framework that replaces token-based generation with latent semantic prediction to improve efficiency.
  • The model architecture uses separate encoders and a transformer-based predictor with InfoNCE loss to align visual and textual embeddings.
  • Empirical results show enhanced learning speed, sample efficiency, and decoding efficiency across tasks like VQA, captioning, and retrieval.

VL-JEPA: A Joint Embedding Predictive Architecture for Vision-Language Tasks

Introduction and Motivation

VL-JEPA replaces classical token-space autoregressive generation with a latent-space joint embedding predictive architecture. Vision-language models (VLMs) traditionally decode text tokens autoregressively, which conflates the modeling of task-relevant semantics with irrelevant surface-level linguistic variation. This inflates model complexity and computational cost, particularly in streaming and real-time applications that require prompt, semantics-driven responses and selective output emission.

VL-JEPA, in contrast, predicts continuous semantic embeddings for the target text, circumventing the need to model linguistically diverse token sequences. By operating within an abstract, semantically rich embedding space rather than the combinatorially sparse token space, VL-JEPA reduces both the learning burden and inference latency. Selective decoding—emitting human-readable output only when a significant semantic change is detected—emerges naturally from this architecture, providing significant efficiency improvements.

Architectural Overview

VL-JEPA's design comprises four modular components: an X-Encoder for visual inputs, a Y-Encoder for target texts, a transformer-based Predictor, and a lightweight Y-Decoder used solely to convert embeddings to text at inference. The core learning objective is the InfoNCE loss in embedding space, which aligns predicted and ground-truth semantic embeddings while enforcing uniformity and averting representation collapse.
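For reference, one commonly used form of the InfoNCE objective over a batch of N pairs is written out below. This is a sketch under stated assumptions: the symbols $\hat{s}_i$ (the Predictor's output for sample i), $s_i$ (the Y-Encoder embedding of the matching target text), the cosine similarity, and the temperature $\tau$ are notation introduced here for illustration, and the paper's exact variant (for example, whether the loss is symmetrized) may differ.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\left(\mathrm{sim}(\hat{s}_i, s_i)/\tau\right)}
              {\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\hat{s}_i, s_j)/\tau\right)},
\qquad
\mathrm{sim}(a, b) = \frac{a^{\top} b}{\lVert a \rVert \, \lVert b \rVert}
```

The numerator rewards alignment between each prediction and its own target, while the denominator penalizes similarity to the other targets in the batch; the second effect is what spreads embeddings apart and discourages collapse to a single point.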

The model employs strong frozen vision backbones (V-JEPA 2 ViT-L) and initializes its textual components from high-performing embedding models (EmbeddingGemma-300M for Y-Encoder, Llama-3 layers for the Predictor). Training proceeds in two stages: large-scale, query-agnostic vision-language alignment pretraining on image/video-caption pairs, followed by supervised finetuning (SFT) with question-answer supervision, equipping the model with robust VQA and generative capabilities.

Figure 1: VL-JEPA model architecture showing the distinct X-Encoder, Y-Encoder, Predictor, and Decoder modules, highlighting the shift from token prediction to embedding prediction.

Figure 2: (Left) VL-JEPA predicts the target embedding $S_Y$ instead of reconstructing the raw textual target $Y$, as classical VLMs do. (Right) Applications include captioning, selective decoding, discriminative VQA, classification, and text-to-video retrieval within the unified architecture.

Empirical Results and Analysis

Vision-Language Generation, Classification, and Retrieval

VL-JEPA demonstrates high performance across standard benchmarks. The base model, after only 2B vision-language pairs, surpasses strong baselines (CLIP, SigLIP2, PE-Core) in zero-shot classification and retrieval, especially on motion-centric datasets (e.g., SSv2, EK-100, EgoExo4D). After SFT, the model achieves performance on par with, or exceeding, state-of-the-art specialist models, while maintaining a unified, generalist architecture with just 1.6B parameters.

Visual Question Answering (VQA)

The SFT version of VL-JEPA matches or exceeds competitive VLM baselines on compositional visual reasoning (GQA), complex counting (TallyQA), and object hallucination benchmarks (POPE, POPEv2). These results were achieved with architectures that are both smaller and more parameter-efficient than many token-generative VLMs.

Embedding Prediction vs. Token Prediction: Controlled Study

A critical experiment contrasts embedding prediction (VL-JEPA) against token prediction (standard VLM) under matched conditions (same vision encoder, training data, and batch size). VL-JEPA achieves steeper learning curves, higher sample efficiency, and superior final performance on video captioning (14.8 vs. 7.1 zero-shot CIDEr) and classification (41.0% vs. 27.2% top-5 accuracy) after 15M samples.

Figure 3: Embedding prediction (VL-JEPA) outpaces token prediction (VLM) in learning speed and final accuracy under strictly matched training settings; right—VL-JEPA halves parameter count and maintains lower inference time.

Moreover, the decoupling of semantic prediction and output decoding in VL-JEPA allows classification and retrieval tasks to be performed using only the embedding prediction modules, reserving the decoder solely for text generation scenarios.
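A minimal sketch of how classification and retrieval can run on embeddings alone, with no decoder involved, is given below. It assumes cosine similarity as the scoring function and uses toy 3-dimensional vectors in place of real Predictor and Y-Encoder outputs; the function names `classify` and `rank` and all numeric values are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(predicted, label_embeddings):
    """Open-vocabulary classification: pick the label nearest to the prediction."""
    return max(label_embeddings, key=lambda name: cosine(predicted, label_embeddings[name]))


def rank(query_embedding, video_embeddings):
    """Text-to-video retrieval: video ids sorted from most to least similar."""
    return sorted(
        video_embeddings,
        key=lambda vid: cosine(query_embedding, video_embeddings[vid]),
        reverse=True,
    )


# Hypothetical label embeddings (stand-ins for Y-Encoder outputs).
labels = {
    "pouring water": [0.9, 0.1, 0.0],
    "cutting bread": [0.0, 1.0, 0.2],
    "opening door": [0.1, 0.0, 1.0],
}
predicted = [0.8, 0.2, 0.1]  # stand-in for a Predictor output
print(classify(predicted, labels))  # pouring water

# Hypothetical per-video embeddings and a text-query embedding.
videos = {"v1": [0.0, 1.0, 0.1], "v2": [1.0, 0.0, 0.0], "v3": [0.5, 0.5, 0.0]}
query = [0.9, 0.1, 0.0]
print(rank(query, videos))  # ['v2', 'v3', 'v1']
```

Because the candidate labels are just embedded strings, the label set can be changed at inference time without retraining; discriminative VQA follows the same pattern with candidate answers in place of class names.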

Selective Decoding and Streaming Efficiency

VL-JEPA’s architecture naturally enables embedding-guided selective decoding in streaming scenarios. This approach monitors the predicted embedding stream and triggers text decoding only when a significant semantic shift is observed, in contrast to the uniform interval-based sampling required by autoregressive VLMs. In experiments on long-form video, VL-JEPA’s selective decoding reduces the number of decoding operations by approximately 2.85× at equivalent CIDEr performance.
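The decode-on-change idea can be illustrated with a simple online rule: decode the first embedding, then decode again only when the current embedding has drifted far enough from the last decoded one. This is a sketch under assumptions, not the paper's procedure; the cosine-distance criterion, the `threshold` value, and the function name `selective_decode_steps` are all hypothetical choices made for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means more different."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def selective_decode_steps(stream, threshold):
    """Return the time steps at which the text decoder would be invoked.

    The first embedding is always decoded. Afterwards, a step is decoded only
    if its distance to the most recently decoded embedding exceeds `threshold`.
    """
    decoded_steps = []
    anchor = None  # embedding at the last decoded step
    for t, embedding in enumerate(stream):
        if anchor is None or cosine_distance(anchor, embedding) > threshold:
            decoded_steps.append(t)
            anchor = embedding
    return decoded_steps


# Toy 2-d stream: three near-identical frames, then a clear semantic shift.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.1, 1.0], [0.05, 0.99], [0.0, 1.0]]
steps = selective_decode_steps(stream, threshold=0.2)
print(steps)  # [0, 3]
```

In this toy stream the decoder runs twice instead of six times. The threshold controls the trade-off between missed events (too high) and redundant captions (too low), so it would need tuning per domain.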

Figure 4: Embedding-guided selective decoding (blue) enables substantial reduction in decoding operations compared to uniform sampling (red) with no loss in output quality—measured by temporal annotation CIDEr on EgoExo4D.

Model Ablations and Hard Negative Text Sensitivity

Extensive ablation studies confirm that pretraining on massive caption data, appropriate learning rate scaling for the Y-Encoder, use of the contrastive InfoNCE loss, and increased Predictor depth all contribute positively to performance. The Y-Encoder is shown to yield embeddings that are robust to hard-negative texts (SugarCrepe++, VISLA), with VL-JEPA outperforming strong CLIP-style and SigLIP2 encoders on semantic and lexical triplet tests.

Theoretical and Practical Implications

VL-JEPA bridges the architectural efficiency of joint embedding models (e.g., CLIP) and the flexible generation abilities of VLMs, supporting classification, retrieval, and text generation from visual input in a single, scalable architecture. The findings underscore the inefficiency of token-space modeling when targets are ambiguous or admit many equally valid phrasings, and highlight the advantages of continuous latent representation for both training and inference.

By supporting native, semantics-aware selective decoding, VL-JEPA caters to streaming video understanding and real-time agentic applications, where computational efficiency and responsiveness are paramount. The ability to scale performance through dataset and model scaling, without the quadratic explosion of compute associated with autoregressive token decoding, is particularly salient for large-scale, always-on AI systems.

Future Directions

VL-JEPA opens several future avenues:

  • Scaling up pretraining data and model capacity to close remaining gaps with ultra-large token-generative VLMs on knowledge-intensive and reasoning-heavy tasks
  • Integrating visual reasoning and chain-of-thought mechanisms directly within the joint embedding space, potentially paving the way for more abstract, multimodal latent-space reasoning engines
  • Application to robotics and embodied AI, leveraging VL-JEPA’s real-time streaming capabilities and efficient multi-task inference
  • Exploring finetuning and adaptation protocols to enhance tool-use, agentic behaviors, and cross-modal retrieval/generation beyond the current evaluation scope

Conclusion

VL-JEPA establishes that shifting supervision from discrete token space to continuous embedding space enables simpler, more efficient, and highly capable vision-language models. The architecture achieves superior sample efficiency and inference speed, strong empirical performance across classification, retrieval, generation, and VQA, and natively supports efficient selective decoding for streaming applications. These characteristics position VL-JEPA as a promising alternative for unified, multimodal semantic modeling and set the stage for continued exploration of latent space reasoning in AI architectures (2512.10942).

Explain it Like I'm 14

Simple Summary of the Paper: VL-JEPA

Overview

This paper introduces VL-JEPA, a new kind of vision-language model. Vision-language models are computer programs that understand pictures or videos and produce text (like captions or answers to questions). Instead of writing the answer word-by-word like most models, VL-JEPA predicts the “meaning” of the answer first in a special form called an embedding, and only turns that meaning into text when needed. This makes it faster, more efficient, and better suited for real-time tasks like smart glasses, robots, and live video understanding.

What Questions Does the Paper Try to Answer?

The paper explores simple, practical questions:

  • Can a model that predicts “meaning” (embeddings) instead of words (tokens) learn faster and use fewer resources?
  • Will this approach work well across many tasks like captioning, classification, retrieval, and visual question answering (VQA)?
  • Can it handle real-time video better by only decoding text when the meaning actually changes?
  • How does this new method compare to popular models like CLIP, SigLIP2, and big VLMs such as InstructBLIP or Qwen-VL?

How Does VL-JEPA Work? (Explained Simply)

Think of answering a question about a video like writing a message:

  • Traditional models “type” the message one word at a time. This is called autoregressive generation. It’s slow and cares a lot about exact wording.
  • VL-JEPA first decides the message’s meaning in a compressed form (an embedding), then writes it out as text only when necessary.

Here’s the main setup:

  • X-Encoder: This part looks at the image/video and turns it into compact visual signals (like “visual notes”).
  • Predictor: This part takes the visual notes and the user’s question, and predicts the “meaning vector” of the answer.
  • Y-Encoder: This turns the correct answer (during training) into its own meaning vector so the model knows what to aim for.
  • Y-Decoder: This last step turns the predicted meaning vector back into plain text—but only if you actually need text output.

Training in “meaning space” uses a method called InfoNCE. In everyday terms: it teaches the model to pull matching pairs (video + correct answer) closer together and push non-matching pairs farther apart. That prevents everything from blending into one and keeps meanings distinct.
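To make the "pull together, push apart" idea concrete, here is a small sketch of an InfoNCE-style loss in plain Python. It is only an illustration: the temperature of 0.1, the use of cosine similarity, the function name `info_nce`, and the tiny 2-number "meaning vectors" are all assumptions made for this example, not values from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(predicted, targets, temperature=0.1):
    """Mean InfoNCE loss, where predicted[i] is supposed to match targets[i]."""
    total = 0.0
    for i, p in enumerate(predicted):
        logits = [cosine(p, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all targets in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(predicted)


targets = [[1.0, 0.0], [0.0, 1.0]]   # "correct answer" meaning vectors
good = [[0.9, 0.1], [0.1, 0.9]]      # predictions close to their own targets
bad = [[0.1, 0.9], [0.9, 0.1]]       # predictions close to the wrong targets

print(info_nce(good, targets) < info_nce(bad, targets))  # True
```

The loss is small when each prediction sits next to its own target and far from the others, and large when predictions land near the wrong targets, which is the behavior the training procedure rewards.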

Two training stages:

  1. Pretraining on lots of captions to align vision and language well.
  2. Supervised fine-tuning (SFT) to make the model good at answering questions (VQA) and other tasks.

Selective decoding for streaming video:

  • VL-JEPA produces a steady stream of meanings over time while watching a video.
  • It only converts those meanings to text when the meaning changes enough. This saves time and compute, like only sending a message when there’s new information.

What Did They Find?

Here are the main results the authors report, summarized in plain language:

  • Better learning with fewer moving parts: When trained under the same conditions as a traditional word-generating model, VL-JEPA learned faster and achieved higher scores—while using about 50% fewer trainable parameters.
  • Strong results across tasks:
    • Classification and retrieval: VL-JEPA outperformed strong baselines like CLIP, SigLIP2, and Perception Encoder on average across 8 video classification and 8 retrieval datasets.
    • VQA (answering questions about images/videos): With fine-tuning, VL-JEPA reached performance similar to well-known models (InstructBLIP, Qwen-VL) on several VQA benchmarks, despite having only 1.6 billion parameters.
    • World modeling task: VL-JEPA set a new best result by correctly identifying the action that explains a change between two images, beating some very large language models.
  • Real-time efficiency: By decoding only when the predicted meaning changes significantly, VL-JEPA cut the number of decoding operations by about 2.85× while keeping output quality the same.

Why Is This Important?

  • Faster and more efficient: Predicting meaning instead of words avoids wasting effort on small differences in phrasing. It makes the model more responsive and cheaper to run.
  • Real-time friendly: For smart glasses, robots, and live video, you need quick updates. VL-JEPA’s “always-on meaning stream” supports that, decoding text only when necessary.
  • One model, many jobs: The same system handles captioning, classification, retrieval, and VQA—without complicated add-ons or special modes.
  • Lower cost, broader access: Using fewer trainable parameters and less decoding makes this approach more practical for devices and applications that don’t have huge computing resources.

Final Thoughts and Impact

VL-JEPA shows that focusing on the “meaning” of answers first can make vision-language models both smarter and faster. That’s good news for real-world applications where timing and efficiency matter, such as helping users with step-by-step instructions, monitoring environments, or assisting robots with planning. The paper also hints at exciting future directions, such as improving the loss functions, expanding training data, and pushing even further on streaming performance. Overall, VL-JEPA is a promising step toward more practical, versatile, and responsive AI systems.

Knowledge Gaps

Below is a single, focused list of the paper’s unresolved knowledge gaps, limitations, and open questions that future work could concretely address:

  • Clarify and evaluate the y-decoder training and decoding process: the paper states the y-decoder is “invoked only when needed” and is not trained during the main phase, but does not specify how embeddings are converted to text (e.g., learned inverse, retrieval, template, or LLM-based decoding), nor the supervision used to train/align the decoder. Provide a detailed decoding pipeline and compare alternatives.
  • Establish rigorous online selective decoding methods: the current evaluation uses offline agglomerative clustering on the embedding stream. Develop true online change-point detection with latency/throughput constraints, measure false positives/negatives for event boundaries, and compare thresholding strategies across domains.
  • Quantify latency, throughput, and memory under streaming workloads: current inference cost measurements focus on text generation only. Report end-to-end streaming latency (including x-encoder, predictor, y-decoder), GPU memory and energy usage, KV-cache behavior (if any), batching effects, and hardware utilization, versus competitive streaming VLM baselines.
  • Expand evaluation beyond discriminative VQA to open-ended generation: all VQA results are discriminative (nearest label in embedding space). Assess open-ended multi-turn dialogue, chain-of-thought reasoning, long-form explanations, structured outputs (lists, tables), and exact formatting requirements to test generative adequacy of embedding-to-text decoding.
  • Investigate multilingual and cross-lingual capabilities: the y-encoder and evaluations appear monolingual. Test multilingual queries/answers, cross-lingual retrieval/classification, and embedding alignment across languages; explore multilingual y-encoder initializations and training schedules.
  • Study open-set recognition and “none-of-the-above” calibration: similarity-based selection lacks explicit calibration for unknown classes/answers. Introduce confidence thresholds, abstention mechanisms, and evaluate open-set detection and “no match” behavior.
  • Analyze embedding geometry and semantic fidelity: verify that paraphrases and semantically equivalent answers are clustered (as claimed) using human-judged semantic similarity datasets; quantify how distances correlate with human judgments and task accuracy.
  • Compare JEPA losses beyond InfoNCE under matched conditions: the paper mentions VICReg/SIGReg but does not evaluate them. Benchmark non-contrastive JEPA regularizers, EMA/frozen y-encoders, and anti-collapse strategies for stability, sample efficiency, and performance.
  • Probe robustness to noisy and synthetic captions: large portions of pretraining rely on auto-generated captions (e.g., Action100M). Quantify sensitivity to label noise, domain shift, and spurious correlations; test denoising or label-quality weighting under JEPA.
  • Detail and test selective decoding criteria: specify quantitative thresholds, window sizes, pooling strategies, and drift metrics; evaluate trade-offs between missed events and duplicate captions; test generality across domains beyond EgoExo4D.
  • Provide a fairer and broader controlled comparison to generative VLMs: the baseline uses a 1B LLM with a frozen PE ViT-L, but generative systems can differ in optimization and video conditioning. Include multiple token-space baselines (with matched compute, optimization, and streaming settings) and report compute budgets (FLOPs, wall-clock).
  • Evaluate performance on appearance-centric datasets where VL-JEPA is weaker: the paper attributes this to fewer vision-language pairs. Systematically vary data mixtures (appearance vs motion) and analyze how JEPA responds to different pretraining compositions.
  • Characterize failure modes: present qualitative/quantitative analyses of when embedding prediction produces incorrect or unstable semantics (e.g., rare words, fine-grained attributes, spatial relations), and how this impacts downstream decoding.
  • Examine long-horizon temporal consistency: measure whether predicted embeddings maintain coherent semantics across extended videos, handle repeated/undone actions, and whether “semantic drift” occurs in sliding windows.
  • Assess safety and hallucination beyond POPE/POPEv2: discriminative evaluation may mask hallucination patterns. Include generative hallucination metrics (CHAIR/CHAIR-V2, ObjectScore), toxic content avoidance, and safe decoding policies for embedding-to-text.
  • Test integration with action/planning/robotics loops: while world modeling is evaluated, there is no end-to-end closed-loop test (e.g., real-time procedural assistance or robot policies). Measure action-latency, decision quality, and failure recovery under embedding-driven control.
  • Investigate scalability and model size trade-offs: quantify how performance and efficiency scale with predictor/y-encoder sizes, embedding dimensionality, and number of visual frames; include ablations on pooling, attention masks, and fusion strategies.
  • Clarify reproducibility and data availability: core training relies on internal Action100M and large proprietary mixes. Release data or strong public substitutes, along with training scripts, preprocessing, and evaluation code to enable replication.
  • Evaluate the y-encoder’s text-only quality comprehensively: current TOT benchmarks (SugarCrepe++, VISLA) are limited. Add broader textual similarity and retrieval tasks (e.g., MTEB), multi-domain QA, and test whether joint training harms or helps pure text performance.
  • Analyze calibration of similarity scores across tasks: distances in embedding space are used for classification, retrieval, and VQA. Calibrate scores per-task, evaluate temperature scaling/learned thresholds, and study cross-task transfer of calibration.
  • Compare decoding frequency reductions across diverse datasets: the 2.85× reduction is shown on EgoExo4D. Test on other long-form domains (sports, surveillance, instructional videos) and quantify generality of the Pareto improvements.
  • Formalize the theoretical advantages of embedding prediction: provide an analysis linking multi-modal target distributions to learning efficiency (e.g., bias-variance, mode aggregation), and derive conditions where JEPA yields provable gains over token prediction.
  • Address tasks requiring exact lexical fidelity: study cases where surface form matters (codes, equations, named entities, timestamps, URIs), and evaluate whether embedding-to-text decoding can preserve exact forms or requires hybrid generative components.
  • Report comprehensive compute budgets: beyond “fewer trainable parameters,” provide total training FLOPs, wall-clock time, and energy vs baselines; quantify if JEPA’s efficiency holds when normalized for compute across training and inference.

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging VL-JEPA’s non-autoregressive, embedding-predictive design and its demonstrated performance in classification, retrieval, VQA, and streaming selective decoding.

  • Smart glasses and wearables for procedural assistance (healthcare, manufacturing, education)
    • Description: Continuous monitoring of user actions with low latency; selectively decode descriptions only when a meaningful change is detected (e.g., new step begins), reducing battery and compute cost.
    • Tools/workflows: “Event-aware captioner” pipeline: frames → X-Encoder → Predictor → embedding stream → change detector (variance/cluster-based) → Y-Decoder on demand → guidance prompts.
    • Assumptions/dependencies: Access to on-device or edge accelerators; careful tuning of decoding thresholds; domain-specific supervised finetuning can improve accuracy on target procedures.
  • Real-time video analytics with cost-aware selective decoding (media platforms, CX ops, call centers)
    • Description: Live captioning and action tracking with ~2.85× fewer decode operations at comparable quality; highlights generation, moment-of-interest detection, and adaptive logging.
    • Tools/workflows: “Selective Decoder” microservice with embedding clustering and variance thresholds; integration with streaming backends (e.g., Kafka-like event buses).
    • Assumptions/dependencies: Stable streaming pipeline and clustering hyperparameters; handling domain shift in appearance-centric scenarios (consider additional fine-tuning).
  • Open-vocabulary classification for dynamic taxonomies (e-commerce, digital asset management, content libraries)
    • Description: Label assets against evolving vocabularies without retraining the classifier; embed candidate labels and select nearest to predicted embeddings.
    • Tools/products: “Open-Vocab Tagger” service with plug-in label lists and human-in-the-loop verification; MLOps integration for taxonomy updates.
    • Assumptions/dependencies: Quality of Y-Encoder embeddings for nuanced categories; periodic audit to manage bias and mislabeling.
  • Text-to-video retrieval for knowledge and media search (enterprise search, training content, education)
    • Description: Rank large video libraries by semantic similarity to text queries; strong performance on motion-centric datasets and instructional videos.
    • Tools/workflows: “JEPA-based Video Indexer” for DAM/knowledge systems; retrieval prompts tailored for domain (e.g., “Find steps showing X”).
    • Assumptions/dependencies: Scalable indexing of predicted embeddings; better performance when videos emphasize actions/steps versus static appearance.
  • Discriminative VQA for constrained-answer interfaces (industrial QA, forms, compliance checks)
    • Description: Multiple-choice or constrained-option visual QA with a single unified model; reduces need for large generative decoders during inference.
    • Tools/products: VQA scorer that embeds answer sets and selects nearest match to predicted embeddings; dashboards for batch QA.
    • Assumptions/dependencies: Well-formed answer sets; calibration for edge cases; minimal reliance on world knowledge beyond perception.
  • Efficient content annotation and captioning (media production, educational content, accessibility)
    • Description: Generate captions only at semantic change points (selective decoding), lowering annotation costs for long-form content while maintaining fidelity.
    • Tools/workflows: Batch captioning with adaptive segment selection (agglomerative clustering + average pooling for denoising).
    • Assumptions/dependencies: Clustering stability across varied content; human QA for final captions in production settings.
  • Accessibility: low-latency scene description with reduced energy usage (daily life)
    • Description: Real-time descriptive assistance for visually impaired users with fewer decode operations and prompt updates.
    • Tools/workflows: Mobile assistant app with embedding-change detection; on-demand text decoding; voice synthesis modules.
    • Assumptions/dependencies: On-device acceleration or efficient edge offload; careful UX design to avoid over/under-notification.
  • Energy- and cost-efficient model deployment (software, cloud, edge)
    • Description: Fewer trainable parameters and decoupled decoding enable more cost-effective training and inference, especially in always-on scenarios.
    • Tools/workflows: Deployment with quantization/pruning; dynamic decoding schedules; inference observability for embedding drift.
    • Assumptions/dependencies: Hardware compatibility; licensing of pretrained components (V-JEPA2, EmbeddingGemma); privacy-respecting data pipelines.
  • Research and teaching: JEPA vs. token-generation studies (academia)
    • Description: Reproduce controlled comparisons; study sample efficiency and performance trade-offs; design curricula on non-generative VLMs and streaming AI.
    • Tools/workflows: Teaching labs comparing CIDEr/accuracy under matched encoders/datasets; ablation on Y-Encoder, losses, and predictor layers.
    • Assumptions/dependencies: Access to datasets and GPUs; alignment of evaluation protocols.
  • Policy and compliance: privacy-aware streaming analytics (public sector, corporate governance)
    • Description: Event-triggered decoding implements data minimization by default; reduced textual extraction lowers privacy risk in continuous monitoring.
    • Tools/workflows: Governance templates for “decode-on-change” pipelines; audit logs of decoding triggers; PII minimization policies.
    • Assumptions/dependencies: Clear data retention policies; stakeholder buy-in; transparency about embedding monitoring and trigger criteria.
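The batch captioning workflow listed above mentions agglomerative clustering followed by average pooling. One way this could look is sketched below: adjacent segments of the embedding stream are merged greedily while they remain similar, and each final segment is averaged into a single embedding that would be decoded once. The adjacent-only merging rule, the cosine distance, the `max_distance` stopping threshold, and the function names are assumptions made for illustration; the paper's actual settings are not given in this summary.

```python
import math


def mean(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def segment_stream(stream, max_distance):
    """Temporally constrained agglomerative clustering.

    Starts with one segment per frame and repeatedly merges the most similar
    pair of neighbouring segments, stopping once the closest pair is farther
    apart than `max_distance`.
    """
    segments = [[embedding] for embedding in stream]
    while len(segments) > 1:
        distances = [
            cosine_distance(mean(segments[i]), mean(segments[i + 1]))
            for i in range(len(segments) - 1)
        ]
        i = min(range(len(distances)), key=distances.__getitem__)
        if distances[i] > max_distance:
            break
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments


# Toy 2-d stream with one clear semantic change after the third frame.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.1, 1.0], [0.05, 0.99], [0.0, 1.0]]
segments = segment_stream(stream, max_distance=0.2)
pooled = [mean(segment) for segment in segments]  # one embedding to decode per segment

print([len(segment) for segment in segments])  # [3, 3]
```

Averaging within a segment smooths frame-level noise before decoding, so each segment yields one caption. Unlike an online threshold rule, this variant needs the whole stream up front and is therefore suited to offline annotation rather than live streaming.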

Long-Term Applications

These applications will benefit from further research, domain-specific finetuning, scaling, compression, and validation in safety-critical settings.

  • Embodied AI and robotics with continuous semantic streams (robotics, logistics, manufacturing)
    • Description: Use VL-JEPA’s embeddings as an “always-on semantic state” for planning/control; pair action candidates with state transitions (inverse dynamics).
    • Tools/workflows: “Semantic Event Bus” feeding planners; model-predictive control that monitors embeddings; teach-by-demonstration with action-effect retrieval.
    • Assumptions/dependencies: Closed-loop performance validation; robust detection of subtle state changes; safety certifications; integration with robot policies.
  • Surgical and clinical workflow monitoring (healthcare)
    • Description: Step recognition and change-point captioning in procedures; decision support with constrained VQA; reduction of documentation overhead.
    • Tools/products: OR monitoring tools; procedure step trackers; coding/documentation assistants with selective decoding.
    • Assumptions/dependencies: Regulatory approval, robust domain adaptation; strict privacy and consent frameworks; hospital IT integration.
  • Automotive and mobility situational awareness (transportation)
    • Description: Real-time detection of events and driver/environment changes; decode only when situation changes, reducing latency and compute.
    • Tools/workflows: Edge inference on ADAS/AV systems; embedding telemetry for incident analysis.
    • Assumptions/dependencies: Safety-critical validation; extreme robustness; hardware acceleration; legal/compliance constraints.
  • Large-scale video intelligence platforms (media, security operations, sports analytics)
    • Description: City-scale or platform-scale indexing via embeddings; highlight detection; action-centric retrieval; minimal text storage due to selective decoding.
    • Tools/workflows: Distributed embedding stores; adaptive decoding services; APIs for action-centric search.
    • Assumptions/dependencies: Scalable storage and retrieval; content governance; careful design to avoid misuse or privacy violations.
  • Standardization of event-driven decoding pipelines (software ecosystems, standards bodies)
    • Description: Define APIs/metrics for “decode-on-change,” semantic drift thresholds, and evaluation protocols; best practices to minimize computational and privacy footprints.
    • Tools/workflows: Reference implementations and benchmarks; MLOps patterns for embedding monitoring.
    • Assumptions/dependencies: Community adoption; cross-vendor interoperability; clear metrics for semantic change and quality.
  • Hardware co-design for JEPA-style models (semiconductor, edge devices)
    • Description: Accelerators optimized for X-Encoder + Predictor pipelines with occasional decoding; energy-efficient edge silicon for continuous embedding streams.
    • Tools/products: JEPA-friendly DSPs/NPUs; power-aware streaming schedulers.
    • Assumptions/dependencies: Ecosystem investment; model compression/distillation to fit edge constraints; reliability across diverse conditions.
  • Adaptive education and training assistants (education, workforce development)
    • Description: Real-time feedback in labs and workshops; detect learner’s step changes; decode guidance selectively to reduce distraction.
    • Tools/workflows: Classroom kits with smart cameras; learning analytics on embeddings; personalized prompts.
    • Assumptions/dependencies: Bias and fairness audits; privacy protections for learners; domain-specific finetuning for curricula.
  • Compliance analytics with dynamic taxonomies (finance, insurance, regulated industries)
    • Description: Open-vocabulary classification and constrained VQA for document-image/video reviews; update categories without retraining.
    • Tools/workflows: Review portals with JEPA-based taggers; audit trails; policy-triggered decoding.
    • Assumptions/dependencies: Strong explainability (distance-to-label embeddings); model governance; human oversight.
  • Consumer apps for personal media search and life-logging (daily life)
    • Description: Action-centric search over personal videos; sparse captioning to index memories without excessive text generation.
    • Tools/products: Mobile apps with on-device embeddings; “moment finder” features; privacy-first storage of embeddings.
    • Assumptions/dependencies: Small-model distillation; opt-in consent and data minimization; user controls for decoding frequency.
  • Multi-agent systems using shared semantic states (software, AI research)
    • Description: Agents consume shared embedding streams to coordinate tasks; decode selectively on agreed events.
    • Tools/workflows: “Semantic Blackboard” for agent collaboration; event contracts for decoding.
    • Assumptions/dependencies: Agreement on embedding spaces; synchronization protocols; robustness to drift across agents.
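
The "decode-on-change" and "semantic drift threshold" ideas mentioned in the list above can be illustrated with a minimal sketch. It is not the paper's procedure: it is a simple threshold rule, assuming each frame or window already has a non-zero embedding vector. The names `decode_on_change` and `drift_threshold` and the default value 0.2 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def decode_on_change(stream, drift_threshold=0.2):
    """Return the indices of embeddings that would be sent to a text decoder.

    The first embedding is always decoded. After that, an embedding is
    decoded only when its cosine distance from the last decoded embedding
    exceeds drift_threshold (an assumed, tunable value).
    """
    decode_at = []
    anchor = None  # last embedding that triggered decoding
    for i, emb in enumerate(stream):
        if anchor is None or 1.0 - cosine_similarity(anchor, emb) > drift_threshold:
            decode_at.append(i)
            anchor = emb
    return decode_at


# Toy stream: three near-identical embeddings, then a clear semantic change.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 0.99]]
events = decode_on_change(stream)
print(events)  # [0, 3]
```

In this toy stream only two of the five embeddings trigger decoding. A shared definition of the drift metric and threshold is the kind of thing a standardized decode-on-change API would need to pin down.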

Cross-cutting assumptions and dependencies

  • Domain alignment: VL-JEPA is particularly strong on motion-centric tasks; appearance-centric performance may require additional pretraining/fine-tuning.
  • Y-Encoder quality and calibration: Choice and tuning (e.g., EmbeddingGemma vs. alternatives) affect classification/VQA reliability.
  • Selective decoding hyperparameters: Thresholds/cluster sizes must be tuned to avoid missed or spurious events; average pooling helps stability.
  • Hardware and deployment constraints: Current models (~1.6B parameters) may require edge accelerators or cloud offload; compression/quantization/distillation will broaden on-device use.
  • Data, licensing, and privacy: Ensure rights for training/inference datasets; implement privacy-by-design (event-triggered decoding, data minimization, transparent logging).
  • Safety and governance: Open-vocabulary labeling and VQA can reflect biases; institute human-in-the-loop review and model audits, especially in regulated or safety-critical contexts.
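
The selective decoding point above (cluster sizes, average pooling) can be made concrete with a minimal sketch of agglomerative clustering under a temporal connectivity constraint, using the Ward criterion as the merge cost. This is an illustrative reimplementation under stated assumptions, not the authors' code: only temporally adjacent segments may merge, the number of segments is fixed in advance, and the function names (`segment_stream`, `pooled_embeddings`) are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def ward_cost(seg_a, seg_b):
    """Increase in within-cluster sum of squares if the two segments merge."""
    mean_a, mean_b = mean_vector(seg_a), mean_vector(seg_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    n_a, n_b = len(seg_a), len(seg_b)
    return n_a * n_b / (n_a + n_b) * sq_dist


def segment_stream(embeddings, n_segments):
    """Greedily merge temporally adjacent segments until n_segments remain."""
    segments = [[e] for e in embeddings]  # start with one segment per frame
    while len(segments) > n_segments:
        # Temporal connectivity: only neighbours (i, i + 1) are merge candidates.
        costs = [ward_cost(segments[i], segments[i + 1])
                 for i in range(len(segments) - 1)]
        i = costs.index(min(costs))
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments


def pooled_embeddings(segments):
    """Average-pool each segment into one embedding to decode."""
    return [mean_vector(seg) for seg in segments]


# Toy stream with three visually distinct phases of length 3, 2, and 1.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
segments = segment_stream(frames, n_segments=3)
pooled = pooled_embeddings(segments)
print([len(s) for s in segments])  # [3, 2, 1]
```

Here six embeddings are reduced to three decoder calls, one per pooled segment. Choosing `n_segments` (or an equivalent distance threshold) is the tuning knob the bullet above warns about: too few segments miss events, too many produce spurious ones.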

Glossary

  • Agglomerative clustering: A hierarchical clustering method that merges clusters based on similarity, often used to segment sequences. "We apply agglomerative clustering with temporal connectivity constraints"
  • Anti-collapse strategies: Techniques used to prevent learned representations from collapsing to trivial solutions in self-supervised learning. "Alternatively, the regularization term can be replaced by other anti-collapse strategies, such as using an exponential moving average (EMA) for the Y-Encoder or freezing the Y-Encoder."
  • Autoregressive token-by-token decoding: A generation process where tokens are produced sequentially, each conditioned on previously generated tokens, incurring latency. "VLMs rely on autoregressive token-by-token decoding, which must be completed before revealing the underlying semantics of $Y$."
  • Bi-directional InfoNCE loss: A contrastive learning objective applied in both directions (prediction-to-target and target-to-prediction) to align embeddings. "We train the Predictor and the Y-Encoder jointly with bi-directional InfoNCE loss, enabling them to mutually learn from each other."
  • Catastrophic forgetting: The tendency of a model to lose previously learned information when trained on new data. "downsampled pretraining stage data to avoid catastrophic forgetting."
  • Causal attention mask: An attention mechanism constraint that prevents a token from attending to future tokens, enabling autoregressive behavior. "We disable the causal attention mask so that both vision and query embeddings can be jointly attended."
  • CIDEr: A caption evaluation metric that measures consensus between generated captions and references using TF-IDF weighting. "Performance is measured by the average CIDEr score between each annotation $y$ and its closest decoded output $\hat{y}$."
  • CLIP-style evaluation protocol: An evaluation approach where text and image/video embeddings are compared via similarity for zero-shot tasks. "We evaluate VL-JEPA following the CLIP-style evaluation protocol"
  • Cosine learning rate annealing: A schedule that decreases the learning rate following a cosine curve to improve convergence. "with cosine learning rate annealing applied to improve convergence."
  • Discriminative VQA: A VQA setting where the model selects an answer from a set of candidates rather than generating free-form text. "VL-JEPA's embedding space facilitates discriminative VQA, open-vocabulary classification and text-to-video retrieval tasks using a single unified model architecture."
  • Embedding space: A continuous vector space where inputs like text or images are represented for learning and comparison. "the training objective is defined in the embedding space"
  • Exponential Moving Average (EMA): A technique that maintains a smoothed version of model parameters or representations over time. "using an exponential moving average (EMA) for the Y-Encoder"
  • Generative VLMs: Vision-language models that generate text outputs by predicting tokens in sequence. "Compared to the token-space loss used by generative VLMs, calculating the training loss in the embedding space is beneficial"
  • InfoNCE loss: A contrastive loss that encourages paired embeddings to be close while pushing apart negatives. "we adopt the InfoNCE loss \citep{radford2021learning} due to its maturity in the vision-language domain."
  • Joint Embedding Predictive Architecture (JEPA): A paradigm where models predict target embeddings directly rather than reconstructing raw data. "JEPA models typically optimize two objectives jointly: 1) prediction error in the embedding space, and 2) additional regularization that avoids representation collapse"
  • KV-cache optimizations: Inference techniques that cache key-value pairs from attention layers to speed up autoregressive decoding. "complex KV-cache optimizations \citep{di2025streaming} for efficiency"
  • Latent space: A learned continuous space capturing abstract features where targets are predicted and compared. "embeds the textual target into a continuous latent space"
  • Monosemanticity: The property of embeddings or segments having a single, consistent meaning. "high intra-segment monosemanticity"
  • Non-autoregressive: A modeling approach that produces outputs in parallel or without sequential dependency, reducing latency. "Thanks to its non-autoregressive nature, VL-JEPA can produce continuous streams of target semantic embeddings"
  • One-hot token space: A sparse representation where each token is encoded as a vector with a single 1 and zeros elsewhere. "In the raw one-hot token space, different plausible $Y$ outputs for the same input often appear nearly orthogonal"
  • Open-vocabulary classification: Classification that uses text embeddings for labels, enabling recognition beyond a fixed set of classes. "open-vocabulary classification, text-to-video retrieval, and discriminative VQA"
  • Pareto improvement: A scenario where a method achieves better performance without increasing cost, or the same performance with lower cost. "selective decoding achieves a Pareto improvement over uniform sampling"
  • Perception Encoder: A large-scale vision encoder backbone used for multimodal tasks. "Perception Encoder (PE-Core)"
  • Perplexity: A measure of uncertainty in language modeling; lower perplexity indicates better fit to a class description. "while for VLM we pick the class with lowest perplexity."
  • Predictor: The component that maps visual embeddings and a textual query to a predicted target embedding. "Predictor ($\langle S_V, X_Q\rangle \mapsto \hat{S}_Y$) is the core component of VL-JEPA."
  • Prompt ensembling: Averaging predictions over multiple prompts for the same class to improve zero-shot accuracy. "achieved 61.6% ImageNet zero-shot accuracy (without prompt ensembling)."
  • Query conditioning: Feeding the textual query into the model to condition predictions on the question or prompt. "Query conditioning is achieved by tokenizing and embedding the textual query"
  • Query-free pretraining: Pretraining the model without queries to learn general vision-language alignment from captions. "The first query-free pretraining stage aims to establish robust vision-language alignment"
  • Recall@1: A retrieval metric indicating the fraction of queries for which the top-ranked item is correct. "retrieval recall@1 (across 8 datasets)"
  • Representation alignment term: A loss component that pulls matched prediction and target embeddings closer together. "a representation alignment term that minimizes the distance between normalized prediction and target embeddings"
  • Representation collapse: When embeddings degenerate to a constant or trivial solution, losing discriminative power. "additional regularization that avoids representation collapse"
  • Selective decoding: Decoding text only when the semantic embedding stream changes significantly, reducing cost. "VL-JEPA, in contrast, natively supports selective decoding."
  • SIGReg: A non-contrastive regularization method used in JEPA to prevent collapse and improve representations. "such as VICReg \citep{bardes2021vicreg} and SIGReg \citep{balestriero2025lejepa}"
  • Sliding windows: Overlapping temporal windows used to process streaming data with minimal latency. "within sliding windows with minimal latency"
  • Socratic LLMs: LLMs used with external captions or prompts to reason about visual tasks. "Socratic LLMs (w/ Qwen2.5-VL-72B captions)"
  • Streaming VLMs: Vision-language models adapted for continuous video streams, often with memory mechanisms. "Standard streaming VLMs are limited to this strategy, whereas VL-JEPA supports a more effective alternative"
  • Temporal connectivity constraints: Restrictions that ensure clusters respect the temporal order when segmenting a sequence. "We apply agglomerative clustering with temporal connectivity constraints"
  • Token-space loss: A training objective that compares generated token sequences with ground-truth sequences. "Compared to the token-space loss used by generative VLMs"
  • VICReg: A non-contrastive regularization method that enforces variance, invariance, and covariance constraints. "such as VICReg \citep{bardes2021vicreg} and SIGReg \citep{balestriero2025lejepa}"
  • Vision Transformer (ViT): A transformer-based architecture for vision that processes images or videos as token sequences. "a Vision Transformer that outputs a sequence of visual tokens"
  • Vision-language model (VLM): A model that processes both visual and textual inputs to generate or select text outputs. "classical VLMs"
  • Visual Question Answering (VQA): A task where a model answers questions about visual content. "VQA capabilities"
  • Visual tokens: Discrete or continuous token-like representations of visual inputs used by transformers. "outputs a sequence of visual tokens"
  • Ward distance: A clustering linkage criterion based on variance minimization within clusters. "measured by variance (i.e., Ward distance)."
  • World modeling: Inferring actions or causes that explain transitions between visual world states. "the “world modeling” task in the WorldPrediction \citep{chen2025worldprediction} benchmark"
  • X-Encoder: The component that encodes visual inputs into visual embeddings. "X-Encoder ($X_V \mapsto S_V$) compresses high-volume visual inputs"
  • Y-Decoder: The component that converts predicted embeddings back into text at inference time. "Y-Decoder ($\hat{S}_Y \mapsto \hat{Y}$) is not involved during the main training phrase of VL-JEPA."
  • Y-Encoder: The component that embeds textual targets into the latent space to serve as prediction targets. "Y-Encoder ($Y \mapsto S_Y$) embeds the textual target into a continuous latent space"
  • Zero-shot: Evaluation or prediction without task-specific training, relying on generalizable representations. "higher performance on zero-shot captioning and classification"
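
As an illustration of the "Bi-directional InfoNCE loss" and "InfoNCE loss" entries above, here is a minimal sketch in plain Python. It assumes a batch of predicted embeddings and target embeddings where row i of each is a matched pair and all other rows serve as negatives. The temperature value 0.07 and all function names are assumptions for illustration, not values or code taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_info_nce(predictions, targets, temperature=0.07):
    """Average of prediction-to-target and target-to-prediction InfoNCE."""
    preds = [l2_normalize(p) for p in predictions]
    tgts = [l2_normalize(t) for t in targets]
    # logits[i][j]: scaled cosine similarity of prediction i and target j.
    logits = [[sum(a * b for a, b in zip(p, t)) / temperature for t in tgts]
              for p in preds]
    logits_t = [list(col) for col in zip(*logits)]  # target-to-prediction view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Matched pairs give a near-zero loss; swapped pairs give a large one.
aligned = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The same similarity matrix also underlies the "Open-vocabulary classification" and "Recall@1" entries: taking the arg-max over each row selects the closest label or retrieval candidate for a predicted embedding.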

Open Problems

We found no open problems mentioned in this paper.
