InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Published 9 Apr 2026 in cs.CV and cs.AI | (2604.08337v1)

Abstract: Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.


Authors (8)

Summary

The paper introduces an instance-aware pre-training framework that integrates both global and instance-level alignment to improve spatial-temporal understanding.
It employs a teacher-student Vision Transformer with cross-attention mechanisms to fuse cropped instance features with full-context representations.
Experiments show gains over global-only methods in both retrieval and grounding, e.g., a T2V R@1 of 60.63 and an IoU@90 of 25.13 on InstVL-1K (video).

Instance-Aware Vision-Language Pre-Training for Spatial-Temporal Understanding: An Analysis of InstAP

Introduction and Motivation

Despite significant advances in vision-language pre-training (VLP), most approaches focus on coarse, global video-text alignment, which impedes downstream tasks that demand precise grounding of textual mentions in spatial-temporal visual evidence. This limits VLP models in retrieval, video question answering, and spatial-temporal grounding, where disambiguating and localizing individual instances is essential. The studied paper, "InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding" (2604.08337), directly addresses this bottleneck by introducing an instance-level alignment paradigm and a supporting large-scale dataset with dual-granularity annotations (InstVL).

InstVL Dataset: Design and Characteristics

The InstVL dataset is a novel contribution, comprising 2M images and 50K videos, developed to enable instance-aware VLP at scale. Each visual example is annotated with a holistic scene caption and multiple fine-grained descriptions, each grounded to a specific spatial region or temporal trajectory. These instance annotations are generated using open-vocabulary detection (GroundingDINO), tracking (SAM2), and large vision-language models, with iterative manual quality control. Compared to Visual Genome or existing video datasets, InstVL provides more linguistically diverse and temporally consistent instance grounding, addressing the static, template-driven, or predicate-limited nature of prior resources.
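To make the dual-granularity idea concrete, the sketch below shows one way such a record could be represented. It is not the released InstVL schema: the class names `Sample` and `InstanceAnnotation`, every field name, and the pixel-coordinate box convention are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class InstanceAnnotation:
    """One grounded instance description (hypothetical schema)."""
    caption: str
    # frame index -> box; one entry for an image, several for a video trajectory
    trajectory: Dict[int, Box]


@dataclass
class Sample:
    """One image or video with a scene caption and grounded instances."""
    media_path: str
    global_caption: str
    instances: List[InstanceAnnotation] = field(default_factory=list)

    def instance_pairs(self) -> Iterator[Tuple[str, Dict[int, Box]]]:
        """Yield (caption, trajectory) pairs for instance-level alignment."""
        for inst in self.instances:
            yield inst.caption, inst.trajectory
```

A record built this way carries both supervision signals at once: `global_caption` pairs with the whole clip, while each element of `instances` pairs a phrase with a region or trajectory.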

Figure 1: Illustration of the InstVL dataset—sampled frames with color-coded, temporally-consistent instance trajectories and associated instance/global captions.

The test suite for InstVL is carefully partitioned, including held-out splits with distribution shifts (sourced from COYO), ensuring robust generalization evaluation and preventing test-set leakage.

InstAP Framework and Methodological Innovations

InstAP differs from global-only VLP in its training objectives and in a small set of architectural additions that make the alignment instance-aware. The framework incorporates:

Self-Supervised Masked Video Modeling: The Vision Transformer (ViT-L) is initialized with a teacher-student scheme that uses high-level semantic regression on attention-guided, visible tokens. This choice is reported to yield strong spatial-temporal features suitable for cross-modal alignment.
Instance-Aware Alignment Objectives: Beyond the global video-text contrastive and matching losses, InstAP adds instance-level counterparts. For each detected spatio-temporal instance and corresponding caption, cropped video features are cross-attended with the full context, producing instance-aware embeddings. These are contrasted with instance text embeddings, with specific masking of intra-video negatives to mitigate false negatives inherent to dense annotation.

Figure 2: The instance-aware alignment mechanism fuses trajectory-based instance features with global context, enforcing fine-grained contrastive alignment.
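The masking of intra-video negatives can be illustrated with a small contrastive loss. The sketch below is an assumption-laden simplification, not the paper's implementation: it covers only the text-to-instance direction, takes a precomputed similarity matrix instead of running the cross-attention fusion, and the function name `masked_info_nce` and the default temperature of 0.07 are hypothetical.

```python
import math


def masked_info_nce(sim, video_ids, temperature=0.07):
    """Text-to-instance InfoNCE that drops negatives from the same video.

    sim[i][j]    : similarity between instance text i and instance embedding j
    video_ids[j] : source video of instance j; pair (i, i) is the positive
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            same_video = video_ids[i] == video_ids[j]
            if j != i and same_video:
                # Another instance of the same video may match this caption,
                # so it is excluded instead of being treated as a negative.
                continue
            logits.append((sim[i][j] / temperature, j == i))
        # Numerically stable log-sum-exp over the surviving candidates.
        m = max(l for l, _ in logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l, _ in logits))
        pos = next(l for l, is_pos in logits if is_pos)
        total += log_denom - pos
    return total / n
```

With two similar instances drawn from one video, masking removes the penalty that a plain InfoNCE loss would otherwise assign to their mutual similarity, which is the false-negative problem the dense annotations create.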

The final joint loss combines masked video reconstruction, global alignment, and the instance-aware objectives, with the relative loss weights set empirically.
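A joint objective of this kind is typically a weighted sum. The form below is an illustrative assumption: only $\mathcal{L}_{\mathrm{inst}}$ is named in this analysis, while the symbols $\mathcal{L}_{\mathrm{mvm}}$ and $\mathcal{L}_{\mathrm{global}}$ and the weights $\lambda_{\mathrm{global}}$ and $\lambda_{\mathrm{inst}}$ are placeholders whose values and exact decomposition are not given here.

$$
\mathcal{L} = \mathcal{L}_{\mathrm{mvm}} + \lambda_{\mathrm{global}}\,\mathcal{L}_{\mathrm{global}} + \lambda_{\mathrm{inst}}\,\mathcal{L}_{\mathrm{inst}}
$$

Under this reading, $\mathcal{L}_{\mathrm{global}}$ would collect the global video-text contrastive and matching terms, and $\mathcal{L}_{\mathrm{inst}}$ their instance-level counterparts.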

Experimental Results

Retrieval Benchmarks:

Comparisons on the InstVL benchmark show InstAP ahead on instance retrieval across all splits. For example, in instance-level retrieval on InstVL-1K (video), InstAP attains a text-to-video (T2V) R@1 of 60.63, well ahead of all baselines, including controlled UMT-L variants trained on identical data. These gains persist on the distribution-shifted splits (img-zero), which points to strong generalization and compositionality.
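For readers unfamiliar with the metric, R@K is the percentage of queries whose ground-truth item appears among the top K retrieved candidates. A minimal sketch follows, assuming query i's ground truth is candidate i and that ties are resolved in favor of the ground truth; the function name `recall_at_k` is illustrative and not taken from the paper's code.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground truth ranks in the top k.

    sim[i][j] is the similarity of query i to candidate j;
    candidate i is assumed to be the ground truth for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)
```

For T2V retrieval the rows are text queries and the columns are videos; transposing the matrix gives the video-to-text direction.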

Zero-Shot Generalization:

InstAP achieves competitive zero-shot results on standard video benchmarks, e.g., MSR-VTT (41.1 R@1) and DiDeMo (54.0 R@1). Unlike naive fine-tuning of existing models on the InstVL corpus, which often degrades global retrieval metrics due to interference or overfitting, instance-aware alignment improves both instance-level and global scene understanding.

Spatial-Temporal Grounding:

With a lightweight MLP head added for spatial localization, InstAP-pretrained features support spatial-temporal grounding and yield clear gains over global-only baselines under stringent IoU metrics. For instance, InstAP reaches an IoU@90 of 25.13 on InstVL-1K (video), nearly 2x the global-only baseline.
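The strictness of a threshold such as IoU@90 is easier to appreciate with the computation spelled out. The sketch below assumes that IoU@90 means the percentage of predicted boxes whose IoU with the ground truth is at least 0.9, scored per box; how the paper aggregates over the frames of a trajectory is not stated in this analysis, and the names `box_iou` and `iou_at` are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def iou_at(preds, gts, threshold):
    """Percentage of predictions whose IoU with the ground truth >= threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= threshold)
    return 100.0 * hits / len(preds)
```

A prediction shifted by half a box width against an equally sized ground truth already falls to an IoU of 1/3, so clearing 0.9 requires near-exact localization.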

Qualitative Analysis:

Visualizations of model attention and retrieved regions confirm that InstAP's representations are tightly aligned with linguistically relevant instances, while global-only baselines often exhibit diffuse or erroneous attention.

Figure 3: InstAP focuses attention on caption-relevant regions (e.g., localized license plate), unlike the global baseline's diffuse focus.

Figure 4: Instance-aware pre-training enables correct retrieval of fine-grained descriptions where global baselines fail due to distractors.

Ablation Studies:

The ablations identify the instance-aware loss ($\mathcal{L}_{\mathrm{inst}}$) as the dominant factor, accounting for gains of +13.96 to +17.61 in mean recall on instance-level retrieval, while a learnable temperature, instance loss weighting, and explicit use of trajectories further improve performance.

Theoretical and Practical Implications

Theoretical: InstAP supports the view that instance-level contrastive pre-training, when conducted end-to-end and not as an auxiliary head or post-hoc graft, is important for building VLP encoders that generalize both holistically and at the entity level. The finding that instance-aware pre-training improves global understanding as well as instance retrieval challenges the assumption that detail and context must be traded off against each other.

Practical: The availability of InstVL and the paradigm instantiated by InstAP have significant implications for tasks such as compositional video retrieval, dense video captioning, VQA, and open-world object/scene understanding. The findings suggest that downstream specialization and fine-tuning can be replaced or augmented with better pre-training objectives and resources. The architectural simplicity—instance-awareness embedded into a shared encoder via cross-attention—facilitates practical adoption and extensibility.

Future Directions

Future research directions may include: scaling InstVL further; integrating open-vocabulary, dense-phrase grounding for all entities and actions; augmenting the fusion transformer for denser temporal reasoning and causal question answering; and exploring the interaction of instance-level and global alignment via more complex multi-task objectives. Additionally, adapting instance-aware paradigms to multi-modal and multi-lingual video-text modeling remains to be explored.

Conclusion

InstAP demonstrates that integrating instance-aware, spatial-temporal contrastive alignment into large-scale VLP not only overcomes the instance grounding limitations of prior global methods, but also enhances global representations. With its dual-granularity dataset and unified training pipeline, InstAP advances the frontier of representation learning for video-language understanding, enabling robust performance in both holistic and fine-grained downstream applications.

