- The paper introduces CollabVR, a closed-loop framework combining vision-language and video generation models to enhance step-level video reasoning for complex tasks.
- It employs progressive planning where a VLM verifies each video clip and repairs errors to mitigate long-horizon drift and simulation failures.
- Experimental results on Gen-ViRe and VBVR-Bench demonstrate notable gains in accuracy and efficiency compared to conventional single-shot and resampling methods.
Step-Level Collaborative Video Reasoning with Vision-Language and Video Generation Models
Motivation and Framework
CollabVR introduces a closed-loop orchestration framework for goal-directed visual reasoning that combines the complementary strengths of Vision-Language Models (VLMs) and Video Generation Models (VGMs) (2605.08735). VLMs are proficient at logical decomposition, plan synthesis, and task verification, but lack robust direct visual simulation capabilities. VGMs excel at high-fidelity short-horizon visual synthesis but cannot reliably plan over long horizons or enforce global logical consistency. CollabVR couples a VLM and a VGM in a step-level closed loop: the VLM plans the next action, inspects each generated clip, and repairs errors by folding diagnostic feedback into subsequent prompts. This progressive, verifier-guided loop targets two dominant failure modes of modern VGMs, long-horizon drift and mid-clip simulation errors, both of which arise because VGM rollouts lack an explicit reasoning process.
Pipeline Architecture
CollabVR comprises two principal modules:
- VLM-Driven Progressive Planning: Instead of an upfront fixed decomposition, the VLM adaptively determines the number of sub-steps and emits only the immediate next action, conditioned on previously generated frames. This mitigates long-horizon drift by enabling state-adaptive plan depth, improving both efficiency and fidelity on complex multi-step tasks where single-shot generation fails.
- VLM-VGM Collaborative Reasoning (Verification and Recovery): The VLM verifier judges each generated clip, localizes failure, and proposes actionable prompt revisions. The framework explicitly repairs execution errors at the granularity where VLMs are most reliable, containing local simulation failures before they propagate.
The closed-loop design resembles algorithmic construction: the desired video reasoning trajectory is built incrementally, each sub-action verified and corrected, rather than passively sampled from the VGM's distribution.
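The plan, generate, verify, and repair cycle described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the function names (`collab_loop`, `plan_next`, `generate`, `verify`), the `Verdict` type, the retry budget, and the toy stand-ins for the VLM and VGM are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Hypothetical verifier output: pass/fail plus diagnostic feedback."""
    ok: bool
    feedback: str = ""


def collab_loop(
    goal: str,
    plan_next: Callable[[str, List[str]], Optional[str]],  # VLM planner stand-in
    generate: Callable[[str, List[str]], str],             # VGM stand-in
    verify: Callable[[str, str], Verdict],                 # VLM verifier stand-in
    max_steps: int = 8,
    max_repairs: int = 2,
) -> List[str]:
    """Build a clip sequence step by step, repairing each clip before moving on."""
    clips: List[str] = []
    for _ in range(max_steps):
        # Progressive planning: only the immediate next action is emitted,
        # conditioned on the clips accepted so far.
        action = plan_next(goal, clips)
        if action is None:  # planner signals that the goal is reached
            break
        prompt = action
        for _ in range(max_repairs + 1):
            clip = generate(prompt, clips)
            verdict = verify(action, clip)
            if verdict.ok:
                break
            # Recovery: fold the diagnostic feedback into the next prompt.
            prompt = f"{action} (avoid: {verdict.feedback})"
        # The last attempt is kept even if the repair budget ran out.
        clips.append(clip)
    return clips


# Toy stand-ins so the sketch runs without any real model.
def toy_plan(goal: str, clips: List[str]) -> Optional[str]:
    steps = ["pick up block", "place block on tower"]
    return steps[len(clips)] if len(clips) < len(steps) else None


def toy_generate(prompt: str, clips: List[str]) -> str:
    return f"clip<{prompt}>"


def toy_verify(action: str, clip: str) -> Verdict:
    # The second step fails until the prompt carries repair feedback.
    if action.startswith("place") and "avoid" not in clip:
        return Verdict(False, "block misses tower")
    return Verdict(True)


clips = collab_loop("build tower", toy_plan, toy_generate, toy_verify)
print(clips)
```

In this toy run the first step passes on its first attempt, while the second step is regenerated once with the verifier's feedback folded into the prompt. Only the failing step is regenerated; the earlier accepted clip is left untouched.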
Experimental Evaluation
CollabVR is benchmarked on Gen-ViRe and VBVR-Bench:
- Gen-ViRe: Rubric-based evaluation emphasizes task correctness across diverse categories (algorithmic, planning, spatial, perceptual).
- VBVR-Bench: Deterministic rule-based evaluation with controlled ground-truth references, spanning in-domain and out-of-domain splits.
The setup includes both open-source (e.g., VBVR-Wan2.2, Cosmos-Predict-2.5) and closed-source (e.g., Veo 3.1) VGMs, and several VLM variants (Gemini 2.5 Pro, Qwen3.5-27B, Qwen3.5-9B).
Test-Time Scaling and Baseline Comparisons
CollabVR outperforms established test-time scaling methods:
- Single-shot inference is limited by the VGM's inability to handle multi-step tasks or propagate corrections.
- Pass@k resampling (k=2,4) yields diminishing returns; simply sampling more outputs does not suffice for task correctness, as correct trajectories often reside outside the generator's distribution.
- VideoTPO (iterative full-video prompt refinement) fails to repair task-level decompositional failures.
CollabVR consistently achieves higher task accuracy at lower or comparable compute cost. On VBVR-Wan2.2 (open-source), Gen-ViRe scores increased from 0.391 (Pass@1) to 0.531 (+0.140), and on Veo 3.1 (closed-source) from 0.481 to 0.550 (+0.069). Gains are most pronounced on categories with long-horizon planning (planning, spatial, algorithmic) and persist even when the VGM is fine-tuned for reasoning.
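The diminishing returns of Pass@k resampling can be illustrated with an idealized model. Assuming each sample succeeds independently with probability p (a simplification introduced here, not the benchmark's scoring procedure), the chance that at least one of k samples succeeds is 1 - (1 - p)^k. The function name `pass_at_k` and the example probabilities are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return 1.0 - (1.0 - p) ** k


# Illustrative per-sample success rates only; these are not results from the paper.
for p in (0.0, 0.3):
    curve = [round(pass_at_k(p, k), 3) for k in (1, 2, 4)]
    print(f"p={p}: pass@1,2,4 = {curve}")
```

Under this model each doubling of k buys a smaller improvement, and when p is zero (the correct trajectory lies outside the generator's distribution) no amount of resampling helps.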
Module Ablations
Systematic ablations reveal complementary module effects:
- Progressive planning (M1) is more effective for multi-step tasks (Gen-ViRe), driving accurate decomposition.
- Verification and recovery (M2) dominates on single-step or atomic transformations (VBVR-Bench), efficiently correcting localized failures.
Combining the two modules yields gains across all task categories, with each module activating according to task complexity. The framework thus adapts its mode of intervention (decomposition vs. recovery) to the demands of each reasoning instance.
VLM Supervision Reliability
Human-annotated benchmarks validate the VLM as a reliable step-level supervisor:
- Plan depth, clip-level verification, and evolution judgments from Gemini 2.5 Pro align closely with expert human annotators.
- Lower-capability VLMs degrade gracefully; even small open-source models paired with CollabVR outperform more powerful baselines operating without step-level supervision.
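One common way to quantify how closely a VLM verifier tracks human annotators is a chance-corrected agreement score such as Cohen's kappa. The source does not state which statistic was used, so the sketch below is an assumption, and the labels are made-up toy data rather than results from the study.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy clip-level verdicts (hypothetical data).
vlm_labels = ["pass", "fail", "pass", "pass", "fail", "pass"]
human_labels = ["pass", "fail", "pass", "fail", "fail", "pass"]
kappa = cohen_kappa(vlm_labels, human_labels)
print(round(kappa, 3))
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.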
Limitations remain where the VLM fails to detect intricate issues (symbolic, knowledge-based transformations), or the VGM lacks per-step instruction-following fidelity, compounding simulation errors.
Practical and Theoretical Implications
CollabVR demonstrates that explicit test-time orchestration (progressive planning, verification, and prompt evolution) can substantially improve goal-directed video reasoning without additional model training. This approach:
- Allows flexible composition with any off-the-shelf VGM/VLM, agnostic to proprietary or open-source status.
- Efficiently utilizes inference compute, targeting failed suffixes and sub-actions rather than resampling entire trajectories.
- Yields outputs that are interpretable and auditable, with discrete reasoning steps and failures exposed.
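The compute argument in the list above can be made concrete with a back-of-the-envelope model. Assume an n-step task where every clip passes verification independently with probability q, a perfect verifier, and unlimited retries. These are all simplifying assumptions introduced here, as are the function names. Retrying only the failed step then costs n / q clips in expectation, while regenerating the whole trajectory until every step passes costs n / q^n.

```python
def expected_clips_step_level(n: int, q: float) -> float:
    """Expected clip generations when only the failing step is retried."""
    # Each step needs 1/q attempts on average (geometric distribution).
    return n / q


def expected_clips_full_resample(n: int, q: float) -> float:
    """Expected clip generations when the whole trajectory is resampled."""
    # A full trajectory succeeds with probability q**n, and each try costs n clips.
    return n / (q ** n)


n_steps, step_success = 4, 0.8  # illustrative values, not taken from the paper
step_cost = expected_clips_step_level(n_steps, step_success)
full_cost = expected_clips_full_resample(n_steps, step_success)
print(step_cost, full_cost)
```

In this idealized setting the gap widens quickly as the number of steps grows, which is consistent with the claim that step-level repair uses inference compute more efficiently than resampling entire trajectories.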
The residual shortcomings elucidate directions for future research:
- Training VGMs with stronger process-aware or reasoning-grounded objectives, especially for symbolic and knowledge-driven categories.
- Enhancing VLM grounding and reliability, including more granular failure localization and repair strategies.
- Extending step-level orchestration to multimodal or agentic tasks, integrating specialized perception and reasoning modules.
The paper advocates a shift from monolithic, ever-larger generators to collaborative, inference-time composition across specialized models, with CollabVR serving as one instantiation of this design pattern.
Conclusion
CollabVR advances the state of collaborative video reasoning by coupling VLMs and VGMs in a step-level closed loop. The framework progressively plans, verifies, and repairs each action, consistently improving visual reasoning fidelity on challenging benchmarks. Human-annotated studies confirm the reliability of VLM supervision, and ablations clarify the interplay between decomposition and recovery. While orchestration cannot overcome the underlying limitations of the generator, CollabVR sets a template for future inference-time integration of reasoning and simulation modules, pointing towards more adaptive, modular, and interpretable AI systems for complex, temporally grounded tasks.