WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Published 23 Apr 2026 in cs.CV | (2604.21686v1)

Abstract: Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.


Authors (8)

Summary

The paper introduces WorldMark, a unified benchmark suite that standardizes evaluation protocols for interactive video world models.
It implements a modular action mapping interface and a comprehensive evaluation toolkit assessing visual quality, control alignment, and world consistency.
Empirical results reveal trade-offs between visual aesthetics and temporal/geometric coherence across state-of-the-art models in diverse scenarios.


Motivation and Problem Definition

The rapid evolution of interactive world modeling has produced fundamentally heterogeneous approaches, exemplified by models such as Genie 3, YUME 1.5, HY-World 1.5, and Matrix-Game 2.0, among others. A critical barrier for the field has been the absence of a unifying evaluation suite: each model introduces its own control scheme, evaluation scenes, and benchmarks, which precludes controlled, fair, and directly comparable evaluation. Existing public benchmarks (e.g., VBench, VBench++, WorldScore, MIND, PhyGenBench) lack a unified interactive action protocol, so even shared metrics are not comparable across models. The fragmentation is most pronounced in the control interfaces: some models use WASD-inspired key mappings, while others accept pose parameters, natural-language action prompts, or high-dimensional latent actions. As a result, a direct, apples-to-apples protocol for evaluating action and scene consistency, perceptual fidelity, and temporal/geometric coherence has not been possible.

WorldMark Benchmark Architecture

WorldMark introduces a comprehensive solution to benchmark standardization for interactive image-to-video (I2V) world models via three primary design pillars:

Standardized Test Suite: 500 controlled evaluation cases are curated, each combining a reference image with a contextually validated action sequence. The dataset spans first- and third-person views and both photorealistic and stylized domains, with three difficulty tiers covering basic, intermediate, and complex navigation/action patterns.
Unified Action Mapping Layer: A central interface defines a WASD+L/R action vocabulary, which is then adapted to each model’s native control format, so that every system receives semantically equivalent movement and camera control instructions regardless of its internal protocol.
Modular Evaluation Toolkit: Outputs are assessed along three axes (Visual Quality, Control Alignment, and World Consistency), covering both geometric/pose-based errors and automated vision-language model (VLM) judgments. The toolkit supports both the default metrics and user-supplied ones, so the suite can be extended as the field evolves.
Figure 1: The WorldMark framework resolves fragmentation in world model evaluation by standardizing datasets, action mappings, and evaluation protocols.

Dataset: Image and Action Suite Composition

WorldMark’s dataset construction begins with 50 visually diverse images sampled from WorldScore’s curated dataset, encompassing real, urban, indoor, natural, and stylized scenes (e.g., oil painting, Minecraft). Each image is used in both its original first-person form and a VLM-generated third-person version, yielding 100 test images.

Figure 2: Image Suite diversity covers both first-person and third-person perspectives, as well as stylized and photorealistic scenes.

Fifteen standardized action sequences (elementary translations, rotations, and cyclic/compound movements) form the Action Suite. They cover all common action primitives and are organized into three tiers of increasing complexity (easy, medium, hard). Each reference image is programmatically matched with a VLM-filtered subset of feasible actions to keep trajectories physically plausible (e.g., disallowing forward motion into a visible wall).

Figure 3: The 15 standardized action trajectories, including simple and compound motion patterns.

Figure 4: Context-aware action selection uses a VLM to filter physically implausible motion trajectories for each scene.
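
The pairing of reference images with feasible actions can be pictured as a small filtering step. The sketch below is illustrative only: the `Action`, `Scene`, and `EvalCase` types, the action names, and the `blocked` field are hypothetical, and a simple lookup stands in for the VLM feasibility check described above. It is not the benchmark's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str    # hypothetical identifier, e.g. "forward"
    keys: tuple  # canonical key sequence, e.g. ("W", "W", "W")
    tier: str    # "easy", "medium", or "hard"


@dataclass(frozen=True)
class Scene:
    image_id: str
    view: str                        # "first_person" or "third_person"
    blocked: frozenset = frozenset() # keys judged infeasible for this scene


@dataclass(frozen=True)
class EvalCase:
    scene: Scene
    action: Action


def is_feasible(scene, action):
    """Stand-in for the VLM check: reject any action that uses a blocked key."""
    return not any(key in scene.blocked for key in action.keys)


def build_cases(scenes, actions):
    """Pair every scene with the subset of actions that pass the check."""
    return [EvalCase(s, a) for s in scenes for a in actions if is_feasible(s, a)]


# Toy inputs, invented for illustration.
ACTIONS = [
    Action("forward", ("W", "W", "W"), "easy"),
    Action("turn_left", ("L", "L", "L"), "easy"),
    Action("forward_then_right", ("W", "W", "R", "R"), "medium"),
]
SCENES = [
    Scene("room_01", "first_person", frozenset({"W"})),  # wall directly ahead
    Scene("street_01", "first_person"),                  # nothing blocked
]

cases = build_cases(SCENES, ACTIONS)
for case in cases:
    print(case.scene.image_id, case.action.name, case.action.tier)
```

In this toy setup the scene facing a wall keeps only the turning action, while the open scene keeps all three, which mirrors the idea of giving each image its own subset of plausible trajectories.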

Unified Action Interface and Cross-Model Mapping

The shared action vocabulary and per-model mapping adapters canonicalize control across drastically different model APIs: text-driven, pose/6-DoF parameterized, gamepad emulation, one-hot action encodings, and high-dimensional continuous vectors. Per-model calibration preserves motion semantics despite differences in action scaling and response. Supporting a new architecture requires implementing only a single adapter module, which removes the engineering overhead of building a bespoke benchmark for each model.
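
One way to picture the adapter pattern is sketched below. It assumes that W/A/S/D denote translation and L/R denote camera turns. The class names, prompt phrases, sign conventions, and the `step` and `turn_deg` calibration parameters are all hypothetical; none of them is taken from WorldMark's code or from any model's real interface.

```python
from abc import ABC, abstractmethod

# Canonical vocabulary: move forward/left/back/right, turn left/right (assumed meaning).
CANONICAL_KEYS = ("W", "A", "S", "D", "L", "R")


def validate(keys):
    """Raise early if a sequence contains a key outside the shared vocabulary."""
    for key in keys:
        if key not in CANONICAL_KEYS:
            raise ValueError(f"unknown action key: {key!r}")


class ActionAdapter(ABC):
    """Translates canonical keys into one model's native control format."""

    @abstractmethod
    def translate(self, keys):
        ...


class TextPromptAdapter(ActionAdapter):
    """For a hypothetical model driven by natural-language action prompts."""

    PHRASES = {
        "W": "move forward", "S": "move backward",
        "A": "move left", "D": "move right",
        "L": "turn camera left", "R": "turn camera right",
    }

    def translate(self, keys):
        validate(keys)
        return [self.PHRASES[k] for k in keys]


class OneHotAdapter(ActionAdapter):
    """For a hypothetical model that takes one one-hot vector per step."""

    def translate(self, keys):
        validate(keys)
        return [[1 if k == c else 0 for c in CANONICAL_KEYS] for k in keys]


class PoseDeltaAdapter(ActionAdapter):
    """For a hypothetical pose-driven model; step and turn_deg are calibration knobs."""

    def __init__(self, step=0.1, turn_deg=5.0):
        self.step = step
        self.turn_deg = turn_deg

    def translate(self, keys):
        validate(keys)
        # Per-step (dx, dz, dyaw_deg) in the camera frame; signs are an assumption.
        table = {
            "W": (0.0, self.step, 0.0), "S": (0.0, -self.step, 0.0),
            "A": (-self.step, 0.0, 0.0), "D": (self.step, 0.0, 0.0),
            "L": (0.0, 0.0, self.turn_deg), "R": (0.0, 0.0, -self.turn_deg),
        }
        return [table[k] for k in keys]


sequence = ["W", "W", "R"]
for adapter in (TextPromptAdapter(), OneHotAdapter(), PoseDeltaAdapter(step=0.5)):
    print(type(adapter).__name__, adapter.translate(sequence))
```

Under this design the benchmark only ever emits canonical key sequences, and per-model scaling lives inside the adapter, so calibration changes do not touch the test suite.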

Evaluation Metrics: Multi-Dimensional and Human-Validated

The WorldMark evaluation suite adopts a multi-pronged approach:

Visual Quality: Frame-level aesthetic quality via LAION's aesthetic predictor, and low-level imaging quality (distortion) via MUSIQ.
Control Alignment: Ground-truth vs. predicted translation (3D Euclidean) and rotation (geodesic angular deviation) error analysis, leveraging DROID-SLAM for accurate pose extraction.
World Consistency: 3D geometric reprojection error, and three VLM-driven consistency signals: temporal subject stability, content (hallucination and disappearance avoidance), and style (global feature/stylistic drift).
Figure 5: Qualitative contrasts in Visual Quality, Control Alignment, and World Consistency, with explicit failure and success cases illustrated.
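
The two Control Alignment errors have standard closed forms: translation error is the Euclidean distance between ground-truth and predicted camera positions, and the geodesic rotation error between rotation matrices is arccos((tr(R_gt^T R_pred) - 1) / 2). The sketch below computes both in plain Python. The function names are hypothetical, poses are assumed to be already extracted and expressed in a common frame and scale, and averaging over frames is an assumption about how per-frame errors are aggregated.

```python
import math


def translation_error(t_gt, t_pred):
    """Euclidean distance between two 3D camera positions."""
    return math.dist(t_gt, t_pred)


def rotation_error_deg(R_gt, R_pred):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_gt^T R_pred) equals the sum of elementwise products.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def mean_pose_errors(gt_poses, pred_poses):
    """Average both errors over paired (position, rotation) frames."""
    n = len(gt_poses)
    t_err = sum(translation_error(g[0], p[0]) for g, p in zip(gt_poses, pred_poses)) / n
    r_err = sum(rotation_error_deg(g[1], p[1]) for g, p in zip(gt_poses, pred_poses)) / n
    return t_err, r_err


def yaw_matrix(deg):
    """Rotation about the vertical (y) axis, used only to build toy inputs."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Toy trajectory, invented for illustration.
gt = [((0.0, 0.0, 0.0), yaw_matrix(0.0)), ((0.0, 0.0, 1.0), yaw_matrix(10.0))]
pred = [((0.0, 0.0, 0.0), yaw_matrix(0.0)), ((0.0, 0.0, 0.8), yaw_matrix(16.0))]
print(mean_pose_errors(gt, pred))
```

The geodesic angle is independent of the choice of rotation axis, which makes it a convenient single number for comparing camera orientations.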

Metric alignment with human perception is validated empirically: Spearman correlations above 0.9 are reported between the VLM-based scores and manual user rankings, indicating that the automated scores order outputs much as human raters do.

Figure 6: High Spearman correlation values confirm that automated metrics closely match human preferences for world consistency evaluation.
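
For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the two rank vectors, so it measures agreement in ordering rather than in raw values. The sketch below is a minimal pure-Python version with average ranks for ties. The score lists are invented for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: automated scores and human preference scores for five outputs.
vlm_scores = [0.91, 0.75, 0.62, 0.40, 0.33]
human_scores = [5, 4, 2, 3, 1]
print(round(spearman(vlm_scores, human_scores), 3))
```

In the toy data the two lists disagree only on the order of the third and fourth items, so the correlation stays high; a perfectly matching order would give exactly 1.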

Empirical Results and Model Landscape Analysis

Benchmarking six state-of-the-art models (YUME 1.5, Matrix-Game 2.0, HY-World 1.5, HY-GameCraft, Open-Oasis, Genie 3) reveals multiple critical findings:

Visual Quality vs. Consistency Decoupling: The model producing the highest aesthetic and imaging scores (YUME 1.5) often ranks poorly in long-range temporal/3D consistency, while Genie 3, a proprietary model, dominates spatial consistency with only moderate per-frame visual quality.
Control Alignment Limitation: Strong compliance with action trajectories (e.g., the lowest translation/rotation error, achieved by HY-GameCraft) does not guarantee superior world simulation or visual quality. Conversely, Genie 3 exhibits moderate pose errors but achieves the highest global scene consistency.
Third-Person Challenge: Transitioning from first- to third-person scenarios severely degrades camera control (an order-of-magnitude increase in rotation error in Matrix-Game 2.0), indicating current models are not architecturally robust to viewpoint shift.
Domain Transfer Failure: Models specialized for a narrow visual style (Open-Oasis trained on Minecraft) catastrophically fail on out-of-domain data, highlighting lack of generalization.
Stylization Effects: Stylized scenes marginally increase aesthetic appeal in some cases but exacerbate control errors—posing open questions regarding the sensitivity of current architectures to visual distribution shift.

Implications and Future Directions

WorldMark systematically exposes a gap between visual/language-driven generation and robust, coherent world modeling, which suggests that further architectural and training-objective innovations are needed before interactive world models generalize well. The modular, extensible evaluation design makes it straightforward to run ablations across new control paradigms, action complexities, and perceptual evaluation tools (including next-generation VLMs). Several directions follow directly:

Improved architectures for third-person and multi-view consistency.
Objective design separating control alignment from world modeling capability.
Domain-generalization protocols and curriculum for stylized/out-of-distribution settings.
More granular, human-grounded, and open-ended evaluation via the World Model Arena online platform.

Conclusion

WorldMark presents itself as the first standardized, extensible benchmarking protocol, dataset, and toolchain for interactive I2V world models. By decoupling evaluation infrastructure from model implementation idiosyncrasies, it enables direct, meaningful comparison on the core world modeling problem. The authors state that all data, evaluation code, and model outputs will be released, so integrating a future model should require little more than a mapping adapter. The analysis highlights critical open challenges: achieving both photorealism and deep geometric-temporal coherence, robust camera and action control in third-person and stylized scenarios, and the need for broader evaluation than frame-level quality metrics alone.
