- The paper introduces TripVVT-10K, the first large-scale triplet dataset for in-the-wild video virtual try-on.
- It employs a diffusion transformer with a coarse human mask to enhance background fidelity and maintain temporal coherence.
- TripVVT achieves superior performance, with a 91% Gemini-SR and strong perceptual quality across diverse, challenging benchmarks.
TripVVT: A Large-Scale Dataset and Coarse-Mask Baseline for Video Virtual Try-On in the Wild
Overview and Motivation
"TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On" (2604.27958) targets a core bottleneck in video virtual try-on (VVT): the absence of a diverse, large-scale, triplet-based in-the-wild dataset, together with the limitations of the mask priors used for garment region localization. Diffusion-based video generation methods demand high-quality, temporally aligned supervision, which prior image and video datasets do not adequately provide: they rely on pairwise supervision or synthetic environments, or they lack garment diversity and scene complexity. This gap hurts generalization to unconstrained, real-world scenarios and limits practical deployment in e-commerce, content creation, and virtual reality.
The paper’s principal contributions are: (1) TripVVT-10K, the first large-scale, in-the-wild triplet video dataset for virtual try-on, (2) TripVVT, a Diffusion Transformer (DiT)-based baseline that uses a robust, coarse human mask rather than fragile garment masks to balance background preservation and spatial robustness, and (3) TripVVT-Bench, a 100-case, multi-faceted evaluation benchmark for rigorous, standardized assessment in challenging real-world settings.
Dataset Construction and Distinctiveness
TripVVT-10K comprises 10,031 high-resolution video triplets (original video, garment reference, try-on video), spanning 30 garment categories (upper, lower, and full-body), collected and preprocessed from diverse, complex, in-the-wild scenarios. The pipeline combines web scraping (~20K raw human video clips), normalization and quality filtering, and a synthesis protocol that uses state-of-the-art generative models such as Nano Banana and Wan-Animate for garment swapping and video inpainting.
A critical design is the reverse-training paradigm: the synthesized garment-swapped version becomes the model input ("original video"), while the authentic, unmodified sequence is the supervision target ("try-on video"). Garment reference images are generated by isolating and reconstructing clothing from the person’s initial frame using robust foreground extraction. The dataset is further augmented with reformatting pipelines for existing image/video try-on datasets to provide additional training diversity, though only native TripVVT-10K data appears in the main benchmark/test protocols.
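The role swap in the reverse-training paradigm can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Triplet` class, the `build_reverse_triplet` helper, and the file names are hypothetical, and the sketch assumes the garment reference is extracted from the authentic clip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One training sample; field names mirror the triplet roles in the text."""
    original_video: str      # model input
    garment_reference: str   # garment image used as conditioning
    tryon_video: str         # supervision target


def build_reverse_triplet(real_clip: str, swapped_clip: str, garment_image: str) -> Triplet:
    """Assign roles under reverse training.

    real_clip:     authentic, unmodified footage (hypothetical path)
    swapped_clip:  synthesized garment-swapped version of real_clip
    garment_image: garment reference, assumed here to come from real_clip
    """
    # The synthetic clip is the input and the real clip is the target,
    # so the model is always supervised by authentic pixels.
    return Triplet(
        original_video=swapped_clip,
        garment_reference=garment_image,
        tryon_video=real_clip,
    )


sample = build_reverse_triplet("real.mp4", "swapped.mp4", "garment.png")
```

Under this assignment, any artifacts introduced by the generative swap stay on the input side, while the target remains real footage.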
The auxiliary modalities provided per triplet (temporally consistent human-mask video, dense pose sequences, extracted garment line maps, and text prompts) are tightly coupled to the model’s conditioning strategy and promote strong multi-modal control without the fine-grained mask errors that plague prior pipelines.
Model Architecture and Training
The TripVVT model is built on a DiT backbone, repurposed from MagicTryOn to process multi-modal streams: video features (from a VAE encoder), a resize-aligned human mask, a reference garment encoding, pose sequences, and textual scene prompts.
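One way to picture the "resize-aligned" mask stream is the NumPy sketch below, which resizes a per-frame human mask to the latent grid and appends it as an extra channel. Everything here is an assumption for illustration: the function names are hypothetical, channel-wise concatenation is only one possible fusion scheme, and the sketch assumes the latents keep one entry per frame, which a real video VAE may not do.

```python
import numpy as np


def resize_mask_nearest(mask: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of a (T, H, W) mask to (T, out_h, out_w)."""
    _, h, w = mask.shape
    rows = (np.arange(out_h) * h) // out_h   # source row for each output row
    cols = (np.arange(out_w) * w) // out_w   # source column for each output column
    return mask[:, rows[:, None], cols[None, :]]


def stack_conditions(latents: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Append the resized mask as one extra channel.

    latents: (T, C, h, w) video features, e.g. from a VAE encoder
    mask:    (T, H, W) human mask at pixel resolution
    returns: (T, C + 1, h, w)
    """
    _, _, h, w = latents.shape
    m = resize_mask_nearest(mask, h, w).astype(latents.dtype)
    return np.concatenate([latents, m[:, None]], axis=1)
```

Pose sequences could be aligned and appended the same way; garment and text conditioning would more plausibly enter through separate token streams, which this sketch does not model.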
The methodological pivot is the substitution of fragile, pixel-accurate garment masks with a robust—yet spatially less restrictive—human mask as a prior. This enables the architecture to restrict edits to the person region, strongly preserving background fidelity in unconstrained scenarios, while avoiding garment mask errors in cases of occlusion, pose changes, and difficult lighting. Auxiliary pose guidance further ensures temporally consistent, physically plausible garment animation.
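The background-preserving effect of a coarse human mask can be illustrated with a small NumPy sketch. It is not the paper's mechanism: TripVVT feeds the mask to the model as a conditioning signal, whereas this sketch applies it as a hard compositing step purely to show which pixels a person-region constraint leaves untouched. The `coarsen_mask` and `composite` helpers and the dilation radius are assumptions.

```python
import numpy as np


def coarsen_mask(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Dilate a binary (H, W) mask with a square window of the given radius."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant")
    out = np.zeros((h, w), dtype=bool)
    # A pixel is set if any pixel in its (2r+1) x (2r+1) neighbourhood is set.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def composite(source: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep `source` outside the mask and `generated` inside it.

    source, generated: (T, H, W, C) float arrays
    mask:              (T, H, W) boolean person-region mask
    """
    m = mask[..., None].astype(source.dtype)
    return m * generated + (1.0 - m) * source
```

Because the mask only needs to cover the person, a loose dilation is harmless for this purpose, whereas a garment mask must track cloth boundaries through occlusion and pose changes.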
Training leverages the fully supervised triplets, with the coarse mask acting as a soft constraint. The model operates in an end-to-end manner, supporting robust garment transfer and human motion tracking in unconstrained video. Ablation experiments confirm that removing the human mask, pose guidance, or training without TripVVT-10K data sharply degrades performance, underscoring the necessity of each component.
Benchmark and Evaluation Protocol
TripVVT-Bench, the new evaluation suite, comprises 100 held-out triplets representing diverse, high-complexity scenes, including crowded, dynamic, low-light, and multi-person scenarios. The benchmark protocol covers four evaluation dimensions:
- Video Quality: Evaluated via VFID (with I3D and ResNeXt backbones), SSIM, and LPIPS.
- Clothing Fidelity: Assessed using CLIP-I and Gemini-2.5-Flash-based semantic similarity.
- Background Consistency: Pixel-level (BG-L1-Err) and perceptual (BG-DINO-Err) background divergence outside garment regions.
- Temporal Consistency: CLIP-F similarity between adjacent frames.
The use of authentic ground-truth supervision enables reference-based, objective evaluation not possible with weaker, pseudo-paired datasets.
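Two of the metrics listed above have simple reference-based forms, sketched here in NumPy. These are plausible readings rather than the benchmark's exact definitions: the function names, the mask convention, and the normalization are assumptions, and the per-frame embeddings are taken as given (for CLIP-F they would come from a CLIP image encoder, which is not loaded here).

```python
import numpy as np


def bg_l1_err(pred: np.ndarray, ref: np.ndarray, edit_mask: np.ndarray) -> float:
    """Mean absolute error over pixels outside the editable region.

    pred, ref: (T, H, W, C) videos with values in [0, 1]
    edit_mask: (T, H, W) boolean, True where editing is allowed
    """
    background = ~edit_mask
    if not background.any():
        raise ValueError("mask covers every pixel; no background to compare")
    per_pixel = np.abs(pred - ref).mean(axis=-1)   # average over channels
    return float(per_pixel[background].mean())


def adjacent_frame_similarity(embeddings: np.ndarray) -> float:
    """Mean cosine similarity between consecutive (T, D) frame embeddings."""
    if embeddings.shape[0] < 2:
        raise ValueError("need at least two frames")
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return float((unit[:-1] * unit[1:]).sum(axis=1).mean())
```

For the background error lower is better, and for the adjacent-frame similarity higher is better, so the two should not be read on the same scale.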
Experimental Results and Ablation
Quantitative Results:
On both the established ViViD-S test set and the more challenging TripVVT-Bench, TripVVT achieves the strongest overall performance in video quality (lowest VFID), try-on garment fidelity (highest CLIP-I, Gemini-SR), and temporal coherence, outperforming both state-of-the-art academic competitors (ViViD, CatV2TON, MagicTryOn) and the commercial video editing tool Kling 1.6. Notably, the model achieves a Gemini-SR (semantic try-on success rate) of 91% on TripVVT-Bench, a substantial improvement over Kling (78%) and MagicTryOn (43%). Its scores on background-related metrics (SSIM, BG-L1-Err) are marginally worse than those of some mask-based methods, reflecting the trade-off of using a coarse human mask, but the overall perceptual quality and robustness are superior in unconstrained scenarios.
Qualitative and User Study Results:
Visual comparisons demonstrate that, unlike prior methods, TripVVT reliably preserves garment details, human structure, and background integrity across a range of complex scenes. In user studies, TripVVT achieves the highest first-place rankings, with 67.6% preference on TripVVT-Bench, demonstrating strong alignment between objective and perceptual quality.
Ablations:
Without TripVVT-10K, the model's generalization collapses in outdoor/unconstrained settings. Excluding the human mask results in spatial misalignments and artifacts, and switching back to garment masks reintroduces fragility to pose and occlusion, particularly for high-motion or occluded gestures. Elimination of pose guidance compromises motion fidelity and realism.
Limitations and Future Directions
While the coarse human mask enhances robustness, it introduces ambiguity in distinguishing which regions to edit, occasionally affecting non-target apparel or accessories. Strict region-specific editing and disentanglement remain challenging. The paper suggests that advanced structural guidance—explicit garment segmentation or attention supervision—may be needed to achieve precise, selective garment transfer.
Additionally, the reliance on generative models for data synthesis (particularly for rare garment styles or complex scenes) may introduce biases or rare-mode artifacts, warranting further investigation and dataset expansion.
Implications and Future Developments
The introduction of TripVVT-10K and TripVVT-Bench sets a new standard for large-scale, real-world VVT research. The paradigm shift to coarse-mask spatial priors demonstrates that rigid, fine-grained mask generation is not a prerequisite for temporally and perceptually robust try-on synthesis in the wild. This opens avenues for even less supervised, prompt-driven, or region-free localization approaches that may exploit broader cues such as human parsing, text, or cross-modal feature attention.
Practically, the datasets and methods provided will likely catalyze advances in end-user applications—e-commerce, digital avatar creation, AR/VR content synthesis—by prioritizing adaptability to real-world video. The explicit, multi-modal benchmark protocol facilitates reproducible, rigorous comparison, paving the way for future hybrid architectures that integrate granular segmentation at proposal or revision stages.
Future research should explore the intersection of self-supervised or prompt-driven garment localization, more expressive temporal modeling (e.g., transformer-based spatio-temporal self-attention beyond DiT), and continual dataset enrichment for emerging clothing styles and settings.
Conclusion
The paper establishes a unified foundation for robust, scalable, and practical video virtual try-on, introducing a large-scale triplet dataset and a strong, spatially stable baseline leveraging coarse human-mask priors. Both the dataset and the benchmark provide necessary infrastructure for advancing this field toward general-purpose, in-the-wild virtual try-on with high fidelity, temporal coherence, and strong perceptual quality (2604.27958).