Closing the synthetic-to-real domain gap for unpaired photographs

Develop methods to close the domain gap between the synthetic material renderings used for self-supervised pretraining and unpaired real photographs, enabling the physically-grounded visual representations learned from synthetic data to transfer reliably to real-world images.

Background

The Φeat paper (Vecchio et al., 2025) introduces a self-supervised ViT-based backbone trained with material-aware physical augmentations. It uses multiple renderings of the same material under varied geometry and illumination to learn physically-grounded features. This pretraining is conducted entirely on synthetic data generated from procedural materials rendered with a path tracer.
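
To make the multi-rendering idea concrete, here is a minimal sketch of how two renderings of the same material could act as a positive pair in a self-supervised objective. The summary above does not state the loss the authors use, so this swaps in a generic InfoNCE-style contrastive loss purely for illustration. The function name `info_nce`, the `temperature` value, and the random embeddings standing in for backbone outputs are all assumptions, not details from the paper.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's objective).

    z_a[i] and z_b[i] are embeddings of two renderings of material i,
    e.g. the same material under different geometry and lighting.
    Every other row in the batch is treated as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarities, shape (N, N); the diagonal holds the positives.
    logits = z_a @ z_b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Mean negative log-likelihood of the matching rendering.
    return -np.mean(np.diag(log_prob))

# Hypothetical stand-ins for backbone embeddings of 8 materials.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.01 * rng.normal(size=(8, 16))  # second rendering, nearly invariant features

loss_matched = info_nce(view_a, view_b)
loss_mismatched = info_nce(view_a, np.roll(view_b, 1, axis=0))  # pairs deliberately misaligned
```

In this sketch the loss is low when embeddings of the same material agree across renderings and high when the pairs are misaligned, which is the sense in which such training encourages invariance to geometry and illumination.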

The model demonstrates strong invariance to geometry and lighting and improved material-aware performance. However, the authors note that transferring these representations to real photographs without paired supervision is hindered by a synthetic-to-real domain gap. Addressing this gap is essential for applying the learned features broadly to real-world imagery and tasks.

References

Moreover, our pretraining relies entirely on synthetic data; closing the domain gap to unpaired real photographs remains an open challenge.

Φeat: Physically-Grounded Feature Representation (2511.11270 - Vecchio et al., 14 Nov 2025) in Section 6 (Limitations)