SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World

Published 28 May 2026 in cs.CV | (2605.30239v1)

Abstract: This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To ensure that the recovered objects remain consistent with the reconstructed scene, we restore scene-consistent object states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns the recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous and physically consistent interactive simulation of multiple objects within a reconstructed scene. Project page: https://chnxindong.github.io/sam3d-phys/

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces a novel framework that integrates generative 3D priors with scene-centric multi-view reconstruction to achieve physically realistic multi-object simulation.
It employs a four-stage pipeline—combining PGSR, SAM3D extraction, spatial/appearance alignment, and a modified MPM—to resolve incomplete object geometries and ensure stable dynamics.
Quantitative and qualitative evaluations show improved spatial fidelity, appearance consistency, and real-time performance on consumer-grade hardware.

SAM3D-Phys: Multi-Object Interactive Simulation in Reconstructed Real-World Scenes

Overview and Motivation

SAM3D-Phys introduces a computational framework for reconstructing real-world scenes to enable physically plausible multi-object interaction, targeting applications spanning VR, robotics, video generation, and embodied scene understanding (2605.30239). The paper observes a fundamental challenge: object geometries reconstructed via multi-view image methods (e.g., neural rendering, Gaussian splatting) are frequently incomplete due to occlusions or sparse viewpoints. Such artifacts undermine downstream physics simulation, especially in cluttered, multi-object environments where spatial and geometric consistency is vital for stable dynamics. Existing works either focus on single-object cases or restrict simulation to simplified planar scenes, exhibiting limitations in both completeness and physical realism.

Framework Architecture

SAM3D-Phys leverages generative 3D priors (SAM3D) integrated with scene-centric multi-view reconstructions to recover simulatable objects with full geometry and scene consistency. The pipeline executes four stages:

Scene Reconstruction with PGSR: Multi-view images are processed with Planar-based Gaussian Splatting (PGSR), enhanced by segmentation affinity features, followed by object removal and inpainting using LaMa for background completion.
Object Extraction via SAM3D: Target objects are segmented and converted from partial observations to complete 3D geometry utilizing generative priors acquired from SAM3D, resolving occlusion-induced incompleteness.
Object-Scene Alignment: Appearance and metric cues from the reconstructed scene guide spatial optimization and mask-based appearance distillation, restoring objects’ pose and appearance within the original environment.
Interactive Physical Simulation (MPM): Aligned objects are reinserted and simulated using a modified Material Point Method (MPM), supporting simultaneous, physically consistent dynamics for multiple objects.
Figure 1: SAM3D-Phys pipeline, from PGSR-based reconstruction to SAM3D object extraction, spatial/appearance alignment, and multi-object MPM simulation.

Object-Scene Spatial and Appearance Consistency

Spatial Alignment

The framework deploys a render-and-compare refinement optimized by SSIM/MS-SSIM objectives, driving per-object pose (translation, rotation) to minimize discrepancies between rendered masks and ground-truth image regions. Depth maps are converted into pointmaps via pinhole back-projection, providing accurate 3D initialization for generative modules.

Figure 2: Use of pointmaps enables improved object initialization, enhancing spatial alignment and simulation stability.

Spatial refinement is performed iteratively, with intermediate optimization steps yielding progressive alignment of objects to observed mask regions and correct spatial placement within the scene.

Figure 3: Pose refinement example—rendered object incrementally aligns to image mask during spatial optimization.

Physics-Constrained Alignment

SAM3D-Phys introduces a control graph derived via a multimodal model (Gemini) to encode object-object and scene-object interactions and constraints. Two terms are enforced: one for stable object-ground contact, and one for object-object non-penetration, addressing interpenetration errors and ensuring physically reasonable final placements.

Figure 4: Control graph guides spatial refinement and physically consistent object placement, mitigating interpenetration.

Appearance Distillation

Mask-guided appearance distillation aligns textures via patch-level VGG feature loss within overlapping mask regions. This module eliminates cartoon-like artifacts and bridges appearance gaps between decoupled/generated objects and original observations.

Figure 5: Appearance alignment improves texture fidelity and overall visual consistency with scene context.

Interactive Simulation and Scene Editing

Reintegrated and aligned objects are modeled and simulated using MPM, which efficiently aggregates per-object attributes (position, velocity, deformation) and enforces mass and momentum conservation across the Eulerian grid. Flexible interaction is facilitated via user-defined impulse forces and force fields, enabling scene-centric manipulation.

Figure 6: Visualization of multi-object physical simulation and object removal-editing in the reconstructed environment.

The pipeline achieves real-time efficiency on consumer-grade hardware (RTX 4090).

Experimental Evaluation

Dataset and Metrics

A real-world multi-object benchmark (six scenes, including complex mutual occlusions) was constructed to evaluate the method. Metrics are edge error (Sobel, spatial accuracy) and PSNR/SSIM (appearance fidelity within masks), and baselines include Feature Splatting and DecoupledGaussian.

Figure 7: Dataset scenes—multi-object scenarios with varying occlusion and coupling for robust evaluation.

Comparative Results

SAM3D-Phys demonstrates superior spatial decoupling, geometric completeness, and appearance fidelity relative to baselines, which suffer from fragmentation, tearing, and poor object separation—particularly under multi-object and occlusion-heavy conditions.

Figure 8: Comparative visualization—SAM3D-Phys avoids fragmentation and achieves accurate object separation/alignment.

Zoomed-in comparisons further highlight physically reasonable motion and high spatial/appearance fidelity.

Figure 9: Detail comparisons show baseline failure in multi-object scenarios; generative priors and metric cues resolve object separation gaps and guarantee visual/physical fidelity.

Quantitative ablation shows spatial alignment and appearance distillation modules yield measurable improvements: edge error drops by 1.0%, PSNR increases from 23.9 to 30.0, SSIM rises from 0.869 to 0.954.

User Study and LMM Evaluation

User and LMM-as-Judge evaluations confirm that SAM3D-Phys attains higher scores for both motion realism and visual quality, outperforming baselines by significant margins in both objective and subjective assessment.

Limitations and Future Directions

Appearance inconsistencies due to illumination (specular, shadow artifacts) and insufficient contextual cues remain problematic in scene completion and object inpainting, as shown in failure cases.

Figure 10: Limitations—specular highlights, shadow variations, and low-context degrade appearance reconstruction and alignment.

Further improvements may rely on robust illumination modeling, context-aware generative methods, and synergetic integration with multimodal cues.

Theoretical and Practical Implications

SAM3D-Phys establishes a holistic, training-free pipeline for generating physically simulatable objects from incomplete, occlusion-heavy real-world scene observations. Methodologically, it bridges generative 3D priors and scene-aware reconstruction, enforcing spatial/appearance consistency and physical plausibility. Practically, it advances capabilities in VR, robotics, and embodied AI by supporting dynamic scene editing and interactive simulation in realistic environments, suggesting directions for scalable, context-aware world models and simulation-ready, object-centric scene pipelines.

Conclusion

SAM3D-Phys provides an efficient, modular framework for physical simulation in real-world multi-object scenes, coupling generative 3D priors and spatial/appearance alignment with scene-centric reconstruction and MPM-based simulation. The method yields measurable gains in geometric completeness, spatial fidelity, and stable dynamics, validated across qualitative and quantitative benchmarks. The approach is practically deployable and extensible; addressing illumination/context limitations and scaling to broader multimodal scene modeling remain key avenues for future research.

Markdown Report Issue