GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Published 22 May 2026 in cs.CV | (2605.23888v1)

Abstract: We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.


Authors (5)

Summary

The paper presents a tightly-coupled framework using Trellis.2 generative priors to produce complete, semantically consistent 3D scene reconstructions from multi-view RGB inputs.
It employs a projection-based multi-view conditioning pathway with per-chunk latent aggregation and LoRA adaptation to ensure pose-consistent and realistic mesh recovery.
Empirical results on ScanNet++ and synthetic datasets demonstrate enhanced geometric fidelity, completeness, and physically-based texture recovery for editable PBR meshes.

Problem Domain and Motivation

High-fidelity multi-view 3D scene reconstruction from RGB images remains a central challenge in computer vision and graphics, with strict fidelity requirements driven by immersive AR/VR, robotics, and digital content pipelines. Existing approaches split into per-scene optimization and feed-forward prediction, and both have clear limitations. Optimization-based methods (e.g., neural implicit surfaces, Gaussian splatting) fail in occluded or weakly textured regions, often producing incomplete or oversmoothed geometry. Direct regression models (e.g., depth foundation models, volumetric fusion) recover observed surfaces robustly but lack explicit generative priors, so their outputs are unstructured and deterministic and do not fill unobserved regions with plausible content.

Methodological Advancements

GenRecon tightly couples reconstruction with a strong generative 3D prior, specifically Trellis.2, to improve scene-level mesh reconstruction. The pipeline frames scene recovery as conditional 3D generation over overlapping spatial chunks, whose outputs are jointly synthesized into a single coherent mesh covering the full scene extent. This approach inherits the fidelity and completeness of the object-level generative model while addressing two requirements for scaling to scenes: multi-image conditioning and explicit pose control.
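The idea of tiling a scene with overlapping chunks can be illustrated with a small sketch. The summary does not give the chunk size, the overlap, or how the scene bounds are derived, so `chunk_origins`, `chunk_size`, and `overlap` below are hypothetical names and the regular-grid layout is an assumption, not the authors' procedure.

```python
import numpy as np

def chunk_origins(scene_min, scene_max, chunk_size, overlap):
    """Return the min-corners of cubic chunks that tile an axis-aligned scene box.

    Neighbouring chunks share `overlap` units along each axis (assumed layout).
    """
    scene_min = np.asarray(scene_min, dtype=float)
    scene_max = np.asarray(scene_max, dtype=float)
    if not 0.0 <= overlap < chunk_size:
        raise ValueError("overlap must lie in [0, chunk_size)")
    stride = chunk_size - overlap
    extent = scene_max - scene_min
    # Enough chunks per axis that the last one reaches (or passes) scene_max.
    counts = np.maximum(1, np.ceil((extent - chunk_size) / stride).astype(int) + 1)
    axes = [scene_min[d] + stride * np.arange(counts[d]) for d in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)

# Hypothetical room of 10 x 4 x 4 units, chunks of side 4 with overlap 1.
origins = chunk_origins([0, 0, 0], [10, 4, 4], chunk_size=4.0, overlap=1.0)
```

With these illustrative numbers the stride is 3, so the x-axis gets three chunks starting at 0, 3, and 6, and the last one ends exactly at the scene boundary.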

A spatially-grounded, projection-based multi-view conditioning pathway is introduced, lifting DINOv3 image features from each posed RGB input into per-view 3D volumes aligned with each chunk. Aggregation via an IBRNet-style mechanism ensures permutation invariance across views, constructing a 3D conditioning grid that retains geometric correspondence and global context. Conditioned latent generation occurs through parameter-efficient LoRA adaptation, preserving the pretrained prior's structure.
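A minimal sketch of projection-based lifting with order-independent aggregation follows. It assumes pinhole cameras, nearest-pixel feature lookup, and plain mean and variance pooling over the views that see each voxel. That pooling is IBRNet-like only in spirit: the learned weighting of IBRNet and the actual DINOv3 features are not reproduced, and `lift_and_aggregate` is a hypothetical name.

```python
import numpy as np

def lift_and_aggregate(feats, K, w2c, points):
    """Project voxel centers into each view, sample features, pool over views.

    feats:  (V, H, W, C) per-view feature maps
    K:      (V, 3, 3) intrinsics in feature-map pixel units
    w2c:    (V, 4, 4) world-to-camera extrinsics
    points: (N, 3) voxel centers in world coordinates
    Returns (N, 2C) concatenated mean and variance, and (N,) visible-view counts.
    """
    V, H, W, C = feats.shape
    N = points.shape[0]
    homog = np.concatenate([points, np.ones((N, 1))], axis=1)
    sampled = np.zeros((V, N, C))
    valid = np.zeros((V, N), dtype=bool)
    for v in range(V):
        cam = (w2c[v] @ homog.T).T[:, :3]          # points in camera frame
        z = cam[:, 2]
        pix = (K[v] @ cam.T).T
        safe_z = np.where(z > 1e-6, z, 1.0)        # avoid division by zero
        ui = np.round(pix[:, 0] / safe_z).astype(int)
        vi = np.round(pix[:, 1] / safe_z).astype(int)
        # Keep points in front of the camera that land inside the feature map.
        ok = (z > 1e-6) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
        sampled[v, ok] = feats[v, vi[ok], ui[ok]]
        valid[v] = ok
    w = valid[..., None].astype(float)             # (V, N, 1) visibility mask
    count = w.sum(axis=0)
    denom = np.maximum(count, 1.0)
    mean = (w * sampled).sum(axis=0) / denom
    var = (w * (sampled - mean) ** 2).sum(axis=0) / denom
    return np.concatenate([mean, var], axis=1), count[:, 0]
```

Because the pooling uses only sums over the view axis, shuffling the input views leaves the output unchanged up to floating-point rounding, which is the permutation-invariance property the text describes.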

Figure 1: End-to-end pipeline: posed images and sparse SfM points define scene chunks; conditioned 3D features are aggregated and injected into the generative denoiser for joint mesh synthesis.

Within each chunk, generative modeling proceeds via flow-matching, maintaining spatial consistency, even at chunk boundaries, through boundary-sensitive multi-chunk latent aggregation. The final scene-level latent is decoded into a PBR mesh suitable for physically-based rendering and downstream editing.
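One common way to keep overlapping chunks consistent is to blend their latents with weights that fall off toward chunk faces. The sketch below illustrates that idea only; the summary does not specify the aggregation rule, so the tent window, the dense latent grids, and the names `tent_window` and `blend_chunks` are all assumptions rather than the paper's mechanism.

```python
import numpy as np

def tent_window(size):
    """Separable weight that peaks at the chunk center and shrinks toward its faces."""
    idx = np.arange(size)
    ramp = np.minimum(idx + 1, size - idx).astype(float)
    ramp /= ramp.max()
    return ramp[:, None, None] * ramp[None, :, None] * ramp[None, None, :]

def blend_chunks(chunks, offsets, scene_shape):
    """Weighted average of overlapping chunk latents on a shared scene grid.

    chunks:      list of (S, S, S, C) dense latent grids (assumed dense here)
    offsets:     list of (x, y, z) integer voxel offsets into the scene grid
    scene_shape: (X, Y, Z) size of the scene grid in voxels
    Returns the blended (X, Y, Z, C) grid and a boolean coverage mask.
    """
    C = chunks[0].shape[-1]
    acc = np.zeros((*scene_shape, C))
    wsum = np.zeros(scene_shape)
    for latent, (x, y, z) in zip(chunks, offsets):
        S = latent.shape[0]
        w = tent_window(S)
        acc[x:x + S, y:y + S, z:z + S] += w[..., None] * latent
        wsum[x:x + S, y:y + S, z:z + S] += w
    covered = wsum > 0
    # Normalize only where at least one chunk contributed.
    acc[covered] /= wsum[covered][:, None]
    return acc, covered
```

In the overlap region each chunk contributes most near its own center and least near its faces, so values transition smoothly from one chunk to the next instead of switching abruptly at a seam.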

Empirical Evaluation

The method is benchmarked against state-of-the-art reconstruction pipelines spanning optimization-based, feed-forward, and diffusion-prior approaches on both synthetic (3D-FRONT) and real-world (ScanNet++) indoor datasets. GenRecon reports quantitatively superior results across all evaluated metrics: geometric alignment (Chamfer distance), completeness, angular normal error, perceptual similarity (LPIPS), semantic similarity (CLIP), and F-score at a 10 cm tolerance. Notable findings include:

On ScanNet++, GenRecon achieves a 16% improvement in geometric fidelity over strong baseline methods, with RMSE, AbsRel, and completeness consistently better than those of per-scene and feed-forward approaches.
On synthetic data, Chamfer distance and normal consistency indicate faithful mesh recovery and structural accuracy, while high CLIP similarity and low LPIPS distance indicate semantic and perceptual alignment.

Figure 2: Qualitative comparisons on ScanNet++: GenRecon reconstructions exhibit superior completeness and preserve fine-scale details relative to all baselines.
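For readers unfamiliar with the geometric metrics named in the evaluation, the sketch below computes a symmetric Chamfer distance and an F-score at a distance tolerance between two point sets sampled from the predicted and ground-truth surfaces. Conventions vary between papers (squared versus unsquared distances, sum versus mean), and the summary does not say which one GenRecon uses, so this is one common variant. The default `tau=0.10` assumes coordinates in meters, matching the 10 cm tolerance.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (N, 3), the Euclidean distance to its nearest point in b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1))

def chamfer_and_fscore(pred, gt, tau=0.10):
    """Return (chamfer, precision, recall, f_score) for two point sets.

    Brute-force O(N * M) nearest neighbours; fine for small samples only.
    """
    d_pred = nearest_distances(pred, gt)   # accuracy: prediction to ground truth
    d_gt = nearest_distances(gt, pred)     # completeness: ground truth to prediction
    chamfer = 0.5 * (d_pred.mean() + d_gt.mean())
    precision = float((d_pred < tau).mean())
    recall = float((d_gt < tau).mean())
    total = precision + recall
    f_score = 0.0 if total == 0 else 2.0 * precision * recall / total
    return float(chamfer), precision, recall, f_score
```

Precision penalizes predicted geometry that lies far from the true surface, while recall penalizes true surface that the prediction misses, so hallucinated and missing geometry lower the F-score in different ways.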

Ablation studies further demonstrate the effectiveness of the projection-based 3D conditioning, with pose-consistent chunk alignment emerging only when this pathway is utilized. Moreover, chunk generation quality scales with the number of conditioning views, yet even single-view inputs yield spatially correct and semantically plausible geometry.

Figure 3: Ablation experiments on unseen SAGE-10k chunks: 3D conditioning enables pose-correct generation from a single view; performance improves with increased view count.

Large-scale scene reconstructions validate the scalability of the chunk-based pipeline, enabling mesh recovery for extensive indoor environments.

Figure 4: High-fidelity mesh generation for large indoor environments, visualized via top-down and detailed close-ups.

Physically-Based Texture Recovery and Relighting

Distinct from most prior feed-forward and optimization pipelines, GenRecon provides editable PBR meshes with material properties (albedo, metallic, roughness), facilitating realistic scene relighting under varying illumination and seamless integration into graphics authoring engines.

Figure 5: PBR channel predictions (lit, albedo, metallic/roughness) for ScanNet++ reconstructions; recovered materials respond plausibly in rendering environments.

Figure 6: Relighting experiments: recovered scenes can be realistically illuminated under arbitrary configurations.

Limitations and Future Directions

While GenRecon sets a new benchmark for indoor scene mesh recovery, known limitations include reduced performance on non-Lambertian surfaces due to limited representation in training data and potential hallucination of geometry in weakly observed regions attributable to the strong prior. Chunk partitioning is not yet adaptive for unusually large or complex spatial extents.

Anticipated developments include strengthening the generative prior with larger and more diverse training corpora, adaptive chunking strategies, explicit modeling of challenging materials, and extension toward open-world reconstruction. The framework lays groundwork for integrating generative modeling into practical, scalable 3D asset pipelines, with potential benefits for simulation, embodied AI, and real-time content creation.

Conclusion

GenRecon effectively bridges object-level generative 3D priors with multi-view, scene-scale mesh recovery, introducing a spatially-grounded conditioning mechanism and efficient chunk-based synthesis. The achieved mesh fidelity, completeness, and PBR material recovery mark a substantial advance over existing paradigms, supporting practical relighting and editing. The approach is extensible and signals future progress toward robust, high-fidelity scene recovery solutions spanning graphics, robotics, and AI simulation.
