ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

Published 6 Apr 2026 in cs.CV, cs.LG, and cs.RO | (2604.04667v1)

Abstract: Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.

Authors (5)

Summary

The paper introduces a hybrid pipeline combining test-time diffusion with cluster-based bundle adjustment for real-time, dense depth mapping.
It improves temporal and geometric consistency and reduces metric error and drift relative to unguided diffusion and feed-forward predictors.
Quantitative evaluations on UAV imagery demonstrate near-real-time processing with precision approaching classical non-real-time reconstruction.

Introduction and Motivation

The ZeD-MAP framework addresses the challenge of acquiring metrically consistent dense depth maps from ultra-high-resolution UAV imagery at near-real-time rates. In time-sensitive applications such as disaster response and search-and-rescue, the latency between image acquisition and delivery of georeferenced 3D mapping products must be minimized. However, large-format UAV imagery, diverse scene geometries, and operational constraints render traditional dense stereo and learning-based methods computationally prohibitive or insufficiently consistent.

Existing dense reconstruction pipelines—such as SGM-based photogrammetry and large-scale learned models—either lack the required runtime efficiency or depend on extensive pretraining and hardware not typically available during field deployments. Diffusion-based depth estimation models have shown promise for zero-shot, dense, per-image depth prediction with strong generalization properties, but suffer from temporal inconsistency, lack of metric grounding, and inconsistent predictions across overlapping views. ZeD-MAP proposes a hybrid pipeline that injects lightweight, cluster-based bundle adjustment (BA) into the test-time diffusion workflow to enforce metric consistency and global coherence.

Methodology

Cluster-Based Incremental Bundle Adjustment

ZeD-MAP introduces a sliding-window, overlap-aware clustering protocol that groups UAV images into clusters using either GNSS-guided windows or fixed three-frame windows. For each cluster, sparse keypoints are extracted, capped in number, and matched, with RANSAC filtering tuned for real-time, wide-baseline UAV data. Incremental bundle adjustment is then performed on the cluster, specifically on the middle frame and its neighbors. This optimization solves for camera poses and sparse 3D structure while holding the last group of frames from the previous cluster fixed as inter-cluster anchors, which propagates metric scale and prevents drift.
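
The overlapping-window grouping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Cluster` type, the `make_clusters` function, and the one-frame overlap are assumptions chosen for the example. The source states only that clusters overlap and that trailing frames of the previous cluster are held fixed during BA.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Cluster:
    frames: List[str]  # all frames adjusted together in this cluster
    fixed: List[str]   # frames shared with the previous cluster; poses held constant


def make_clusters(frame_ids: Sequence[str], size: int = 3, overlap: int = 1) -> List[Cluster]:
    """Group a frame sequence into overlapping sliding windows.

    `size` and `overlap` are hypothetical parameters. Frames at the end of the
    sequence that do not yet fill a complete window are left out; in a
    streaming setting they would wait for more frames to arrive.
    """
    if not 0 < overlap < size:
        raise ValueError("overlap must satisfy 0 < overlap < size")
    step = size - overlap
    clusters: List[Cluster] = []
    for start in range(0, len(frame_ids) - size + 1, step):
        window = list(frame_ids[start:start + size])
        # The first cluster has no predecessor, so nothing is held fixed.
        fixed = window[:overlap] if clusters else []
        clusters.append(Cluster(frames=window, fixed=fixed))
    return clusters


if __name__ == "__main__":
    for c in make_clusters([f"f{i}" for i in range(7)]):
        print(c.frames, "fixed:", c.fixed)
```

Holding the shared frames fixed means each new cluster is solved in the coordinate frame and scale already established by its predecessor, which is how scale is carried forward without re-optimizing the whole sequence.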

Metric Guidance for Diffusion-Based Depth

Sparse 3D anchor tie-points, reprojected into the central cluster frame, supply a metric-aligned guide for the diffusion depth model. This design integrates directly into the Murre architecture but adapts it for online, sequential operation. Instead of relying on offline COLMAP-derived guidance, as in prior guided diffusion approaches, ZeD-MAP autonomously generates on-the-fly, mission-specific metric prior information, thus supporting real-time requirements.
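
Reprojecting BA tie-points into a frame amounts to a pinhole projection followed by rasterization. The sketch below assumes an undistorted pinhole camera with a world-to-camera rotation `R` and translation `t`. The function name `project_to_sparse_depth`, the dictionary output, and the nearest-point rule for pixel collisions are illustrative assumptions, not details taken from the paper or from Murre.

```python
import math
from typing import Dict, Sequence, Tuple

Pixel = Tuple[int, int]  # (row, col)


def project_to_sparse_depth(
    points_world: Sequence[Sequence[float]],
    R: Sequence[Sequence[float]],  # 3x3 world-to-camera rotation
    t: Sequence[float],            # world-to-camera translation
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Dict[Pixel, float]:
    """Project 3D tie-points into an image and keep one metric depth per pixel."""
    sparse: Dict[Pixel, float] = {}
    for X in points_world:
        # World -> camera coordinates: X_c = R X + t
        xc, yc, zc = (sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))
        if zc <= 0:
            continue  # point is behind the camera
        # Perspective division and intrinsics
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        col, row = math.floor(u), math.floor(v)
        if 0 <= col < width and 0 <= row < height:
            # Assumption: if several points land on one pixel, keep the nearest.
            if (row, col) not in sparse or zc < sparse[(row, col)]:
                sparse[(row, col)] = zc
    return sparse


if __name__ == "__main__":
    # Nadir-looking camera 50 m above the origin (hypothetical values).
    R = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
    t = [0.0, 0.0, 50.0]
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0), (0.0, 0.0, 60.0)]
    print(project_to_sparse_depth(pts, R, t, 1000, 1000, 320, 240, 640, 480))
```

The resulting sparse map is empty almost everywhere, so the depth model still has to infer the dense structure; the anchors only pin its output to metric values at a few pixels.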

Each cluster's representative frame receives a rasterized set of sparse depth anchors derived from BA. During diffusion, the model leverages these anchors and camera parameters to produce a dense depth map that is metrically aligned and consistent with previous and subsequent frames in the sequence. Output maps are incrementally fused into a global reconstruction using TSDF integration, enabling dynamic updating and direct generation of 3D products (true ortho-mosaics, global point clouds, and depth maps).
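
TSDF fusion keeps, for every voxel, a running weighted average of truncated signed distances to the observed surface. The sketch below reduces this to voxels sampled along a single viewing ray so that the update rule is easy to follow. The function names, the unit observation weight, and the truncation distance are assumptions for illustration; the fusion settings actually used by ZeD-MAP are not given in this summary.

```python
from typing import List, Optional, Sequence


def integrate_ray(
    tsdf: List[float],
    weights: List[float],
    voxel_depths: Sequence[float],
    observed_depth: float,
    trunc: float,
    obs_weight: float = 1.0,
) -> None:
    """Fuse one depth observation into the voxels along a ray (in place)."""
    for i, zv in enumerate(voxel_depths):
        sdf = observed_depth - zv  # positive in front of the surface
        if sdf < -trunc:
            continue  # far behind the surface: occluded, leave untouched
        d = max(-1.0, min(1.0, sdf / trunc))  # truncate and normalize
        w_old = weights[i]
        tsdf[i] = (w_old * tsdf[i] + obs_weight * d) / (w_old + obs_weight)
        weights[i] = w_old + obs_weight


def surface_depth(
    tsdf: Sequence[float], weights: Sequence[float], voxel_depths: Sequence[float]
) -> Optional[float]:
    """Return the first zero crossing between observed voxels, or None."""
    for i in range(len(tsdf) - 1):
        if weights[i] > 0 and weights[i + 1] > 0 and tsdf[i] > 0 >= tsdf[i + 1]:
            a, b = tsdf[i], tsdf[i + 1]
            step = voxel_depths[i + 1] - voxel_depths[i]
            return voxel_depths[i] + a / (a - b) * step  # linear interpolation
    return None


if __name__ == "__main__":
    depths = [49.0, 49.5, 50.0, 50.5, 51.0, 52.0]
    tsdf, w = [0.0] * len(depths), [0.0] * len(depths)
    for obs in (50.0, 50.2):  # two slightly inconsistent depth maps
        integrate_ray(tsdf, w, depths, obs, trunc=1.0)
    print(surface_depth(tsdf, w, depths))
```

Because each voxel stores an average, small disagreements between overlapping depth maps are smoothed rather than duplicated as separate surfaces, and new frames can be integrated as they arrive.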

Efficient N-Frame Scheduling

The pipeline offers adaptive cluster sizing: when GNSS is available, cluster size is chosen dynamically based on coverage overlap; otherwise, a conservative fixed triple is used to guarantee sufficient spatial anchoring. This approach provides the coverage needed for dense reconstruction while minimizing BA complexity, maintaining real-time throughput even with 50+ MP frames.
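
One way such a rule could look is sketched below. The selection criterion is an assumption, since the summary does not state how overlap is turned into a cluster size: here, frames are added while their estimated footprint overlap with the first frame stays above a threshold. The function `choose_cluster_size`, the `min_overlap` threshold, and the one-dimensional overlap estimate are all hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def choose_cluster_size(
    positions: Optional[Sequence[Tuple[float, float]]],  # GNSS (east, north) in metres
    footprint_m: float,        # ground footprint length along the flight direction
    min_overlap: float = 0.5,  # hypothetical overlap threshold
    fallback: int = 3,         # fixed triple used without GNSS
) -> int:
    """Pick how many consecutive frames to adjust together."""
    if not positions:
        return fallback  # no GNSS: use the conservative fixed triple
    size = 1
    for pos in positions[1:]:
        baseline = math.dist(positions[0], pos)
        overlap = 1.0 - baseline / footprint_m  # crude overlap estimate
        if overlap < min_overlap:
            break
        size += 1
    # Assumption: never go below the fixed triple.
    return max(size, fallback)


if __name__ == "__main__":
    track = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0), (40.0, 0.0)]
    print(choose_cluster_size(track, footprint_m=50.0, min_overlap=0.3))
    print(choose_cluster_size(None, footprint_m=50.0))
```

Larger clusters give BA more views per tie-point but increase its cost, which is consistent with the per-image runtime depending on cluster size.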

Experimental Validation and Results

Cross-Frame Consistency

Evaluation on sequential UAV imagery demonstrates that BA guidance significantly enhances temporal and geometric stability relative to unguided diffusion and state-of-the-art feed-forward predictors (e.g., VGGT, MapAnything, DepthAnything v2). The BA-guided approach consistently suppresses stochastic variation and alignment drift, preserving scene continuity and supporting robust mapping under large parallax and in low-texture environments.

Quantitative Accuracy

On a ground-marker benchmark (22 frames with high-precision GNSS/ground truth), ZeD-MAP achieves an XY error of 0.867 m and a vertical error of 0.123 m, with relative inter-marker errors under 2.2% (XY) and 1% (Z). Processing times range from 1.47 to 4.91 seconds per image, depending on cluster size, demonstrating feasibility for operational real-time deployment. These results approach the geometric fidelity of classical, non-real-time methods (COLMAP reports 0.855 m XY/0.057 m Z, ~45 s/image) while vastly outpacing them in throughput. Feed-forward large-scale pretrained models (VGGT, MapAnything) are faster but display significantly higher metric errors and inter-frame inconsistencies.
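
To make the trade-off concrete, the short calculation below works only from the figures quoted above. The COLMAP runtime is given as approximate in the source, so the resulting speedup of roughly 9x to 31x is indicative rather than exact. The variable and function names are illustrative.

```python
from typing import Tuple

# Figures quoted in the text above
ZEDMAP_RUNTIME_S = (1.47, 4.91)  # per image, fastest and slowest configuration
COLMAP_RUNTIME_S = 45.0          # per image, approximate
ZEDMAP_ERR_M = {"xy": 0.867, "z": 0.123}
COLMAP_ERR_M = {"xy": 0.855, "z": 0.057}


def speedup_range(baseline_s: float, fast_range_s: Tuple[float, float]) -> Tuple[float, float]:
    """Return (minimum, maximum) speedup of the fast method over the baseline."""
    fastest, slowest = fast_range_s
    return baseline_s / slowest, baseline_s / fastest


def error_gap(ours: dict, reference: dict) -> dict:
    """Absolute error difference per axis, in metres (positive = ours is worse)."""
    return {axis: ours[axis] - reference[axis] for axis in ours}


if __name__ == "__main__":
    lo, hi = speedup_range(COLMAP_RUNTIME_S, ZEDMAP_RUNTIME_S)
    print(f"speedup: {lo:.1f}x to {hi:.1f}x")
    for axis, gap in error_gap(ZEDMAP_ERR_M, COLMAP_ERR_M).items():
        print(f"{axis} error gap: {gap:.3f} m")
```

The horizontal gap to COLMAP is about a centimetre, while the vertical error is roughly twice COLMAP's. "Approaching" classical fidelity is therefore more accurate for XY than for Z.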

Large-Scale Disaster Mapping

Tests on a two-strip, 60-image earthquake dataset (DLR MACS, 7920 × 6004 px) affirm ZeD-MAP’s resilience to large-scale block misalignment—an area where feed-forward predictors exhibit marked drift in both planimetric and altimetric domains. Only ZeD-MAP (and offline COLMAP) maintain globally consistent geometry across strips. ZeD-MAP achieves comparable local DSM precision (0.024 m vs. COLMAP’s 0.023 m), higher coverage, and acceptable global noise levels (as measured by NMAD and global std-dev), all within a fraction of the time of classical methods.
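
NMAD (normalized median absolute deviation) is a robust spread estimate commonly used for elevation-model errors: 1.4826 times the median absolute deviation from the median, which matches the standard deviation for normally distributed errors but is much less sensitive to outliers. The sketch below shows the computation. The sample residuals are invented for illustration and are not data from the paper.

```python
import statistics
from typing import Sequence


def nmad(errors: Sequence[float]) -> float:
    """Normalized median absolute deviation: 1.4826 * median(|e - median(e)|)."""
    med = statistics.median(errors)
    return 1.4826 * statistics.median(abs(e - med) for e in errors)


if __name__ == "__main__":
    # Hypothetical height residuals in metres, with one gross outlier.
    residuals = [0.01, -0.02, 0.03, 0.00, 0.50]
    print("NMAD:", nmad(residuals))
    print("std :", statistics.pstdev(residuals))
```

Reporting both NMAD and the global standard deviation separates the typical noise level from the influence of a few large blunders, which is useful when comparing a learned depth model against a classical pipeline.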

Implications and Future Directions

The integration of lightweight cluster-based BA into test-time diffusion bridges the gap between the rapid generalization of zero-shot deep networks and the metric rigor traditionally provided by geometric optimization. Practically, ZeD-MAP enables dense metric mapping in time-critical airborne missions without dependence on large labeled datasets, task-specific retraining, or external SfM. On the theoretical side, it demonstrates that the metric and temporal consistency deficits of zero-shot diffusion can be largely mitigated through sparse geometric feedback with minimal computational overhead.

Further improvements are possible. Remaining bottlenecks include the intrinsic cost of diffusion inference and scalability to extreme image resolutions or frame rates. Tightening the integration between diffusion and SLAM/graph-optimization, extending support for robust initialization under rapid motion and high dynamic range scenes, and leveraging learned feature extraction specifically optimized for wide-baseline UAV geometry are all credible directions that could enhance both robustness and efficiency.

Conclusion

ZeD-MAP delivers a scalable, metrically consistent, and real-time-capable solution for dense aerial depth mapping, combining the generalization capability of zero-shot diffusion models with the globally consistent metric anchoring provided by efficient, cluster-based BA. Experimental results confirm that it outperforms feed-forward and naive zero-shot baselines in both accuracy and consistency, approaches classical dense MVS in local precision, and meets the runtime constraints necessary for operational UAV deployment. The framework represents a significant advancement in harmonizing learned and geometric methods for real-time geospatial computer vision.