- The paper introduces a hybrid pipeline combining test-time diffusion with cluster-based bundle adjustment for real-time, dense depth mapping.
- It improves temporal and geometric consistency and reduces metric error and drift relative to unguided diffusion and feed-forward monocular predictors.
- Quantitative evaluations on UAV imagery demonstrate near-real-time processing with precision approaching classical non-real-time reconstruction.
ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
Introduction and Motivation
The ZeD-MAP framework addresses the challenge of acquiring metrically consistent dense depth maps from ultra-high-resolution UAV imagery at near-real-time rates. In time-sensitive applications such as disaster response and search-and-rescue, the latency between image acquisition and delivery of georeferenced 3D mapping products must be minimized. However, large-format UAV imagery, diverse scene geometries, and operational constraints render traditional dense stereo and learning-based methods computationally prohibitive or insufficiently consistent.
Existing dense reconstruction pipelines, such as SGM-based photogrammetry and large-scale learned models, either lack the required runtime efficiency or depend on extensive pretraining and hardware not typically available during field deployments. Diffusion-based depth estimation models have shown promise for zero-shot, dense, per-image depth prediction with strong generalization properties, but their predictions lack metric grounding and are inconsistent both over time and across overlapping views. ZeD-MAP proposes a hybrid pipeline that injects lightweight, cluster-based bundle adjustment (BA) into the test-time diffusion workflow to enforce metric consistency and global coherence.
Methodology
Cluster-Based Incremental Bundle Adjustment
ZeD-MAP groups UAV images into clusters with a sliding-window, overlap-aware protocol: windows are GNSS-guided when GNSS is available and fall back to a fixed three-frame window otherwise. For each cluster, sparse keypoints are extracted, capped in number, and matched, with RANSAC filtering of the matches tuned for real-time, wide-baseline UAV data. Incremental bundle adjustment is then performed on the cluster, that is, on the middle frame and its neighbors. This optimization solves for camera poses and sparse 3D structure. The last group of frames from the previous cluster is held fixed and serves as an inter-cluster anchor, which propagates metric scale and limits drift.
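The fixed-window case can be illustrated with a minimal sketch. It assumes a window that slides by one frame so that every interior frame becomes a central frame once; the names `Cluster`, `make_clusters`, `size`, and `step` are hypothetical and not taken from the paper, and the BA solver itself is omitted.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    frames: List[int]   # frame indices that take part in this cluster's BA
    anchors: List[int]  # frames shared with the previous cluster, held fixed

    @property
    def center(self) -> int:
        """Representative (middle) frame of the cluster."""
        return self.frames[len(self.frames) // 2]


def make_clusters(num_frames: int, size: int = 3, step: int = 1) -> List[Cluster]:
    """Build overlapping sliding-window clusters (assumed scheme, not the paper's code)."""
    if size < 2 or not 0 < step < size:
        raise ValueError("need size >= 2 and 0 < step < size so windows overlap")
    clusters: List[Cluster] = []
    for start in range(0, num_frames - size + 1, step):
        frames = list(range(start, start + size))
        # Frames already optimized in the previous window stay fixed and carry
        # the metric scale forward; the first cluster has nothing to inherit.
        anchors = frames[: size - step] if clusters else []
        clusters.append(Cluster(frames, anchors))
    return clusters


if __name__ == "__main__":
    for c in make_clusters(5):
        print(c.frames, "fixed:", c.anchors, "center:", c.center)
```

In this sketch only the frames not listed in `anchors` would be free parameters in each BA run, which keeps the problem size per cluster constant regardless of mission length.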
Metric Guidance for Diffusion-Based Depth
Sparse 3D anchor tie-points, reprojected into the central cluster frame, supply a metric-aligned guide for the diffusion depth model. This design integrates directly into the Murre architecture but adapts it for online, sequential operation. Instead of relying on offline COLMAP-derived guidance, as in prior guided diffusion approaches, ZeD-MAP autonomously generates on-the-fly, mission-specific metric prior information, thus supporting real-time requirements.
Each cluster's representative frame receives a rasterized set of sparse depth anchors derived from BA. During diffusion, the model leverages these anchors and camera parameters to produce a dense depth map that is metrically aligned and consistent with previous and subsequent frames in the sequence. Output maps are incrementally fused into a global reconstruction using TSDF integration, enabling dynamic updating and direct generation of 3D products (true ortho-mosaics, global point clouds, and depth maps).
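Rasterizing sparse anchors into the representative frame can be sketched as a pinhole projection. This is an illustration under stated assumptions, not the authors' implementation: it assumes an undistorted pinhole camera with world-to-camera pose `(R, t)` and intrinsics `K`, uses `0` to mark pixels without an anchor, and keeps the nearest depth when several tie-points fall on the same pixel. The function name `rasterize_anchors` is hypothetical.

```python
import numpy as np


def rasterize_anchors(points_world, R, t, K, height, width):
    """Project 3D tie-points into one frame and return a sparse depth raster.

    points_world: (N, 3) BA tie-points in world coordinates
    R, t:         world-to-camera rotation (3, 3) and translation (3,)
    K:            (3, 3) pinhole intrinsics (assumed, no lens distortion)
    """
    pts_cam = np.asarray(points_world, dtype=float) @ R.T + t
    z = pts_cam[:, 2]
    in_front = z > 0                      # drop points behind the camera
    pts_cam, z = pts_cam[in_front], z[in_front]

    uvw = pts_cam @ K.T                   # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    depth = np.full((height, width), np.inf)
    # ufunc.at applies the minimum per pixel even when indices repeat,
    # so the nearest anchor wins deterministically.
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0          # 0 marks "no anchor here"
    return depth
```

The resulting raster is mostly empty, which matches the role described above: a handful of metric anchors that constrain the dense prediction rather than replace it.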
Efficient N-Frame Scheduling
The pipeline offers adaptive cluster sizing: when GNSS is available, cluster size is chosen dynamically from the coverage overlap; otherwise, a conservative fixed triple is used to guarantee sufficient spatial anchoring. This provides the coverage needed for dense reconstruction while keeping BA complexity low, which maintains near-real-time throughput even with 50+ MP frames.
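The summary does not state the exact sizing rule, so the following is only one plausible sketch. It assumes nadir imagery with a known ground footprint length, approximates the overlap between two frames as `1 - distance / footprint`, and grows the window around the central frame while neighbors still overlap it sufficiently. The function names and the parameters `footprint_m`, `min_overlap`, and `max_size` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Position = Tuple[float, float]  # projected GNSS position in meters (east, north)


def forward_overlap(p: Position, q: Position, footprint_m: float) -> float:
    """Approximate overlap fraction of two nadir footprints of equal length."""
    return max(0.0, 1.0 - math.dist(p, q) / footprint_m)


def choose_cluster(
    center: int,
    num_frames: int,
    positions: Optional[Sequence[Position]] = None,
    footprint_m: float = 100.0,
    min_overlap: float = 0.5,
    max_size: int = 7,
) -> List[int]:
    """Return the frame indices of the cluster around `center` (assumed rule)."""
    # Conservative fixed triple: always include the direct neighbors.
    lo, hi = max(center - 1, 0), min(center + 1, num_frames - 1)
    if positions is None:
        return list(range(lo, hi + 1))

    # With GNSS, extend on both sides while frames still overlap the center.
    while hi - lo + 1 < max_size:
        grew = False
        if lo > 0 and forward_overlap(
            positions[center], positions[lo - 1], footprint_m
        ) >= min_overlap:
            lo -= 1
            grew = True
        if (
            hi - lo + 1 < max_size
            and hi < num_frames - 1
            and forward_overlap(positions[center], positions[hi + 1], footprint_m)
            >= min_overlap
        ):
            hi += 1
            grew = True
        if not grew:
            break
    return list(range(lo, hi + 1))
```

Capping the window with `max_size` bounds the number of poses per BA run, which is the property the scheduling step relies on to keep throughput predictable.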
Experimental Validation and Results
Cross-Frame Consistency
Evaluation on sequential UAV imagery demonstrates that BA guidance significantly enhances temporal and geometric stability relative to unguided diffusion and state-of-the-art monocular predictors (e.g., VGGT, MapAnything, DepthAnything v2). The BA-guided approach consistently suppresses stochastic variation and alignment drift, preserving scene continuity and supporting robust mapping under large parallax and in low-texture environments.
Quantitative Accuracy
On a ground-marker benchmark (22 frames with high-precision GNSS/ground truth), ZeD-MAP achieves an XY error of 0.867 m and a vertical error of 0.123 m, with relative inter-marker errors under 2.2% (XY) and 1% (Z). Processing times average 1.47–4.91 seconds per image, depending on cluster size, which makes near-real-time operational deployment feasible. These results approach the geometric fidelity of classical, non-real-time methods (COLMAP reports 0.855 m XY / 0.057 m Z at ~45 s/image): the XY error is within about 1.5% of COLMAP's, the vertical error is roughly twice as large, and processing is roughly 9 to 30 times faster per image. Feed-forward large-scale pretrained models (VGGT, MapAnything) are faster but display significantly higher metric errors and inter-frame inconsistencies.
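The trade-off between the two pipelines can be made explicit with a short calculation that uses only the figures reported above; the dictionary layout and variable names are illustrative.

```python
# Reported figures from the ground-marker benchmark.
zed_map = {"xy_m": 0.867, "z_m": 0.123, "s_per_image": (1.47, 4.91)}
colmap = {"xy_m": 0.855, "z_m": 0.057, "s_per_image": 45.0}

fastest, slowest = zed_map["s_per_image"]
speedup_best = colmap["s_per_image"] / fastest    # smallest cluster size
speedup_worst = colmap["s_per_image"] / slowest   # largest cluster size

xy_ratio = zed_map["xy_m"] / colmap["xy_m"]       # planimetric error ratio
z_ratio = zed_map["z_m"] / colmap["z_m"]          # vertical error ratio

print(f"speedup: {speedup_worst:.1f}x to {speedup_best:.1f}x")
print(f"XY error ratio: {xy_ratio:.3f}, Z error ratio: {z_ratio:.2f}")
```

The ratios show that the planimetric error is nearly identical to COLMAP's, while the vertical error is about twice as large, in exchange for a speedup of roughly one order of magnitude.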
Large-Scale Disaster Mapping
Tests on a two-strip, 60-image earthquake dataset (DLR MACS, 7920 × 6004 px) confirm ZeD-MAP’s resilience to large-scale block misalignment, an area where feed-forward predictors exhibit marked drift in both planimetric and altimetric domains. Only ZeD-MAP and offline COLMAP maintain globally consistent geometry across strips. Relative to COLMAP, ZeD-MAP achieves comparable local DSM precision (0.024 m vs. 0.023 m), higher coverage, and acceptable global noise levels (as measured by NMAD and global standard deviation), all in a fraction of the processing time.
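For readers unfamiliar with NMAD, the sketch below uses the standard definition, the median absolute deviation scaled by 1.4826 so that it matches the standard deviation for normally distributed errors. How the paper samples the DSM differences is not stated here, so the input is simply assumed to be an array of height errors in meters.

```python
import numpy as np


def nmad(errors):
    """Normalized median absolute deviation of a set of height errors."""
    e = np.asarray(errors, dtype=float)
    e = e[np.isfinite(e)]                 # ignore no-data cells
    return 1.4826 * np.median(np.abs(e - np.median(e)))


if __name__ == "__main__":
    # Five small errors plus one gross outlier (values are made up).
    dz = [-0.02, -0.01, 0.0, 0.01, 0.02, 5.0]
    print("NMAD:", nmad(dz), "std:", np.std(dz))
```

Because it is built on medians, NMAD is barely affected by the single outlier, whereas the standard deviation is dominated by it. Reporting both, as the evaluation does, separates typical noise from gross errors.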
Implications and Future Directions
Integrating lightweight cluster-based BA into test-time diffusion bridges the gap between the rapid generalization of zero-shot deep networks and the metric rigor traditionally provided by geometric optimization. Practically, ZeD-MAP enables dense metric mapping in time-critical airborne missions without dependence on large labeled datasets, pretraining, or external SfM. On the theoretical side, it demonstrates that the metric and temporal consistency deficits of zero-shot diffusion can be largely mitigated through sparse geometric feedback at minimal computational overhead.
Further improvements are possible. Remaining bottlenecks include the intrinsic cost of diffusion inference and scalability to extreme image resolutions or frame rates. Tightening the integration between diffusion and SLAM/graph-optimization, extending support for robust initialization under rapid motion and high dynamic range scenes, and leveraging learned feature extraction specifically optimized for wide-baseline UAV geometry are all credible directions that could enhance both robustness and efficiency.
Conclusion
ZeD-MAP delivers a scalable, metrically consistent, near-real-time solution for dense aerial depth mapping, combining the generalization capability of zero-shot diffusion models with the globally consistent metric anchoring provided by efficient, cluster-based BA. Experimental results indicate that it outperforms feed-forward and naive zero-shot baselines in both accuracy and consistency, approaches classical dense MVS in local precision, and meets the runtime constraints of operational UAV deployment. The framework is a step toward combining learned and geometric methods for real-time geospatial computer vision.