Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

Published 11 Apr 2026 in cs.CV | (2604.10217v1)

Abstract: Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster-response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families--in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data--on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer--matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px)--trained on synthetic cross-modal pairs--matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep--affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper demonstrates that careful protocol tuning (geometry, tiling, normalization) can induce up to 33× error variation, making protocol design a key factor in performance.
The study evaluates 24 pretrained matchers in a zero-shot setting across three datasets, highlighting how preprocessing choices and RANSAC filtering affect registration accuracy.
The findings reveal that frozen DINOv2-based architectures exhibit emergent cross-modal invariance, establishing them as strong candidates for operational satellite registration.

Evaluation of Pretrained Image Matchers for SAR-Optical Satellite Registration

Problem Formulation and Motivation

The paper investigates the effectiveness of current off-the-shelf, pretrained image matchers for the challenging problem of cross-modal satellite registration between optical and Synthetic Aperture Radar (SAR) imagery. This task is critical in disaster response where optical imagery is often degraded by atmospheric effects (e.g., cloud cover), necessitating registration of SAR images to pre-existing optical basemaps for timely georeferenced analysis. The difficulty is exacerbated by significant modality-induced appearance gaps: SAR and optical images capture fundamentally different radiometric and geometric properties, leading to distinctive sensor-driven artifacts such as speckle noise, geometric layover, and contrast inversion.

Methodology

The study evaluates twenty-four pretrained matcher models encompassing detector-based, detector-free dense, and 3D-reconstruction-derived paradigms. No fine-tuning or domain adaptation is performed, creating a strict zero-shot transfer regime. Evaluation focuses on three datasets:

SpaceNet9: Manual tie-point supervision over large, orthorectified satellite scenes for labeled train/test splits.
SRIF: 600 affine-labeled cross-modality image pairs.
SARptical: Retrieval-style patch matching with 40 queries and 525 candidate pairs per query.

The protocol involves optional normalization (identity, percentile, z-score, CLAHE), tiling with overlap, RANSAC-based robust geometric filtering (affine or homography models), and displacement computation for metric reporting. Metrics include mean tie-point error, Success@ $\tau$ (the proportion of tie points below $\tau$ pixel error), and failure rates.

Key Findings

A dominant theme is the high protocol sensitivity: intra-matcher accuracy variance due to protocol shifts often matches or exceeds differences induced by swapping models, establishing protocol design as a first-order determinant of operational performance.

Figure 1: Protocol sensitivity per matcher, showing intra-matcher variance from protocol choices often equals or exceeds inter-matcher differences.

Top Model Performance: Despite being trained exclusively on ground-level, single-modality imagery, RoMa achieves state-of-the-art mean error (3.0 px) on SpaceNet9, matching XoFTR (trained for visible–thermal cross-modality), and closely followed by cross-modal variants like MINIMA-RoMa and MatchAnything-ELoFTR (3.4 px). Notably:

XoFTR and RoMa share the lowest error (3.0 px, 78.9–78.4% Success@5, $<$ 0.01 failure).
MA-ELoFTR and MINIMA-based variants maintain competitive performance within 1 px of the leaders.

Protocol Dominance: Ablations reveal that protocol shifts—specifically, model geometry (affine vs. homography), tile size/overlap, normalization, and inlier gating—can induce up to 33× variation in mean error for a given matcher.

Figure 2: Comparing affine versus homography geometry models; affine is consistently superior due to the true orthorectified transform being approximately affine.

Figure 3: RANSAC threshold robustness curves showing that matcher sensitivity to the reprojection threshold often dominates overall registration reliability.

Qualitative Insights

The qualitative gallery illustrates that top matchers yield dense, geometrically consistent inlier correspondences (green lines), whereas less robust or mismatched architectures yield more outliers (orange lines), underscoring the interaction of backbone architecture and protocol tuning.

Figure 4: Qualitative correspondence gallery for SpaceNet9, highlighting matcher-dependent variation in inlier/outlier distribution.

Cross-Dataset and Generalization Analysis

The empirical ranking on SpaceNet9 is consistent on SRIF and SARptical, demonstrating robustness in zero-shot cross-dataset transfer. MINIMA-RoMa shows the best trade-off between accuracy and zero-failure on SRIF (mean corner error 47.0 px, 0% failure), aligning with its top-tier ranking on SpaceNet9 and competitive AUROC on SARptical.

Figure 5: Normalization × matcher heatmap on SRIF; normalization affects matchability differently across architectures, with MINIMA-RoMa + Z-Score normalization yielding lowest error.

Sparse detectors with graph matchers (e.g., XFeat, SuperPoint-LightGlue) trail dense correspondence models; 3D-reconstruction models (MASt3R, DUSt3R) are notably brittle, as their architectural priors (depth, perspective) mismatch the weakly perspective, domain-specific geometry of overhead imagery.

Implications

Practical

Deployment protocols (geometry model, tiling, normalization, inlier gating) must be tailored to satellite data; defaulting to homography or untuned thresholds degrades performance.
Strong zero-shot performance with RoMa and its variants, pointing to the potency of foundation model features (e.g., DINOv2/3) in providing emergent partial modality invariance—even without explicit cross-modal supervision.
For immediate deployment, the best baseline is MINIMA-RoMa or RoMa under affine geometry, with $768^2$ px tiles, $>$ 128 px overlap, percentile or CLAHE normalization, and RANSAC threshold $\leq$ 10 px.

Theoretical

The success of frozen DINOv2-based architectures (despite a lack of explicit cross-modal data) suggests dense backbone feature pretraining learns invariances that are at least partially cross-modal. This challenges the notion that cross-modal supervision is always required, although the exact mechanism (architecture or pretraining corpus diversity) is unresolved.
Architecture–protocol interactions (e.g., correspondence density, RANSAC robustness) remain poorly understood for remote sensing and cross-modality in general.

Limitations

Only three labeled scenes on SpaceNet9 limits statistical power; results are best interpreted as operational diagnostics rather than population estimates.
Evaluation is restricted to pretrained models; fine-tuning or recent models integrating satellite/SAR losses (e.g., MapGlue) are not considered.
The findings pertain to orthorectified satellite data; extension to non-orthorectified (e.g., raw slant-range SAR) regimes is untested.

Speculation on Future Directions

Protocol optimization (e.g., adaptive thresholding, geometry selection) and hybrid pipelines (multi-stage candidate selection + geometric verification) could surpass further gains from changing matchers.
Satellite-specific fine-tuning and data augmentation remain likely to yield incremental improvements, especially for classes of scenes (e.g., dense urban, mountainous) poorly represented in web-trained foundation backbones.
Deeper exploration of foundation model pretraining corpora, architectural ablations, and joint hybrid approaches (combining explicit cross-modal supervision with robust, protocol-aware matching) will elucidate mechanisms underlying the success of present approaches.

Conclusion

Pretrained image matchers—if coupled with careful protocol design—are sufficient for accurate zero-shot registration of SAR and optical satellite imagery at sub-8 px accuracy. Protocol factors (geometry, tiling, normalization) dominate performance variation, and foundation model backbones (e.g., DINOv2-based architectures) exhibit strong modality-invariant properties even without explicit cross-modal training. The combination of dense correspondence architectures with protocol-robust deployment forms the foundation for operational satellite registration pipelines, although statistical generality and adaptability to new or extreme domains require further research.

Markdown Report Issue