GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

Published 12 Apr 2026 in cs.CV and cs.AI | (2604.10591v1)

Abstract: Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.

Abstract PDF Upgrade to Chat

Authors (9)

Summary

The paper presents a multi-modal, semantically grounded pretraining framework that aligns diverse remote sensing modalities using agentic captioning.
The methodology leverages multi-pretext masked autoencoding, JEPA-based predictive learning, and image-text contrastive loss for robust feature learning.
Empirical evaluations on datasets like BigEarthNet20k and So2Sat20k demonstrate notable improvements in performance metrics, ensuring cross-modal robustness.

Authoritative Summary of "GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing"

Introduction and Motivation

The paper introduces GeoMeld, a multi-modal, semantically grounded dataset and pretraining framework tailored for remote sensing (RS) foundation models. Traditional RS datasets and pretraining approaches exhibit modality fragmentation and insufficient semantic grounding, largely relying on vision-only or annotated label structures. By contrast, GeoMeld aligns heterogeneous modalities—optical (Sentinel-2), SAR (Sentinel-1), DEM, canopy height, land-cover maps, and location metadata—at scale, pairing each spatial anchor with a verified natural-language caption derived through a structured agent-based pipeline. The resulting dataset contains approximately 2.5 million samples, offering broad biome and geographic diversity (Figure 1).

Figure 1: Temporal distribution of GeoMeld samples, showing the number of samples per month and overall seasonal coverage.

GeoMeld-FM, the accompanying foundation model, is pretrained using multi-pretext masked autoencoding, JEPA-based predictive representation learning, and caption–vision contrastive alignment. This enables the joint representation space to capture cross-modal physical consistency and semantic grounding, facilitating downstream transfer and robustness.

Dataset Construction and Modality Alignment

GeoMeld's coordinate pool is instantiated from three strategies: biome-stratified sampling for ecological diversity; incorporation of spatial anchors from existing RS datasets (e.g., MMEarth, SkyScript) without overlap; and targeted sampling of underrepresented regions (Africa, South America, Asia) to address geographic bias. Each spatial anchor defines a standardized $1280\times1280$ m field of view, harmonized to $128\times128$ arrays at 10m GSD across modalities. Temporal alignment is achieved via anchor-based retrieval windows, ensuring sensor coherence and seasonal coverage.

Figure 2: A sample from GeoMeld showing spatially aligned multi-modal inputs derived through a unified alignment protocol.

Agentic Caption Generation Framework

Captions in GeoMeld are synthesized through a sequence of agentic operations:

Orchestrator: Aggregates modality-specific signals and metadata, constructing structured prompts.
Captioner: Generates multiple conditioned candidate captions, incorporating visual and metadata signals.
Evaluator: Ranks captions based on vision–text alignment, prioritizing cross-modal consistency.
Verification Agent: Enforces semantic consistency by cross-checking textual claims with geospatial attributes (land-cover, hydrological, terrain), correcting inconsistencies.

This multi-agent approach surpasses conventional templated or annotation-based captions, integrating relational, cross-modality information and reducing hallucination.

Figure 3: Agentic framework for generating semantically grounded captions and caption quality comparison.

Foundation Model Architecture: GeoMeld-FM

GeoMeld-FM utilizes a ConvNeXtV2 backbone, applying patch-based masking to Sentinel-2 imagery, and lightweight decoders for reconstruction or prediction across aligned modalities. Multi-pretext masked autoencoding (MP-MAE) forms the first major component, targeting pixel-wise cross-modality structure.

A JEPA branch introduces high-level latent-space predictive learning, creating context and target views of Sentinel-2 tiles under distinct masks, with a predictor mapping context representation to target latents from an EMA teacher.

Caption embeddings, produced by a Transformer, are aligned with pooled vision representations using a symmetric InfoNCE contrastive objective, directly embedding visual and textual signals in a joint space. The model is jointly trained with MP-MAE, JEPA, and ITC (image–text contrastive) losses.

Figure 4: GeoMeld-FM pretraining architecture integrating multi-pretext masked autoencoding, JEPA prediction, and vision–language contrastive alignment.

Experimental Evaluation

GeoMeld-FM is evaluated on the GeoBench suite, including BigEarthNet20k (land-cover multi-label), So2Sat20k (urban land-use), Cashew1k (semantic segmentation), and SAcrop3k (crop-type segmentation), using both linear probing (LP) and full fine-tuning (FT).

Significant gains are observed versus optical-only and prior multi-modal pretraining baselines, particularly in linear probing: GeoMeld-FM achieves $49.6$ (F1) on BigEarthNet20k and $50.2$ (accuracy) on So2Sat20k, surpassing MMEarth (MP-MAE) and S2-only models. Dense prediction tasks (Cashew1k, SAcrop3k) show improved IoU scores, demonstrating cross-sensor robustness.

Ablation studies reveal complementary contributions of MP-MAE, JEPA, and ITC. Cross-modality reconstruction yields major improvements; JEPA further enhances semantic structure; ITC enables meaningful bidirectional retrieval (R@5 up to $39.6$ for text-to-image), validating language grounding. The full model achieves the strongest downstream performance across modalities and tasks.

Practical and Theoretical Implications

GeoMeld establishes a new reference for scalable, semantically grounded RS foundation modeling. By directly incorporating richly aligned modalities and validated language supervision, the dataset supports the construction of vision-only, vision–language, and multi-agent models suited for geospatial understanding. The pretraining framework demonstrates superior transferability, spatial robustness, and improved semantic retrieval, essential for applications in Earth monitoring, biophysical modeling, and environmental AI. The agentic caption protocol reduces hallucination and ensures interpretability.

The architecture points toward future developments in RS-AI, including multi-temporal and multi-resolution modeling, fine-grained spatial grounding, interactive vision–language interfaces, and instruction-tuned foundation models optimized for global environmental monitoring.

Conclusion

GeoMeld, together with GeoMeld-FM, provides a large-scale, modality-aligned, semantically verified dataset and pretraining framework for remote sensing foundation models. Empirical results confirm that joint cross-modal reconstruction, predictive latent learning, and caption grounding substantially improve downstream performance and retrieval. GeoMeld offers a robust foundation for subsequent research in multi-modal, vision–language, and agentic AI for geospatial domains (2604.10591).