Papers
Topics
Authors
Recent
Search
2000 character limit reached

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

Published 19 Apr 2026 in cs.CV and cs.AI | (2604.17473v1)

Abstract: Vision-Language Navigation(VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video LLMs(Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. Facilitating this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark data for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

Summary

  • The paper introduces a Dual-Anchoring framework that regularizes agent states through instruction progress and memory landmark anchoring to counteract state drift.
  • It leverages fine-grained progress supervision and dense landmark reconstruction to boost success rates, achieving up to 65.6% on the R2R-CE benchmark.
  • The approach demonstrates robust real-world generalization, validated by zero-shot deployment on a quadruped robot and enhanced trajectory fidelity.

Dual-Anchoring: Structured State Regularization for Continuous Vision-Language Navigation

Introduction and Motivation

Vision-Language Navigation (VLN) in continuous 3D environments presents fundamental challenges for embodied AI, particularly when executing multi-step natural language instructions over extended trajectories. Despite advances in VLN via Video LLMs (Video-LLMs), a persistent failure mode—State Drift—limits robustness in long-horizon navigation. State Drift emerges as the gradual decoupling of the agent's internal cognitive representation from the true environment and task execution status. The paper decomposes this into two cognitive deficits: Progress Drift (mis-tracking instruction progress) and Memory Drift (loss of distinct representations for previously visited landmarks). Figure 1

Figure 1: The Challenge of State Drift in long-horizon VLN, with progress and memory drift compounding navigation errors.

To address these issues, the authors propose a Dual-Anchoring Framework that explicitly regularizes and aligns the internal state of Video-LLM-based VLN agents at two axes: (1) Instruction Progress Anchoring, which supervises the agent to maintain structured, aligned mental checklists of completed and remaining sub-goals, and (2) Memory Landmark Anchoring, where the agent reconstructs dense object-centric representations from salient landmarks to preserve contextual memory. This approach is designed to ensure semantic alignment and mitigate perceptual aliasing, particularly critical for robust, interpretable, and generalizable navigation.

Dual-Anchoring Framework

Figure 2

Figure 2: Overview of the Dual-Anchoring Framework, extending a streaming Video-LLM backbone with auxiliary supervision for instruction progress and memory anchoring.

Instruction Progress Anchoring

Instruction Progress Anchoring addresses Progress Drift by compelling the agent to generate semantically structured progress descriptions at each step, explicitly delineating subsumed versus pending sub-instructions. To obtain high-quality, fine-grained supervision, the framework synthesizes progress data at scale using a multimodal LLM with visual prompting, chain-of-thought reasoning, and instruction-aligned refinement. Figure 3

Figure 3: Data generation pipeline for (a) high-fidelity progress descriptions and (b) temporally grounded landmark frames with object-centric SAM features.

During training, these structured progress annotations are used as co-training targets, with action prediction conditioned on explicit declarations of instruction status. This supervision induces the agent to maintain a persistent, verifiable mapping between its decision process and instruction semantics, thereby preventing brittle or shortcut-driven policy learning.

Memory Landmark Anchoring

Memory Drift is attacked via Memory Landmark Anchoring, introducing a retrospective world modeling component. Here, the framework first decomposes instructions into atomic sub-goals, identifies temporal frames where referenced landmarks are encountered, and extracts their dense spatial features using the Segment Anything Model (SAM). The agent is then supervised to reconstruct these features from its internal video-LLM representations at each step, effectively creating a "rear-view mirror" mechanism. This spatially explicit loss grounds the agent’s memory in persistent, object-centric representations, fostering robust scene understanding and trajectory awareness.

Training Strategy and Data Composition

The methodology leverages an extensive, diverse mixture of standard navigation trajectories, self-collected progress/landmark datasets, DAgger rollouts, and general vision-language corpora to avoid overfitting and catastrophic forgetting. A two-stage training paradigm is adopted: navigation data pre-training with dual-anchoring objectives, followed by DAgger-augmented and vision-language fine-tuning.

Empirical Evaluation

Simulation Results

Figure 4

Figure 4: Qualitative comparison in simulation; baseline diverges due to drift, while Dual-Anchoring maintains trajectory fidelity and interpretable internal states.

Empirical results on R2R-CE and RxR-CE benchmarks demonstrate strong numerical gains: on R2R-CE, Success Rate (SR) increases from 56.9% to 65.6%, and on the more challenging RxR-CE, SR reaches 61.7%, a substantial improvement (+8.8% absolute) over the strong Video-LLM baseline. These improvements are especially pronounced on long-horizon trajectories (24.7% relative gain), validating the framework's critical utility for scaling VLN to real-world time horizons.

Ablation studies further corroborate the complementary nature of the two regularization branches. Instruction Progress Anchoring alone primarily boosts semantic alignment and immediate goal tracking (SR), while Memory Landmark Anchoring yields improvements in navigation efficiency (SPL) and error metrics. The combination consistently yields optimal results across data regimes. Figure 5

Figure 5: Success Rate and SPL improvements as a function of increasing trajectory length, highlighting enhanced stability of Dual-Anchoring under long horizons.

Data Quality and Real-World Generalization

Rigorously designed metrics demonstrate that the self-collected progress and landmark data possesses low hallucination rates and strong logical consistency; object-centric landmark frames are found in target locations at over 75% frequency (versus 13.9% random baseline). These robust signals drive downstream performance gains. Figure 6

Figure 6: Real-world deployment on a quadruped robot, showing robust instruction-state tracking and domain transfer.

Zero-shot deployment on a Unitree Go2 quadruped robot validates transferability: the model, trained solely in simulation, correctly generates interpretable progress descriptions and remains robust to domain shift, mitigating state drift in real-world navigation tasks.

Implications, Theoretical Perspectives, and Future Directions

This work foregrounds the limitations of pure next-action supervised Video-LLMs for embodied navigation, especially in their inability to maintain persistent, grounded internal state representations across extended compositional tasks. The Dual-Anchoring Framework offers a principled augmentation: by integrating fine-grained, verifiable progress and history supervision, it explicitly couples the agent’s cognition to both instruction semantics and fixed visual anchors. This approach advances both the practical performance and interpretability of VLN policies, suggesting a broader design paradigm for embodied AI in which internal state must be explicitly regularized and grounded by structure-aware auxiliary constraints.

Future developments could extend dual-anchoring principles to online adaptation, interactive dialog navigation, and applications beyond navigation such as embodied instruction following with manipulation, general multi-modal sequential decision-making, or continual learning settings. Integration with SLAM-like scene graphs or richer memory architectures is also a promising trajectory for scaling context-awareness. The work opens avenues for using retrospective world models not merely for physical simulation but as continual trackers of semantic and episodic memory in agent architectures.

Conclusion

Dual-Anchoring establishes explicit state regularization as a critical component for robust, generalizable, and interpretable VLN in continuous 3D environments. By anchoring the agent’s cognition along both progress and memory axes, the proposed framework demonstrably overcomes the core limitations of prior Video-LLM navigators under long-horizon and real-world conditions, setting a new performance standard and theoretical baseline for state management in embodied foundation models (2604.17473).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 0 likes about this paper.