Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
Abstract: Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Please see video results at https://recuriosity.github.io/.
Explain it Like I'm 14
Overview
This paper is about teaching a virtual "camera agent" (think of a tiny drone with a camera) to explore 3D spaces on its own by being curious. Instead of waiting for a teacher to give it rewards like "good job" when it reaches a goal, the agent gives itself points for discovering new parts of the world. The big idea is that curiosity works best when:
- the agent keeps a strong memory of what it has already seen in this episode, and
- there's a steady, sensible way to tell whether a view is truly new.
The authors show how to combine these two ideas so the agent explores houses and rooms efficiently, learns useful habits, and later adapts quickly to tasks like finding apples or matching a photo view.
Key Questions
The paper asks three simple questions:
- How can we make a curious agent that doesn't get stuck wandering in circles?
- What kind of "memory" does the agent need to avoid revisiting the same places over and over?
- Can an agent trained to explore just by being curious later learn real tasks faster than starting from scratch?
How They Did It
To make this easy to picture, imagine you're exploring a new school building:
- You keep a mental map of where you've been so you don't loop back by mistake.
- You feel excited when you see a hallway or room you haven't visited before.
The authors give the agent two key tools that mirror this:
- A persistent world model during training
- What it is: While the agent explores, a fast 3D builder makes a growing model of the world from the camera's pictures and depth (distance) information. Think of it as a living 3D scrapbook of everything seen so far.
- Why it matters: When the agent looks from a new angle, the 3D scrapbook tries to "predict" what the camera should see. If the real camera image looks different in an important way, that means the agent has found something truly new, so it gets a curiosity reward. If it's the same old stuff, it gets little or no reward.
- Important detail: This 3D builder is used only while training to compute the curiosity reward. At test time, the agent doesn't need a map; it acts just from the video it sees.
- An episodic memory inside the agent
- What it is: The agent's brain is a sequence model (a transformer) that looks at a chain of recent images and actions, not just the current frame. It's like the agent keeps a running memory of the episode.
- Why it matters: With this memory, the agent can backtrack through places it has already seen to reach new branches, instead of getting stuck or forgetting where it's been.
A few extra, human-friendly notes:
- "Curiosity reward" = points for discovering truly new views, measured by how much the prediction from the 3D scrapbook disagrees with the actual camera view (after smoothing out tiny details so it doesn't get fooled by noisy textures).
- "Sparse reward" = the world doesn't hand out points often, so the agent must care about its own curiosity signal to keep learning.
- To prevent the agent from becoming too cautious, the authors sometimes mix in random actions during training. This keeps exploration lively and helps the agent escape slow or repetitive behavior.
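The curiosity reward described in the notes above can be illustrated with a minimal sketch. It is not the authors' implementation: the block-mean smoothing, the grayscale list-of-lists image format, the function names, and the single `threshold` value are all assumptions made for illustration. The paper's own thresholds (T_new, T_old) and filter are not reproduced here.

```python
from typing import List

Image = List[List[float]]  # grayscale image with values in [0, 1] (assumed format)


def block_mean(img: Image, s: int) -> Image:
    """Average non-overlapping s x s blocks: a crude low-pass filter plus down-sampling."""
    h, w = len(img) // s, len(img[0]) // s
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [img[i * s + di][j * s + dj] for di in range(s) for dj in range(s)]
            row.append(sum(block) / (s * s))
        out.append(row)
    return out


def curiosity_reward(rendered: Image, observed: Image, s: int = 2, threshold: float = 0.1) -> int:
    """Return 1 if the smoothed prediction error exceeds `threshold`, else 0.

    `rendered` stands in for the view predicted by the persistent world model,
    `observed` for the real camera frame. `threshold` is a hypothetical cutoff.
    """
    pred = block_mean(rendered, s)
    real = block_mean(observed, s)
    n = len(pred) * len(pred[0])
    error = sum(abs(p - r) for pr, rr in zip(pred, real) for p, r in zip(pr, rr)) / n
    return 1 if error > threshold else 0


# Usage: fine checkerboard noise is smoothed away, a new bright region is not.
known = [[0.5] * 4 for _ in range(4)]
noisy = [[0.5 + (0.05 if (i + j) % 2 else -0.05) for j in range(4)] for i in range(4)]
novel = [[0.5, 0.5, 1.0, 1.0] for _ in range(4)]
print(curiosity_reward(known, noisy), curiosity_reward(known, novel))
```

Smoothing before comparing is what keeps high-frequency texture noise from being scored as novelty, while a large unseen region still produces a reward.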
Main Findings
Here are the big results and why they matter:
- Better exploration with only a camera at test time: The agent covered more of new 3D homes faster than other methods that rely on hand-built maps or depth sensors during deployment. That's impressive because at test time it only uses RGB video frames: no special map, no extra sensors.
- Memory matters (on both sides):
- If the 3D scrapbook is short-term or non-persistent, the agent can get "fake" curiosity points by revisiting forgotten places and ends up looping.
- If the agent itself doesn't remember its recent journey, it also falls into loops.
- Together, a persistent world model (for the reward) and an agent with episodic memory (for decision-making) are crucial to unlock stable, long-range exploration.
- Generalizes to new worlds: After training on realistic indoor scenes, the agent could explore different buildings and even AI-generated fantasy worlds without extra training. This means it learned general exploration skills, not just memorized specific maps.
- Learns new tasks faster:
- Apple picking: The agent found and "picked" more apples than a brand-new agent trained only on that task. This advantage was strongest when apples were rare (sparser rewards), showing the power of a curiosity-trained explorer.
- Image-goal navigation: Given a target picture, the fine-tuned agent reached the matching viewpoint more often than a from-scratch agent. Its exploration habits helped it search smartly.
Why This Matters
- A recipe for curiosity that scales: The paper shows curiosity can work in complex, realistic 3D spaces, if you pair it with both a persistent view of the world (for honest novelty signals) and an agent that remembers its own path (for smart decisions).
- Less reliance on maps and extra sensors at deployment: The trained agent runs end-to-end from camera images alone, which makes it simpler and more flexible to use in different environments and tasks.
- Faster learning on real tasks: Pretraining with curiosity gives the agent a "sense of direction" for exploration, helping it learn new goals with fewer trials, especially when rewards are rare.
- A guide for future world models: As video and 3D world models improve, this work highlights that "spatial persistence" and continuous updating are must-haves if we want curiosity-driven agents to behave well in the real world.
In short, the paper's message is in its title: if you want an AI to be a great explorer, make it remember to be curious, and give it the memory and steady signals it needs to do that reliably.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide future research.
- Dynamic environments are out of scope; how to extend curiosity with persistent world models to scenes with moving objects, changing lighting, and non-stationary dynamics without rewarding transient changes?
- Training depends on privileged depth and pose to build the 3DGS forward model; what happens under noisy, biased, or self-supervised pose/depth estimates (e.g., DUSt3R, VGGT), and can curiosity be learned without ground-truth sensors?
- The computational footprint is large (8×80GB H100 for 5.5 days; frequent online 3DGS optimization); what trade-offs exist between forward-model quality, update frequency, densification/pruning schedules, and exploration performance on resource-limited hardware?
- Action space and embodiment are simplified (forward, look left/right, pause; spherical free-flight drone); how does the approach fare under continuous control, realistic dynamics (e.g., quadrotor physics), locomotion constraints, and richer action sets?
- Motion is deterministic with ideal collision checking via ray-tracing; robustness to actuation noise, localization drift, and contact uncertainty remains untested.
- Curiosity reward is a binary thresholded low-pass image discrepancy; its sensitivity to filter type, downsampling factor, and thresholds (T_new, T_old) is not analyzed; compare against principled novelty measures (information gain, model uncertainty, density estimation).
- 3DGS provides geometry/appearance persistence but no explicit semantics; can semantic world models (e.g., panoptic 3D reconstructions) yield curiosity that prioritizes task-relevant novelty?
- The forward model is only used during training; can leveraging a distilled or compact persistent state at test time improve targeted navigation without sacrificing end-to-end flexibility?
- Episodic memory design (sliding window size W, placement/capacity of the linear-attention memory, query token formulation) lacks systematic exploration; what are scaling laws and failure modes for very long episodes?
- No theoretical account explains why persistence plus episodic context mitigates curiosity loops; formalize conditions under which intrinsic rewards become stationary and policies converge.
- Baseline coverage excludes several intrinsic motivation methods (RND, Disagreement, RIDE, E3B) in photorealistic 3D; run controlled head-to-head comparisons under matched sensors, action spaces, and collision handling.
- Evaluation relies on 3D completeness and average distance; add metrics for exploration efficiency (time-to-novelty, loopiness), semantic coverage, safety (collision rate, near-misses), and compute/energy per unit coverage.
- Generalization is shown in simulators (Gibson, two AI-generated worlds) but not in the real world; validate on physical robots with real sensing/actuation, clutter, and dynamic agents to assess sim-to-real transfer.
- Multi-agent exploration is not considered; investigate shared persistent world models and coordinated episodic memory for cooperative coverage.
- Reward sparsity mitigation uses a fixed random action injection schedule; evaluate alternative strategies (intrinsic goal-setting, option frameworks, adaptive entropy schedules) and their interactions with episodic memory.
- Forward-model persistence ablations demonstrate qualitative benefits, but the minimal memory horizon required for effective exploration is not quantified; map exploration quality versus forward-model memory length.
- Downstream tasks are limited to apple-picking and image-goal navigation; extend to manipulation, multi-step objectives, language-conditioned goals, and tasks requiring long-term semantic reasoning.
- The image-goal success criterion requires privileged 3D point visibility; devise deployable evaluation protocols that avoid ground-truth meshes and depth at test time.
- Potential 3DGS failure modes (specular/reflective surfaces, textureless regions, strong view-dependent effects) are not characterized; test and adapt curiosity under adverse visual conditions.
- Impact of forward-model bias/artifacts on learning (rewarding reconstruction errors or lag) is unquantified; develop diagnostics and corrective reward shaping to handle model errors.
- Episodic memory is reset per episode; study lifelong exploration where memory persists across episodes/scenes, avoiding relearning and enabling cumulative knowledge.
- Safety is minimally modeled (collisions halt but aren't penalized strongly); incorporate risk-aware intrinsic rewards and explicit safety budgets to balance novelty-seeking and hazard avoidance.
- Visual backbone choices (DINOv2 vs. alternatives), fusion strategies, and multimodal inputs at test time are not ablated; isolate which features most improve exploratory behavior.
- The behavior policy is annealed from a mixture with uniform random during training but deterministic at test time; assess whether controlled stochasticity at deployment benefits coverage or goal-reaching.
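The random-action mixture mentioned in the list above can be sketched as follows. The glossary quotes a schedule "from 20% to zero over 5 million steps"; the linear shape of the anneal, the function names, and the integer action ids are assumptions made for this illustration, not details confirmed by the paper.

```python
import random
from typing import Callable, Sequence


def mixing_coefficient(step: int, start: float = 0.2, anneal_steps: int = 5_000_000) -> float:
    """Probability of taking a random action; assumed to decay linearly from `start` to 0."""
    return start * max(0.0, 1.0 - step / anneal_steps)


def sample_action(policy_sample: Callable[[], int], actions: Sequence[int],
                  step: int, rng: random.Random) -> int:
    """With probability mixing_coefficient(step) act uniformly at random, else follow the policy."""
    if rng.random() < mixing_coefficient(step):
        return rng.choice(actions)
    return policy_sample()


# Usage with hypothetical action ids: 0=forward, 1=look left, 2=look right, 3=pause.
rng = random.Random(0)
actions = [0, 1, 2, 3]
early = [sample_action(lambda: 0, actions, 0, rng) for _ in range(1000)]
late = [sample_action(lambda: 0, actions, 6_000_000, rng) for _ in range(1000)]
print(sum(a != 0 for a in early), sum(a != 0 for a in late))
```

Once the coefficient reaches zero, every action comes from the learned policy, which matches the description of the mixture being used only during training.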
Practical Applications
Immediate Applications
The following applications can be piloted or deployed today by leveraging the paper's core insights: (1) episodic, long-horizon policies that operate on RGB-only input at deployment; (2) curiosity-driven pretraining using a persistent world model (online 3D Gaussian Splatting) to supply stable intrinsic rewards; (3) efficient fine-tuning to sparse-reward downstream tasks; and (4) simple training-time regularization via intermittent random actions.
- Robotics pretraining for sparse-reward tasks (navigation, object search)
- Sectors: robotics, software, education/academia
- What: Use the exploration-pretrained RGB-only policy as a general backbone, then fine-tune with minimal extrinsic rewards for tasks such as image-goal navigation or object finding (e.g., "apple picking" analogs like locating valves, tools, or QR tags).
- Tools/products/workflows: "Curiosity-pretrained policy" checkpoint; fine-tuning scripts on PPO; reward wrappers for object detectors; ROS integration to map discrete actions to mobile base/mini-UAV commands.
- Assumptions/dependencies: Static or mostly static indoor spaces; training-time pose and depth (obtainable via SLAM/LiDAR or motion capture); sim-to-real calibration; safety layer for collision avoidance.
- After-hours facility exploration and coverage for security and maintenance
- Sectors: security, facilities management, enterprise robotics
- What: Deploy robots after hours to explore offices/warehouses, maximize coverage (3D completeness), and flag hard-to-reach spaces for human inspection.
- Tools/products/workflows: Coverage analytics dashboard (based on the paper's 3D completeness and average-distance metrics); ROS package to execute the RGB-only policy on a perimeter/patrol robot; basic anomaly tagging via add-on detectors.
- Assumptions/dependencies: Environments are largely static during runs; fallback safety controller; compliance with building access/IT rules.
- Reality-capture "explorer-in-the-loop" for scanning teams
- Sectors: AEC (architecture, engineering, construction), real estate, digital twins
- What: Use the policy to suggest "next moves" to maximize novel viewpoints and reduce missed areas during photogrammetry or NeRF/3DGS capture (operator-in-the-loop on a handheld rig or tethered drone).
- Tools/products/workflows: Laptop or edge device runs the policy and overlays waypoints on a tablet HUD; post-hoc coverage reporting using the paper's metrics.
- Assumptions/dependencies: Static scenes during capture; calibrated rig; operator retains control; regulatory compliance for UAVs.
- Automated playtesting and coverage QA for game and synthetic worlds
- Sectors: gaming, simulation/content platforms
- What: Run the agent to probe 3D maps for accessible coverage, dead-ends, and unreachable areas; OOD generalization makes it robust to diverse art styles and procedural content.
- Tools/products/workflows: Editor plugin that spawns exploration episodes, computes coverage metrics, and outputs heatmaps of "unseen" spaces; CI hook for level regression tests.
- Assumptions/dependencies: Stable control interface to the engine; primarily static level geometry for the exploration runs.
- Retail layout onboarding and inventory mapping pilots
- Sectors: retail robotics, logistics
- What: Use exploration to quickly learn new/store-refit layouts, produce initial coverage sweeps, and seed downstream tasks (e.g., aisle patrolling or shelf scanning) with fine-tuning.
- Tools/products/workflows: Initial exploration run with RGB-only policy; follow-up short fine-tune on store-specific targets; compliance logging for coverage.
- Assumptions/dependencies: Runs scheduled when stores are closed; static shelving during runs; store policies on data capture and privacy.
- Academic toolkit for persistent-curiosity research
- Sectors: academia, open-source software
- What: Package and release a training harness that couples online 3DGS-based intrinsic rewards with episodic transformer agents, plus memory ablations and evaluation scripts.
- Tools/products/workflows: PyTorch training code; Habitat-based environment configs; 3DGS training-time module; standardized 3D completeness metrics and reporting.
- Assumptions/dependencies: Multi-GPU training (the paper used 8×80GB H100); HM3D/Gibson licenses; 3DGS libraries.
- Training-time regularization recipe for long-horizon RL
- Sectors: robotics/software R&D
- What: Adopt the mixed-policy sampling (scheduled uniform random action injection) to stabilize exploration when intrinsic rewards become sparse.
- Tools/products/workflows: PPO wrappers that track behavior distribution and anneal a mixing coefficient; hyperparameter presets.
- Assumptions/dependencies: Discrete action spaces or discretized controls; careful annealing and logging to avoid destabilization.
- Benchmarks and metrics adoption for exploration coverage
- Sectors: academia, evaluation services, robotics QA
- What: Standardize coverage metrics (3D completeness at fixed step horizons, average surface-point distance) to compare exploration policies in simulators and labs.
- Tools/products/workflows: Evaluation kit; dataset splits; reporting templates for leaderboards.
- Assumptions/dependencies: Ground-truth mesh or sufficiently dense scan for evaluation.
- Pilot deployments for inspection target search in controlled industrial spaces
- Sectors: manufacturing, utilities (static bays/off-hours), data centers
- What: Use the fine-tuned policy to locate visual targets (gauges, panels, indicators) in structured, mostly static environments; combine with small rewards for detections.
- Tools/products/workflows: Detector-in-the-loop reward shaping; safety supervisor; coverage and revisit reporting.
- Assumptions/dependencies: Static or low-dynamics periods; clear line-of-sight to targets; facility safety and compliance.
- Educational demos for embodied AI
- Sectors: education, outreach
- What: Use the agent in Habitat or similar sims to teach curiosity, intrinsic motivation, and memory in RL with tangible downstream tasks.
- Tools/products/workflows: Instructor notebooks; modular ablations to visualize the impact of episodic memory and world persistence.
- Assumptions/dependencies: Access to GPUs and sim assets.
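The coverage metrics named in the list above (3D completeness and average surface-point distance) could be computed along the following lines. This is a hedged, brute-force sketch: the point-tuple format, the function names, and the `radius` cutoff of 0.05 are assumptions chosen for illustration, and the paper's exact definitions may differ.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distance(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from p to its closest point in cloud (O(n) scan)."""
    return min(math.dist(p, q) for q in cloud)


def completeness(gt_points: Sequence[Point], observed: Sequence[Point], radius: float = 0.05) -> float:
    """Fraction of ground-truth surface points with an observed point within `radius`."""
    covered = sum(1 for p in gt_points if nearest_distance(p, observed) <= radius)
    return covered / len(gt_points)


def average_distance(gt_points: Sequence[Point], observed: Sequence[Point]) -> float:
    """Mean distance from each ground-truth surface point to the nearest observed point."""
    return sum(nearest_distance(p, observed) for p in gt_points) / len(gt_points)


# Usage: four ground-truth points along a line, only the first two observed.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
seen = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(completeness(gt, seen), average_distance(gt, seen))
```

For real scans the linear scan would normally be replaced by a spatial index such as a k-d tree; the brute-force version is kept here only so the sketch needs nothing beyond the standard library.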
Long-Term Applications
These require further R&D, scaling, dynamic world modeling, or productization beyond current constraints (notably: static-scene assumption, training-time reliance on pose/depth, and heavy training compute).
- Dynamic-world curiosity with persistent action-conditioned models
- Sectors: robotics (service, healthcare, industry), autonomy research
- What: Replace 3DGS with a spatially persistent, action-conditioned video/world model that updates online in dynamic scenes (moving people/objects), enabling curiosity in live environments.
- Tools/products/workflows: Onboard world model with spatial memory; continual learning pipelines; drift detection and safe policy fallback.
- Assumptions/dependencies: Robust spatial persistence in generative models; compute- and memory-efficient on-device inference; safety certification.
- Home service robots that explore, then specialize with minimal supervision
- Sectors: consumer robotics, smart home
- What: Robots that autonomously explore new homes with RGB-only at runtime, then fine-tune to user-specific tasks (find objects, fetch-and-carry, room-aware reminders).
- Tools/products/workflows: Privacy-preserving on-device training; user-in-the-loop reward signals ("found the keys"); episodic memory management.
- Assumptions/dependencies: Robust perception under changing layouts; privacy/security guarantees; safe operation around people and pets.
- Search-and-rescue and emergency response exploration
- Sectors: public safety, defense, insurance
- What: UAVs/UGVs autonomously probe unknown buildings post-incident to map coverage, locate exits/victims, and report occluded spaces.
- Tools/products/workflows: Dynamic obstacle handling; thermal/specialty sensor fusion; explainable coverage reports; operator handoff mechanisms.
- Assumptions/dependencies: Highly dynamic scenes; strict safety and regulatory constraints; adverse conditions (smoke, dust, low light).
- Multi-robot cooperative exploration with shared episodic memory
- Sectors: logistics, industrial inspection, construction
- What: Teams of robots share a persistent, continuously updated world memory (distributed or cloud) to coordinate coverage and reduce redundancy.
- Tools/products/workflows: Federated memory fusion (e.g., distributed 3DGS or successor models); comms robustness; task allocation.
- Assumptions/dependencies: Reliable networking or delay-tolerant synchronization; consistent calibration across platforms; conflict resolution in shared maps.
- On-device AR guidance for casual 3D capture
- Sectors: consumer AR, creative tools, real estate
- What: Smartphone AR apps that guide users with "next-best-view" prompts powered by an RGB-only exploration policy, targeting complete, artifact-free scans for digital twins or 3D listings.
- Tools/products/workflows: Lightweight model distillation; on-device DINO-like features; UI that visualizes coverage/novelty in real time.
- Assumptions/dependencies: Mobile inference efficiency; battery and thermal constraints; robust pose estimation on commodity devices.
- Autonomous cinematography and tour generation
- Sectors: media production, travel/tourism, cultural heritage
- What: Camera robots that explore and then plan cinematic coverage of interiors (museums, venues) with minimal operator input.
- Tools/products/workflows: Semantic priors for framing and aesthetics; shot planning over persistent memory; collision-aware smooth trajectories.
- Assumptions/dependencies: Mixed static/dynamic crowds; venue permissions; high-level aesthetic reward models.
- Foundation models for embodied exploration
- Sectors: AI platforms, robotics vendors
- What: Large-scale pretraining of exploration policies across diverse 3D worlds (scans + generative worlds), fine-tuned for downstream tasks (navigation, search, manipulation).
- Tools/products/workflows: Data engines combining simulators and synthetic worlds; standardized reward APIs; cross-embodiment action abstractions.
- Assumptions/dependencies: Broad sim-to-real generalization; governance around synthetic data bias; compute and carbon costs.
- Regulatory and policy frameworks for curiosity-driven autonomy
- Sectors: policy, standards bodies, enterprise governance
- What: Safety, privacy, and accountability standards for intrinsically motivated robots that move through private spaces; audit trails for exploration decisions.
- Tools/products/workflows: Explainability tools that reconstruct episodic memory used for actions; on-device redaction; geofencing and "do-not-explore" constraints.
- Assumptions/dependencies: Consensus on acceptable data retention and use; certification pathways; integration with building access controls.
- Large-scale facility digitization and continual updating
- Sectors: industrial operations, energy, smart buildings
- What: Periodic autonomous exploratory passes keep digital twins up-to-date, flagging structural changes or occluded areas needing human follow-up.
- Tools/products/workflows: Scheduling across downtime windows; change detection atop persistent memory; operator dashboards.
- Assumptions/dependencies: Mixed dynamics in live facilities; infrastructure for autonomous charging/dispatch; integration with CMMS/BIM.
- Curriculum design for robust long-horizon control
- Sectors: academia, industrial R&D
- What: Use scheduled random-action mixtures and intrinsic rewards as a general curriculum for long-horizon tasks beyond exploration (e.g., multi-room manipulation, tool use).
- Tools/products/workflows: RL training curricula templates; policy validation suites; ablation harnesses for memory modules.
- Assumptions/dependencies: Task-specific safety envelopes; scalable training infrastructure.
Cross-cutting assumptions and dependencies to keep in mind
- Static-scene assumption: The presented method's strongest results are in static indoor environments; performance may degrade with frequent layout changes or moving agents/objects until dynamic persistent world models mature.
- Training-time privileges: Depth and pose are needed during training to build the 3DGS forward model (can be sourced via SLAM/LiDAR). Deployment uses RGB-only.
- Action space and embodiment: The paper used a discrete action set and a drone-like embodiment; real platforms need action mapping, safety layers, and possibly continuous control.
- Compute and data: Curiosity pretraining is compute-intensive (hundreds of millions of steps) and data-hungry; distilled or smaller models may be needed for edge deployment.
- Safety, privacy, and compliance: Exploration in private or regulated spaces requires data governance, fail-safe behaviors, and operator oversight.
Glossary
- 3D Gaussian Splatting (3DGS): An explicit, real-time 3D radiance field representation using Gaussian primitives for reconstruction and rendering; used here as a persistent world model. "We instantiate the forward model as an online 3D Gaussian Splatting (3DGS) model of the world"
- 3DGS-MCMC: A densification method for 3DGS that leverages Markov Chain Monte Carlo to refine and add Gaussian primitives. "densified via 3DGS-MCMC [15]."
- A* local planner: A graph search algorithm commonly used for path planning; cited here as a baseline component incompatible with the authors' setup. "a test-time collision-unaware A* local planner"
- action entropy coefficient: A hyperparameter scaling entropy regularization to maintain policy stochasticity during training. "the action entropy coefficient decayed at a rate of 0.99 from an initial value of 0.1."
- action-conditioned video models: Generative models that predict future observations conditioned on the agent's actions. "action-conditioned video models show promise."
- actor-critic: An RL architecture combining a policy (actor) and value function (critic) for learning control and state values. "connected to the actor and critic heads that output an action distribution and a value estimate"
- annealing: Gradually reducing a training parameter (e.g., a mixing coefficient) over time to stabilize learning. "with the mixing coefficient annealed to zero over training"
- bird's-eye-view: A top-down projection used for visualization or mapping. "trajectories are overlaid on bird's-eye-view for visualization only."
- causal temporal self-attention: An attention mechanism that only attends to current and past tokens, preserving temporal causality. "Tokens are processed by causal temporal self-attention"
- curiosity-driven reinforcement learning: An RL paradigm where intrinsic rewards based on novelty or prediction error drive exploration. "Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality."
- differentiable renderer: A rendering process that supports gradient-based optimization; here, used to render from 3DGS. "where R denotes the differentiable 3DGS renderer."
- DINOv2: A self-supervised vision transformer producing robust visual features used to augment the agent's perceptions. "We also take the RGB image processed by DINOv2 [17] to provide richer visual features."
- down-sampling operator: An operation that reduces image resolution, often to stabilize or simplify comparisons. "D_s is down-sampling operator by a factor of s"
- episodic context: Short-term memory of observations within an episode that guides navigation and exploration decisions. "stems from a lack of spatial persistence and episodic context."
- forward model: A predictive model estimating future observations given past observations and actions. "The forward model is tasked with predicting the next observation conditioned on an action"
- Gibson (dataset): A benchmark of indoor environments for embodied AI and navigation evaluation. "generalizes zero-shot to Gibson and AI-generated worlds."
- Habitat (simulator): A simulation platform for embodied agents to interact with 3D environments. "a 90° FOV forward camera in Habitat [23]."
- HM3D (dataset): Habitat-Matterport 3D dataset of large-scale indoor scenes for embodied AI. "Trained purely via curiosity on HM3D, our agent outperforms active-mapping baselines"
- image-goal navigation: A task where the agent must reach the viewpoint corresponding to a given target image. "our exploration agent, when fine-tuned for a few episodes on image-goal navigation reward, outperforms an agent trained from scratch"
- intrinsic reward: An internally generated learning signal (e.g., surprise) that incentivizes exploration without external task rewards. "the agent derives intrinsic reward from surprise"
- Intrinsic Curiosity Module (ICM): An approach that rewards prediction error in a learned dynamics model to drive exploration. "Traditional methods like ICM [5] lack this property"
- linear attention: An attention variant with linear complexity, often maintaining a compact global state for long contexts. "a linear-attention module with a global hidden state"
- LoGeR: A long-context memory architecture used as inspiration for the authors' global memory module. "LoGeR-style long-context architectures [18, 19]."
- low-pass filter: A filter that attenuates high-frequency image details, used to stabilize novelty estimation. "where B is a low-pass filter"
- navmesh: A navigation mesh representing traversable surfaces used by planners; here, explicitly avoided to prevent shortcuts. "Our drone agent is not constrained to the scene navmesh."
- next-best-view (NBV): A strategy to select viewpoints that maximize expected information gain for mapping. "Traditional next-best-view (NBV) methods greedily select viewpoints to maximize geometric information gain"
- Occupancy Anticipation (OccAnt): A learned mapping approach that anticipates occupancy to guide exploration. "Occupancy Anticipation (OccAnt) [8]"
- on-policy RL: Reinforcement learning that updates the policy using data collected by the current policy. "a stable reward for the agent to optimize with on-policy RL."
- online 3D reconstruction: Incremental building of a 3D scene model from streaming RGB-D observations. "We therefore utilize a state-of-the-art online 3D reconstruction method (3DGS) as a proxy"
- Plücker-ray image: An image-based encoding of rays using Plücker coordinates to represent intended camera motion. "Plücker-ray image [16]"
- privileged inputs: Training-only sensor information not available at test time (e.g., depth, pose). "the privileged inputs required at training time are the camera pose and depth image."
- Proximal Policy Optimization (PPO): A policy gradient algorithm using clipped objectives to stabilize updates. "We optimize our actor-critic policy using PPO [20]."
- random policy regularizer: A technique that intermittently samples random actions during rollouts to encourage exploration. "with the random policy regularizer scheduled from 20% to zero over 5 million steps"
- self-supervised RL: Reinforcement learning driven by intrinsic signals or self-generated objectives rather than external labels. "we formulate exploration as a self-supervised RL problem"
- sliding-window attention: Attention restricted to a recent temporal window to keep computation tractable over long sequences. "Sliding-window attention provides efficient direct local context"
- transformer backbone: A transformer-based model serving as the core architecture for sequence processing and control. "we use a transformer backbone"
- uniform random policy: A policy that selects actions uniformly at random, used here to maintain exploration during training. "we occasionally sample actions from a uniform random policy"
- world model: An internal model predicting environmental dynamics or observations to support planning and curiosity. "the prediction error of a world model - trained alongside the agent - to anticipate the consequences of its actions"
- zero-shot generalization: The ability to transfer to new environments or tasks without further training. "generalizes zero-shot to Gibson and AI-generated worlds."
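The glossary entries on causal temporal self-attention and sliding-window attention can be made concrete with a small mask-building sketch. The function name and the boolean list-of-lists representation are assumptions for illustration; this shows the general masking technique, not the authors' architecture.

```python
from typing import List


def causal_window_mask(num_tokens: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Causal: j <= i (no looking at future tokens).
    Sliding window: i - j < window (only the most recent `window` tokens).
    """
    return [[(j <= i) and (i - j < window) for j in range(num_tokens)]
            for i in range(num_tokens)]


# Usage: 5 time steps, each attending to itself and the 2 previous steps.
mask = causal_window_mask(5, 3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost grows linearly with episode length; context older than the window has to be carried by some other mechanism, such as the global linear-attention state the glossary mentions.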