A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes

Published 2 Apr 2026 in cs.CV | (2604.01882v1)

Abstract: Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper proposes an agentic framework that dynamically selects between 2D and 3D evidence operators to resolve ambiguities in affordance reasoning within 3D Gaussian scenes.
It models the task as a Partially Observable Markov Decision Process, using a MLLM-driven policy optimized via GRPO to achieve up to 16.55 mIoU gain on unseen affordances.
Experimental results demonstrate that adaptive evidence acquisition leads to spatially precise predictions, significantly outperforming static one-shot inference methods in complex, cluttered environments.

Agentic Cross-Dimensional Evidence Acquisition for Fine-Grained Affordance Reasoning in 3D Gaussian Scenes

Problem Setting and Motivation

Affordance reasoning aims to determine the scene regions functionally supporting a specified interaction described in natural language, serving as a core connection between perception and action for embodied agents. Recent 3D Gaussian Splatting (3DGS) representations enable efficient and geometry-aware scene rendering, supporting fine-grained affordance analysis in complex scenes. However, prior approaches predominantly adopt passive one-shot inference from fixed observations, which is fundamentally limited in ambiguous or cluttered environments where task-relevant evidence may be occluded, distributed, or not available in a single view. This evidence-limited setting highlights that the deficit is often not in detection capacity or network expressivity, but rather in the capacity to acquire missing, task-relevant cues.

To address these limitations, "A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes" (2604.01882) proposes an agentic framework for affordance reasoning, where evidence is actively and sequentially acquired via 2D and 3D operators, leveraging policy-driven cross-dimensional querying under partial observability.

Figure 1: The A3R agent maintains a scene-grounded decision state and dynamically selects 2D or 3D evidence operators, progressively updating its affordance belief to resolve ambiguity.

Technical Approach

Problem Formulation

A3R models affordance reasoning as a Partially Observable Markov Decision Process (POMDP) over a 3DGS scene $\mathcal{S}$ and query $q$ . The agent's internal state embeds:

the workspace $\mathcal{S}^{(t)}$ (a region of the 3DGS-mapped environment),
a soft belief map $h_t$ over the $\mathcal{S}$ Gaussians,
an action-evidence history $m_t$ .

At each step, the agent observes context rendered from its current state, and generates evidence-acquisition actions $u_t$ —including 3D part/instance segmentation, 2D semantic mask lifting, zoom, or termination—via an MLLM-based (Multimodal LLM) policy. The acquired evidence is projected onto the Gaussian support and updates $h_t$ , allowing dynamic refinement of the spatial belief in response to ambiguity.

Cross-Dimensional Evidence Operators

A3R interleaves five evidence-acquisition actions:

Inst: 3D instance segmentation for initial distractor suppression,
Part: 3D part segmentation for precise local affordance refinement,
SAM2D: 2D segmentation (via Segment Anything Model) with 3D lifting for semantic disambiguation,
Zoom: recursively focusing the workspace for graduated resolution,
Term: ending evidence acquisition and outputting binary predictions by thresholding $h_t$ .

This design supports coarse-to-fine, cross-modal, and context-dependent evidence gathering, crucial for handling previously unseen affordances or ambiguous scenes.

Figure 2: Sample evidence-acquisition trajectories: the agent adaptively transitions between semantic and geometric operators, iteratively refining affordance localization.

MLLM-Driven Policy Optimization

A3R utilizes a Qwen3-VL architecture for the control policy, with decision-making conditioned on:

multi-view rendered visual context,
the textual affordance query,
structured prompts,
the current decision state and operator schema.

Policy training uses Group Relative Policy Optimization (GRPO), which operates over complete trajectories and updates the MLLM via LoRA, aligning the action generation with task-level reward (spatial mIoU, action efficiency), without requiring dense step-wise supervision.

Experimental Analysis

A3R is evaluated on SeqAffordSplat and 3DAffordSplat, two representative 3DGS-based affordance benchmarks, across Seen and Unseen split protocols. Main results demonstrate:

A3R achieves up to 16.55 mIoU gain over strong one-shot baselines in generalization to unseen affordance types.
The training-free policy already outperforms all static fusion and single-modal variants.
The RL-optimized policy further improves accuracy and step-efficiency.

Qualitative results underline significant differences: static methods yield diffuse or cluttered support in ambiguous cases, while A3R consistently generates spatially precise predictions tightly localized to functional regions, especially for challenging cases with many distractors or subtle part-level distinctions.

Figure 3: A3R yields spatially precise affordance predictions in ambiguous scenes where static methods mislocalize or over-activate regions.

Ablation reveals that geometric operators (especially part segmentation) are critical for localization in 3DGS, semantic operators (SAM2D) boost disambiguation, and workspace refinement (Zoom) further enhances localization sharpness. Adaptive sequencing via the policy backbone outperforms hand-designed or greedy operator chains, particularly on ambiguous and unseen cases.

Theoretical and Practical Implications

A3R challenges the paradigm of passive, one-shot inference for fine-grained affordance grounding in complex 3D environments. By formalizing the problem as evidence-limited and partially observable, it motivates integrating active perception and agentic decision-making with large pretrained models. The demonstrated efficacy of cross-dimensional evidence acquisition establishes that geometry and semantics must be invoked adaptively, not statically fused.

Practically, these agentic affordance models are well-aligned with the requirements of robotic manipulation, human-robot interaction, and AR/VR, where reasoning under uncertainty, dynamic viewpoint or level-of-detail switching, and robust function localization are essential.

On the theoretical side, A3R links perceptual evidence accumulation models to RL+MLLM-driven scene reasoning, with policy learning mechanisms that leverage end-to-end reward without manual trajectory supervision.

Future Directions

The work suggests several further lines:

Extension to open-vocabulary or compositional affordance queries,
More general policies operating over additional evidence operators (e.g., tactile cues, multispectral imagery),
End-to-end fine-tuning where segmentation operators are also updated,
Multi-agent or collaborative perception settings,
Integration into closed-loop action pipelines for robot learning.

Conclusion

A3R constitutes a substantial advance in scene-level affordance reasoning, demonstrating that sequential, agentic cross-dimensional evidence acquisition dramatically improves generalization and localization in 3DGS environments over static or single-modal approaches. The framework establishes a foundation for more general, robust, and interactive reasoning agents—moving affordance grounding methodology towards active, open-world perception and action.

Markdown Report Issue