GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

Published 12 May 2026 in cs.LG, cs.AI, and cs.CL | (2605.11853v2)

Abstract: Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.


Authors (10)

Summary

The paper presents GEAR, a novel framework that leverages self-distillation for fine-grained credit assignment in reinforcement learning for LLM agents.
It employs token-level reverse KL divergence and entropy-based segmentation to dynamically reweight advantages, reporting gains of up to around 20% over GRPO on the benchmarks where GRPO is weakest.
The framework enhances policy stability and sample efficiency in long-horizon tasks without relying on external reward models.

Granularity-Adaptive Credit Assignment for LLM Agents: The GEAR Framework

Motivation and Problem Setting

LLMs deployed as agents in complex reasoning and tool-use environments require reinforcement learning (RL) with fine-grained credit assignment for effective policy optimization. Traditional agentic RL methods such as Group Relative Policy Optimization (GRPO) assign outcome-level rewards, broadcasting identical trajectory-level advantages to all tokens or actions in a trajectory, disregarding the local heterogeneity of their contributions. In long-horizon tasks, this coarse credit propagation introduces credit diffusion, reinforcing suboptimal behaviors and penalizing beneficial actions, leading to noisy policy updates and slower convergence.
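To make the "broadcast" behaviour concrete, here is a minimal sketch of group-relative advantage computation followed by copying one scalar to every token. The function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption; implementations of GRPO differ on these details.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standardize each rollout's outcome reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Outcome-level credit: every token receives the same trajectory advantage."""
    return [advantage] * num_tokens

# Four rollouts for one prompt, with binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_advantages(rewards)

# All five tokens of the first rollout get an identical advantage,
# regardless of which tokens actually contributed to the outcome.
token_advs = broadcast_to_tokens(advs[0], 5)
```

Because `token_advs` contains a single repeated value, a helpful step and a harmful step inside the same rollout are pushed in the same direction, which is the credit diffusion described above.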

Fine-grained token- or step-level credit assignment methods exist, but practical application is complicated by the absence of reliable local reward signals and misalignment between arbitrary segmentation units and the behavioral structure of agentic trajectories. The GEAR (Granularity-adaptivE Advantage Reweighting) framework proposes a solution that leverages self-distillation to adaptively partition trajectories at behaviorally salient points and redistribute advantage accordingly, yielding more targeted and outcome-consistent policy gradients.

GEAR: Methodological Overview

GEAR reshapes trajectory-level advantage signals by modulating them with reference-guided, token- and segment-level weights derived from self-distillation. The pipeline consists of the following components:

1. Reference-Guided Divergence Estimation:

For each token in a trajectory, GEAR computes the reverse Kullback–Leibler (KL) divergence between the action probabilities of the on-policy student and a ground-truth-conditioned teacher policy. This reference policy is constructed by conditioning the model on the ground-truth data, yielding a behavioral deviation signal for each token.
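A per-token divergence of this kind can be sketched as follows. This assumes the common convention that "reverse KL" means KL(student ‖ teacher), with the expectation taken under the student's next-token distribution; the summary does not spell out the direction or any smoothing, so both the convention and the `eps` guard are assumptions, and the function names are illustrative.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution."""
    total = 0.0
    for p, q in zip(student_probs, teacher_probs):
        if p > 0.0:  # terms with p == 0 contribute nothing
            total += p * (math.log(p) - math.log(q + eps))
    return total

def token_reverse_kl(student_dists, teacher_dists):
    """One divergence value per token position in the trajectory."""
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]

# Toy two-token trajectory over a two-symbol vocabulary.
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.5, 0.5]]  # teacher sees the ground truth in its context
kls = token_reverse_kl(student, teacher)
```

In the toy example the first position has (numerically) zero divergence because student and teacher agree, while the second position is clearly positive because the student is far more confident than the teacher.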

2. Adaptive Trajectory Segmentation:

Empirically, reverse KL divergence spikes sharply at tokens that mark semantic departures or transitions, such as discourse markers or reasoning shifts (Figure 1).

Figure 1: Frequency of top-20 tokens with normalized reverse-KL $>0.1$ in 100 sampled math trajectories.

Segment boundaries are anchored at these reverse KL spikes, while the extent of each segment is determined by subsequent increases in the token-level policy entropy, which reflects local uncertainty and potential shifts in semantic context (Figure 2).

Figure 2: Token-level visualization results of normalized KL divergence and normalized entropy.

All tokens within a divergent segment inherit the credit weight of the segment onset token, propagating localized discrepancy signals. Tokens not in divergent segments retain their individual KL-based weight.

Figure 3: GEAR computes token-wise reverse KL divergence to detect segment onsets and uses entropy to determine segment extent, redistributing trajectory-level advantage adaptively.
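The segmentation rule described above can be sketched roughly as follows. Several details are assumptions rather than the paper's specification: min-max normalization per trajectory, the threshold values (`kl_threshold=0.1` echoes the cutoff used in Figure 1, `entropy_threshold` is arbitrary), and the choice to close a segment at the next entropy rise or the next KL spike, whichever comes first. All names are hypothetical.

```python
def min_max(values):
    """Scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def segment_weights(kl, entropy, kl_threshold=0.1, entropy_threshold=0.5):
    """Return per-token weights and the half-open (start, end) divergent segments."""
    kl_n, ent_n = min_max(kl), min_max(entropy)
    weights = list(kl_n)  # default: each token keeps its own KL-based weight
    segments = []
    t, n = 0, len(kl_n)
    while t < n:
        if kl_n[t] > kl_threshold:  # KL spike opens a segment at token t
            end = t + 1
            # Extend until entropy rises or another spike starts a new segment.
            while (end < n and ent_n[end] <= entropy_threshold
                   and kl_n[end] <= kl_threshold):
                end += 1
            for i in range(t, end):
                weights[i] = kl_n[t]  # segment inherits the onset token's weight
            segments.append((t, end))
            t = end
        else:
            t += 1
    return weights, segments

kl = [0.0, 0.0, 1.0, 0.02, 0.01, 0.0]
entropy = [0.1, 0.1, 0.2, 0.1, 0.1, 1.0]
weights, segments = segment_weights(kl, entropy)
```

Here the spike at index 2 opens a segment that runs until the entropy jump at index 5, so tokens 2 to 4 share the onset weight even though tokens 3 and 4 have near-zero divergence on their own.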

3. Advantage-Aware Modulation:

The reweighted local credit signal is modulated by the sign of the original trajectory-level advantage: tokens with high divergence (relative to the teacher) are downweighted for successful trajectories and upweighted for unsuccessful ones. This outcome-aware asymmetry ensures that undesirable deviations are suppressed in positive outcomes and emphasized for correction in negative-outcome rollouts. The modulation preserves the low-variance property of GRPO, only introducing multiplicative weights without injecting high-variance step-level rewards.
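One simple way to realize this sign-dependent asymmetry is a linear multiplicative factor, sketched below. The functional form `1 ∓ beta * w` and the strength parameter `beta` are illustrative assumptions; the summary only states the direction of the effect and that the modulation is multiplicative, not the exact formula used by GEAR.

```python
def modulate_advantage(advantage, weights, beta=0.5):
    """Scale a trajectory-level advantage per token.

    weights: divergence-based weights in [0, 1], one per token.
    beta:    hypothetical strength in [0, 1]; keeps every factor non-negative.
    """
    sign = 1.0 if advantage >= 0 else -1.0
    # Positive advantage: factor 1 - beta*w shrinks credit on divergent tokens.
    # Negative advantage: factor 1 + beta*w enlarges the penalty on them.
    return [advantage * (1.0 - sign * beta * w) for w in weights]

weights = [0.0, 1.0]  # aligned token, highly divergent token
success = modulate_advantage(1.0, weights)
failure = modulate_advantage(-1.0, weights)
```

With these inputs, `success` is `[1.0, 0.5]` and `failure` is `[-1.0, -1.5]`: the divergent token receives less reinforcement in the successful rollout and a larger penalty in the failed one. Because the factor never goes negative for `beta` in [0, 1], the sign of the original advantage is never flipped.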

Empirical Evaluation

GEAR is evaluated on eight benchmarks spanning mathematical reasoning (e.g., MATH, GSM8K, AIME24/25) and agentic tool-use tasks (e.g., ToolSandbox, BFCL, ACEBench), using Qwen3-4B and Qwen3-8B models. Comparative baselines include GRPO, feature-based multi-turn RL methods (ARPO, MT-GRPO), and token/turn-level self-distillation approaches.

The reported results include:

On difficult long-horizon settings, where the GRPO baseline is weakest, GEAR improves accuracy by up to around 20% over GRPO.
GEAR shows gains in both in-domain and cross-domain generalization, with notably robust performance on the challenging, distribution-shifted AIME24/25 math benchmarks.
Training curves show that GEAR sustains higher policy entropy than GRPO during training, which suggests more exploration and less premature convergence (Figure 4).

Figure 4: Training curves of GRPO, GEAR, and variants: GEAR achieves the highest mean reward and maintains higher entropy.

Ablation studies indicate that both segment-level propagation and the dynamic segmentation strategy (reverse-KL triggering combined with entropy-based termination) are needed for stable, fine-grained credit propagation. Replacing the adaptive segments with standard units (e.g., turns or tool invocations), or dropping either segmentation cue, degrades performance and localizes credit less accurately.

Theoretical and Practical Implications

GEAR’s integration of self-distillation divergence and dynamic adaptive segmentation addresses key limitations of fixed-unit credit assignment and step-level reward dependence in RL for LLM agents. By using only ground-truth-conditioned self-reference and intrinsic uncertainty without external reward models or manual annotation, GEAR is applicable to a broad class of RL settings with outcome-only supervision. The approach reduces overfitting to in-domain trajectories and fosters transferability, as evidenced by robust cross-domain results.

Practically, this improves policy reliability, stability, and sample efficiency for LLM-based agents operating in multi-step, compositional environments. Theoretically, it foregrounds the value of leveraging intrinsic model signals for behavioral diagnostics and control, suggesting a broader research agenda around reference-guided, structure-adaptive RL for language agents.

Future Directions

GEAR’s design presumes availability of reliable ground-truth trajectories for constructing the teacher policy. Extension to open-ended environments (where ground truth is ambiguous or multiple valid behaviors exist) could employ proxy signals (e.g., ensemble verification, preference modeling) for reference construction. Broader application to multimodal agentic settings and scaling to more expressive trajectory structures are additional avenues.

Further, integrating GEAR’s adaptive advantage propagation with hierarchical or search-augmented policy architectures, and evaluating its interaction with more sophisticated exploration/exploitation schedules, would offer valuable insights.

Conclusion

GEAR introduces a principled, granularity-adaptive advantage reweighting procedure, leveraging reverse KL divergence and entropy-based segmentation anchored in self-distillation, to advance fine-grained credit assignment in RL for LLM agents. Empirical results validate that GEAR consistently outperforms both outcome-level and token/turn-level baselines, achieving strong improvements especially in long-horizon, challenging settings. The framework reduces the reliance on external/local reward models, facilitating scalable, robust RL fine-tuning for agentic LLM systems and motivating new directions in model-driven trajectory analysis and credit assignment (2605.11853).