- The paper proposes a hybrid MT-GRPO+GTPO method that combines discounted per-turn returns with dampened outcome advantages to correct reward misalignment.
- It introduces an Iterative Reward Calibration (IRC) pipeline to assign reward tiers based on empirical discriminative power, achieving significant performance gains.
- Empirical results show that the approach boosts pass rates, outperforming larger models by focusing gradient updates on critical turns in multi-turn, tool-calling customer service tasks.
Problem Setting and Motivation
This paper targets the training of LLMs for tool-calling agent tasks, specifically realistic multi-turn customer service dialogues requiring policy adherence, reasoning, and database manipulation. Two major challenges in these settings are the sparsity of reward signals (typically binary task success) and the difficulty of temporal credit assignment across conversation turns. Existing RL approaches such as Group Relative Policy Optimization (GRPO), Multi-Turn GRPO (MT-GRPO), and Generalized Token-level Policy Optimization (GTPO) have not previously been evaluated on agentic dialogue with tool use, user simulators, and long-horizon interactions.
The crucial insight of the paper is that naively designed dense per-turn rewards, although intuitive, can degrade RL performance: reward-advantage misalignment produces incorrect gradient signals and suboptimal policy updates. Sparse outcome-only reward formulations “accidentally” focus learning on high-leverage turns, but are not systematically optimal. The core focus, therefore, is on per-turn reward design and advantage computation that maximize the discriminative power of the RL signal and improve credit assignment.
Methodology: MT-GRPO+GTPO Hybrid and Iterative Reward Calibration
MT-GRPO and GTPO Recap
MT-GRPO extends GRPO by enabling credit assignment at each conversation turn; per-turn rewards are group-normalized, and their sum is combined with a group-normalized terminal outcome advantage. GTPO augments this with discounted returns, further tempering the outcome's influence over early turns.
A key technical finding is that, for multi-turn tool-calling tasks, standard MT-GRPO with dense rewards leads to advantage direction mismatches: small positive rewards for intermediate (e.g., read-only) actions are overwhelmed by large negative outcome signals in failed episodes, inadvertently suppressing necessary actions. Thus, the resulting policy gradient is counterproductive for these agentic tasks.
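The mismatch described above can be illustrated with a small numeric sketch. It assumes a simplified turn-level advantage (group-normalized per-turn reward plus group-normalized outcome reward); the reward values, the group size, and the helper `group_normalize` are hypothetical and are not taken from the paper.

```python
import numpy as np

def group_normalize(x, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

# Hypothetical group of 4 rollouts, looking at one read-only turn.
# Every rollout makes the same necessary read-only call (reward +0.1),
# but only the first rollout completes the task.
turn_reward = np.array([0.1, 0.1, 0.1, 0.1])
outcome = np.array([1.0, 0.0, 0.0, 0.0])

turn_adv = group_normalize(turn_reward)   # ~0: the reward does not vary in the group
outcome_adv = group_normalize(outcome)    # positive for success, negative for failures
combined = turn_adv + outcome_adv

print(combined)
```

Under these assumptions, the positive read-only reward vanishes after normalization, so in the three failed rollouts the necessary read-only call ends up with a negative advantage even though it was rewarded.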
To resolve this, the authors introduce a hybrid MT-GRPO+GTPO advantage formulation that:
- Computes discounted, group-normalized per-turn returns.
- Combines this with a dampened (λ-weighted) group-normalized outcome advantage.
This construction ensures non-discriminative or routine actions receive zero or minimal gradient, while outcome-relevant actions get the appropriate reinforcement or suppression, eliminating direction mismatches.
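The two ingredients listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the default values of `gamma` and `lam`, and the choice to normalize returns over all turns pooled across the group are assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum over k >= t of gamma^(k - t) * r_k for one rollout."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def hybrid_advantages(turn_rewards, outcomes, gamma=0.9, lam=0.5, eps=1e-8):
    """Per-turn advantages for a group of rollouts.

    turn_rewards: list of per-rollout reward sequences (lengths may differ).
    outcomes: one terminal reward per rollout.
    """
    returns = [discounted_returns(np.asarray(r, dtype=float), gamma)
               for r in turn_rewards]
    # Assumption: normalize returns over all turns pooled across the group.
    pooled = np.concatenate(returns)
    mu, sigma = pooled.mean(), pooled.std()
    out = np.asarray(outcomes, dtype=float)
    outcome_adv = (out - out.mean()) / (out.std() + eps)
    # Dampened (lam-weighted) outcome advantage is broadcast to every turn.
    return [(g - mu) / (sigma + eps) + lam * a
            for g, a in zip(returns, outcome_adv)]

# Hypothetical group: one successful 3-turn rollout, one failed 2-turn rollout.
adv = hybrid_advantages([[0.0, 0.0, 1.0], [0.0, 0.0]], [1.0, 0.0],
                        gamma=0.5, lam=0.5)
```

Setting `lam` below 1 reduces how strongly the terminal outcome shifts every turn's advantage, which is the dampening role the text assigns to λ.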
Iterative Reward Calibration (IRC)
IRC provides a principled, data-driven pipeline for reward tier assignment:
- Analyze a large buffer of rollouts.
- For each reward tier, measure empirical discriminative power (e.g., point-biserial correlation between tier presence and task success).
- Set reward values proportional to discriminative power, reducing to zero for non-discriminative actions (such as routine read-only tool calls).
- Confirm post-normalization advantage direction to match intended policy reinforcement.
- Iterate until all advantage alignments are correct and reward-outcome correlation is high.
This loop is critical: initial naive dense rewards yielded a pronounced drop in task pass rates (up to 14 percentage points), while IRC-recalibrated rewards not only recover but surpass base performance.
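A single calibration pass over a rollout buffer might look like the sketch below. It assumes each tier is recorded as a binary presence flag per rollout and uses the Pearson correlation of two binary variables as the point-biserial estimate; the tier names, `max_reward`, and `threshold` are hypothetical. The advantage-direction check and the outer iteration are not shown.

```python
import numpy as np

def discriminative_power(presence, success):
    """Correlation between tier presence (0/1) and task success (0/1)."""
    presence = np.asarray(presence, dtype=float)
    success = np.asarray(success, dtype=float)
    # A tier that always (or never) fires carries no information about success.
    if presence.std() == 0 or success.std() == 0:
        return 0.0
    return float(np.corrcoef(presence, success)[0, 1])

def calibrate_rewards(tier_presence, success, max_reward=1.0, threshold=0.05):
    """Set each tier's reward proportional to its discriminative power."""
    rewards = {}
    for tier, presence in tier_presence.items():
        r = discriminative_power(presence, success)
        # Non-discriminative tiers are zeroed out.
        rewards[tier] = 0.0 if abs(r) < threshold else max_reward * r
    return rewards

# Hypothetical buffer of 6 rollouts.
success = [1, 1, 0, 0, 1, 0]
tier_presence = {
    "correct_write_call": [1, 1, 0, 0, 1, 0],  # tracks success exactly
    "read_only_call": [1, 1, 1, 1, 1, 1],      # present in every rollout
}
rewards = calibrate_rewards(tier_presence, success)
```

In this toy buffer the read-only tier appears in every rollout, so its reward is set to zero, while the write tier that tracks success keeps the full reward.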
Deep Argument Comparison
A nontrivial issue in tool-calling is robust matching between generated and golden actions, as arguments are deeply structured (JSON). The paper implements a recursive argument normalization and comparison function that handles ordering, type, and emptiness, leading to a 23.5% reduction in false positives during policy evaluation and reward assignment.
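A recursive normalize-then-compare function of this kind could be sketched as below. The specific rules are assumptions chosen for illustration (lists treated as unordered, ints and floats unified, whitespace stripped, empty values dropped), and the names `normalize_args` and `args_match` are hypothetical; the paper's actual rules may differ.

```python
def normalize_args(value):
    """Recursively canonicalize a JSON-like value; empty values become None."""
    if isinstance(value, dict):
        out = {}
        for key in sorted(value):
            norm = normalize_args(value[key])
            if norm is not None:  # drop keys whose values are empty
                out[key] = norm
        return out or None
    if isinstance(value, (list, tuple)):
        items = [n for n in (normalize_args(v) for v in value) if n is not None]
        # Assumption: element order is not significant.
        return sorted(items, key=repr) or None
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, bool):  # checked before int, since bool subclasses int
        return value
    if isinstance(value, (int, float)):
        return float(value)
    return value

def args_match(predicted, golden):
    """True if two argument structures are equal after normalization."""
    return normalize_args(predicted) == normalize_args(golden)

# Hypothetical tool-call arguments.
predicted = {"user_id": "u1 ", "items": [2, 1], "note": ""}
golden = {"items": [1.0, 2], "user_id": "u1"}
print(args_match(predicted, golden))
```

Whether list order should matter depends on the tool's semantics, so a real comparator would likely need per-argument rules rather than one global choice.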
Empirical Results
The models are trained and evaluated on the Tau-Bench airline customer service domain, which provides both gold traces and an LLM-based user simulator. Two models were assessed: Qwen3.5-4B and the Qwen3-30B-A3B MoE.
Key empirical results:
- On the held-out Tau-Bench airline test set, the MT-GRPO+GTPO+IRC approach brings Qwen3.5-4B to a 66.7% pass rate (+2.9pp over a strong baseline).
- The Qwen3-30B-A3B MoE improves dramatically from 58.0% to 69.5% (+11.5pp).
- The trained 4B model outperforms GPT-4.1 (49.4%) and GPT-4o (42.8%) on this benchmark, despite being ~50x smaller. The 30B MoE approaches the performance of Claude Sonnet 4.5 (70.0%) in this task.
- Naively applying dense rewards with standard MT-GRPO/GRPO degrades both models' performance below their base checkpoints.
Qualitatively, trained models exhibit sharper action grounding, more efficient dialogues (fewer turns and less verbosity), and greater resistance to user social manipulation compared to base models.
Analysis and Ablations
In-depth ablations show:
- The majority of RL performance loss under naive dense rewards derives from inappropriate learning rate scaling and gradient misallocation to non-discriminative turns.
- Sparse rewards exploit “dead turn gradient focusing,” where routine actions naturally receive zero gradient, concentrating updates on outcome-relevant actions.
- The advantage misalignment is responsible for a nontrivial, but smaller, share of the performance drop.
- Cross-domain transfer from airline to retail is promising (77.4% pass rate), but weaker for the harder telecom domain (32.0%).
Theoretical and Practical Implications
The findings provide several actionable implications:
- Reward calibration is critical: dense per-turn reward assignment in agentic RL must be empirically validated rather than intuition-driven.
- Outcome reward signal mixing requires careful formulation: hybridizing discounted per-turn returns with appropriately weighted terminal outcome normalization ensures usable gradients.
- Existing general-purpose RL methods are insufficient: idiosyncratic properties of tool-calling, multi-turn dialogue tasks expose new forms of advantage misalignment not observed in QA/math or simpler agentic settings.
- Automated calibration tools are viable: the IRC pipeline can be extended via Empirical Discriminative Gating, enabling online adaptive reward tier adjustment.
- The use of synthetic user simulators enables reproducible, scalable RL for task-oriented dialogue agents but does not fully close the gap with human-in-the-loop interaction.
Future Directions
Further development should focus on:
- Automating IRC using online reward-discriminative statistics gathered from ongoing policy evolution.
- Extending calibration and hybrid advantage techniques to multi-domain, multi-user, or real-world human feedback agents.
- Studying generalization dynamics across more diverse task ontologies, including less-structured, open-domain, or adversarial settings.
Conclusion
This work makes a substantial technical contribution by identifying and correcting the reward-advantage misalignment in RL for tool-calling agents, via a calibrated hybrid of MT-GRPO and GTPO with systematic reward value assignment using IRC. The result is robust policy improvement, strong empirical gains over both baseline and larger proprietary models, and a framework extensible to a wide array of multi-turn RL formulations for LLMs in agentic environments. These insights should strongly influence the design of future RL pipelines for practical task-oriented LLM deployments, emphasizing the necessity of empirical calibration over naive reward engineering and highlighting novel advantage computation challenges unique to agentic, tool-integrated, multi-turn settings.
Reference: "Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration" (2604.02869).