Policy and World Modeling Co-Training for Language Agents

Published 1 Jun 2026 in cs.LG and cs.AI | (2606.02388v1)

Abstract: Reinforcement learning (RL) improves LLM agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.


Authors (12)

Summary

The paper introduces a novel co-training approach (PaW) that jointly optimizes policy and world-model objectives for language agents.
It employs action-entropy-based data selection and noise-tolerant loss functions to enhance stability in predicting environment dynamics under sparse rewards.
Experiments across ALFWorld, WebShop, and QA domains show consistent gains, with success rising from 4.0% to 62.2% in one sparse-reward WebShop setting.

Policy and World Modeling Co-Training for Language Agents: An Expert Summary

Motivation and Context

LLM agents have demonstrated significant progress in complex interactive environments, particularly when enhanced by Reinforcement Learning (RL). However, standard RL paradigms for LLM agents focus exclusively on maximizing extrinsic rewards derived from the environment, neglecting explicit learning about the effects of actions on environmental dynamics. This omission leads to brittle agents prone to failure in long-horizon, open-ended tasks, especially under reward sparsity or delayed feedback. Traditional approaches to world modeling (WM) in LLM agents have introduced separate simulators, auxiliary training stages, or incurred heavy inference costs, complicating deployment and scaling.

The PaW Framework: Joint Policy and World Modeling

The PaW (Policy and World-modeling co-training) framework addresses these limitations by reusing the on-policy RL rollouts for auxiliary world modeling supervision in a single, unified model, without altering the inference pipeline. At each RL update, next-observation tokens are appended to each action–observation transition, offering dense supervision on how actions alter the state of the environment. The agent’s parameters are updated to optimize both the standard RL objective (for action selection) and an auxiliary WM objective (for action-conditioned next-observation prediction).

Figure 1: Comparison of world modeling paradigms—PaW eliminates the need for separate simulators or inference-time planning by co-training the policy and world model jointly.

Key to PaW are three training components that keep the auxiliary supervision informative and stable given the noisy, unpredictable nature of text environments:

Action-entropy-based data selection: Preferentially selects high-entropy action transitions, which are more informative for dynamics estimation.
Noise-tolerant Clipped Mean Absolute Error (MAE) loss: Replaces cross-entropy with a bounded loss on WM targets and applies confidence-based clipping to avoid overfitting unpredictable or spurious tokens.
Reward-adaptive auxiliary loss balancing: Dynamically scales the WM loss term based on the mean group reward, focusing world-model supervision on agents with insufficient task success.
Figure 3: Overview of PaW—joint RL and world-model losses are applied over action-entropy-selected transitions with adaptive weighting and noise-robust objectives.
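
The first component, action-entropy-based selection, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the mean-over-tokens aggregation, and the top-`ratio` selection rule are all assumptions made for the example. The summary states only that high-entropy action transitions are preferred.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def action_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over the tokens of one action.

    ASSUMPTION: averaging over action tokens is one possible aggregation.
    """
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def select_transitions(entropies: Sequence[float], ratio: float) -> List[int]:
    """Indices of the top `ratio` fraction of transitions by action entropy.

    ASSUMPTION: a hypothetical top-k rule; at least one transition is kept.
    """
    k = max(1, math.ceil(ratio * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])  # restore trajectory order


# Four transitions; keep the half where the policy was least certain.
selected = select_transitions([0.1, 1.2, 0.5, 0.9], ratio=0.5)
```

Here `selected` holds the indices of the two highest-entropy transitions, and only those would receive world-model supervision.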

Methodological Details

Auxiliary World Modeling Objective:

Given a trajectory with observed $(h_t, a_t, o_{t+1})$ tuples, the world modeling loss is applied over the most informative transitions, as determined by the action-entropy criterion. The clipped MAE loss further mitigates the effect of meaningless or highly stochastic observation tokens. This synergy ensures WM signals remain both relevant and robust with respect to the non-determinism and surface-level noise prevalent in textual observations.
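
A small sketch may help make the clipped MAE idea concrete. For a one-hot target, an MAE-style loss reduces to $1 - p$, where $p$ is the probability the model assigns to the observed token, so each token contributes at most 1. The clipping rule below (drop tokens whose probability falls under a threshold `tau`) is an assumption chosen for illustration; the summary does not specify the exact rule, and `clipped_mae_loss` and `tau` are hypothetical names.

```python
from typing import Sequence, Tuple


def clipped_mae_loss(target_probs: Sequence[float], tau: float = 0.05) -> Tuple[float, float]:
    """Return (mean loss over kept tokens, fraction of tokens clipped).

    target_probs[i] is the model probability of the i-th observed
    next-observation token.
    ASSUMPTION: tokens with probability below `tau` are treated as
    unpredictable and excluded from the loss.
    """
    kept = [p for p in target_probs if p >= tau]
    clipped_ratio = 1.0 - len(kept) / len(target_probs)
    if not kept:
        return 0.0, clipped_ratio
    # MAE against a one-hot target: bounded in [0, 1] per token.
    loss = sum(1.0 - p for p in kept) / len(kept)
    return loss, clipped_ratio


loss, clipped_ratio = clipped_mae_loss([0.9, 0.5, 0.01], tau=0.05)
```

The bounded form is what limits the influence of noisy tokens: a token with $p = 0.01$ would contribute $-\ln 0.01 \approx 4.6$ under cross-entropy, whereas its MAE term is below 1 even before clipping.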

Reward-Adaptive Balancing:

To prevent domination of the RL loss by dense WM supervision, PaW dynamically attenuates the auxiliary loss weighting as group episode return approaches the environment maximum, allowing greater world-model learning when policy learning is challenged.
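
One simple way to realize this behavior is a weight that shrinks linearly as the mean group reward approaches its maximum. The linear schedule, the `base_weight` value, and the function names below are assumptions for illustration only; the summary describes the direction of the adjustment, not its functional form.

```python
from typing import Sequence


def wm_loss_weight(group_rewards: Sequence[float],
                   r_max: float = 1.0,
                   base_weight: float = 0.1) -> float:
    """World-model loss weight that decays as the group succeeds more often.

    ASSUMPTION: linear decay from `base_weight` (no success) to 0 (all succeed).
    """
    mean_reward = sum(group_rewards) / len(group_rewards)
    frac = min(max(mean_reward / r_max, 0.0), 1.0)  # clamp to [0, 1]
    return base_weight * (1.0 - frac)


def total_loss(rl_loss: float, wm_loss: float, group_rewards: Sequence[float]) -> float:
    """Combined objective: RL loss plus the reward-weighted auxiliary WM loss."""
    return rl_loss + wm_loss_weight(group_rewards) * wm_loss


# A group where every rollout failed gets the full auxiliary weight;
# a group where every rollout succeeded gets none.
w_fail = wm_loss_weight([0.0, 0.0, 0.0, 0.0])
w_success = wm_loss_weight([1.0, 1.0, 1.0, 1.0])
```

Under this sketch, groups with sparse or zero reward still receive a dense learning signal from the WM term, which matches the motivation given above.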

Efficiency:

PaW computes the action and observation losses within the same model and forward pass. Generation and policy inference are unchanged at deployment; only training cost is affected.

Figure 2: Training time and memory overhead—PaW incurs only modest increases (~2%) in resource usage, demonstrating practical scalability.
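
To illustrate how both losses could share one forward pass, the sketch below builds a single token sequence with two loss masks: one marking action tokens (policy loss) and one marking next-observation tokens (WM loss). The segment roles, the mask layout, and `build_loss_masks` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

# (role, token ids); role is "context", "action", or "observation".
Segment = Tuple[str, Sequence[int]]


def build_loss_masks(segments: Sequence[Segment]) -> Tuple[List[int], List[int], List[int]]:
    """Flatten segments into one sequence plus per-token loss masks.

    ASSUMPTION: the policy loss reads positions where policy_mask == 1 and
    the WM loss reads positions where wm_mask == 1, from the same logits.
    """
    tokens: List[int] = []
    policy_mask: List[int] = []
    wm_mask: List[int] = []
    for role, ids in segments:
        tokens.extend(ids)
        policy_mask.extend([1 if role == "action" else 0] * len(ids))
        wm_mask.extend([1 if role == "observation" else 0] * len(ids))
    return tokens, policy_mask, wm_mask


tokens, policy_mask, wm_mask = build_loss_masks([
    ("context", [11, 12, 13]),
    ("action", [21, 22]),
    ("observation", [31, 32, 33]),
])
```

Because the masks only change which positions contribute to the training loss, nothing about decoding needs to differ at inference time.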

Experimental Results

Extensive evaluations were conducted across three domains:

ALFWorld: Embodied household task environment
WebShop: Open-domain web shopping with sparse reward signals
Search-augmented QA: Tool-augmented, multi-turn question answering

PaW was tested with several RL algorithms (GRPO, GiGPO, PPO, RLOO) and across model scales. The method consistently improved over strong RL-only baselines. For example, on ALFWorld and WebShop, GRPO success rates increased by up to 8% and 9% respectively. Notably, in a setting where standard RL collapses (Llama3.2-3B-Instruct on WebShop, where reward signals are extremely sparse), PaW raised success from 4.0% to 62.2%, indicating a substantial robustness benefit.

Figure 5: Training rewards on WebShop—PaW overcomes reward sparsity where vanilla RL fails, rapidly obtaining positive success signals.

Ablations indicate that the individual components, including adaptive loss balancing and the noise-robust WM loss, each contribute to the observed gains. PaW's improvements persist when hyperparameters (clipping threshold, entropy ratio) are varied broadly, underscoring the method's stability.

Training and Optimization Dynamics

Training curves show that PaW increases training reward while keeping policy-optimization statistics, such as policy-gradient and clipping statistics, close to those of the RL baseline. This suggests that world-model supervision integrates cleanly as an auxiliary objective.

Figure 4: Policy-side training dynamics—PaW improves training reward while preserving policy gradient and clipping statistics relative to baseline RL.

The WM objective behaves as expected over training: as episodic success rates rise, the WM loss is adaptively down-weighted, while prediction error on next observations declines.

Figure 8: PaW auxiliary training dynamics—reward-adaptive world-model loss and the clipped token ratio smoothly respond to policy improvements and noise.

Theoretical and Practical Implications

PaW recasts standard RL rollouts as a source of both reward-based and action-conditioned dynamics supervision, using information that is traditionally ignored. This suggests a shift in LLM agent training: dense, local world-modeling signals already present in on-policy data can improve sample efficiency and help agents internalize environment transitions. Practically, PaW's unchanged inference procedure and small training overhead make it viable for deployment in real-world, resource-constrained settings.

On a theoretical level, PaW supports the premise that LLM agents benefit from local environment modeling, reminiscent of Sutton's Dyna architecture, but without dedicated imagination or separate modeling stages. The gains in robustness, especially under reward sparsity, suggest future work on extending to multi-step or trajectory-level world modeling, alternative selection strategies for auxiliary targets, and online deduplication of supervision.

Conclusion

PaW demonstrates that policy and world-modeling co-training is an effective, efficient, and general approach for improving LLM agent performance in interactive tasks. It delivers robust gains across RL algorithms, models, and domains, with modest training overhead (about 2%) and no inference-time impact. The evidence suggests that action-conditioned observation modeling exploits a useful signal that existing RL pipelines leave untapped, promoting more resilient and sample-efficient language agents.

Future Outlook

Promising avenues include longer-horizon world modeling, deduplication over sampled transitions, and further integration with imagination-based or planning-based reinforcement learning. More sophisticated selection criteria for auxiliary targets could further improve training dynamics and generalization. These developments will matter as LLM agents move toward autonomous operation in open-ended, dynamic environments.
