Reinforcement World Model Learning for LLM-based Agents

Xiao Yu, Baolin Peng, Ruize Xu, Yelong Shen, Pengcheng He, Suman Nath, Nikhil Singh, Jiangfeng Gao, Zhou Yu
Published: February 5, 2026
Authors: 9
Word Count: 12,276
Code: Includes code

Enhancing LLMs with self-supervised world model learning.

Abstract

Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and τ²-Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and τ²-Bench respectively, while matching the performance of expert-data training.
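To make the sim-to-real gap reward more concrete, here is a minimal sketch of how a predicted next state could be scored against the observed one in an embedding space. It is an illustration under stated assumptions, not the paper's implementation: the names `toy_embed` and `sim_to_real_reward` are hypothetical, cosine similarity is an assumed choice of similarity measure, and a bag-of-words vector stands in for the pre-trained encoder so the example runs without external dependencies.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a pre-trained text encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty prediction or observation earns no reward
    return dot / (norm_a * norm_b)


def sim_to_real_reward(simulated_next_state: str, realized_next_state: str) -> float:
    """Higher when the model's simulated next state is closer to the observed one."""
    return cosine(toy_embed(simulated_next_state), toy_embed(realized_next_state))


realized = "You open the drawer. The drawer is empty."
exact = "You open the drawer. The drawer is empty."
partial = "The drawer is open and empty."
unrelated = "Nothing happens."

reward_exact = sim_to_real_reward(exact, realized)
reward_partial = sim_to_real_reward(partial, realized)
reward_unrelated = sim_to_real_reward(unrelated, realized)
```

Because the reward is a graded similarity and not an exact-match check, a prediction that overlaps with the observation receives partial credit. A bag-of-words vector only captures word overlap; a real pre-trained encoder would additionally reward paraphrases that share meaning without sharing words, which is the property the abstract contrasts with token-level next-state prediction.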

Key Takeaways

  1. RWML improves LLM decision-making in dynamic environments.

  2. Self-supervised training aligns model predictions with real outcomes.

  3. Enhances LLMs' internal world model without expert data.

Limitations

  • Requires initial interactions with the environment for data collection.

  • Relies on a pre-trained embedding space for the reward function.
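The data-collection requirement in the first limitation can be pictured with a small sketch of gathering (state, action, next state) triplets from a textual environment. Everything here is hypothetical and for illustration only: `ToyTextEnv`, `Transition`, and `collect_transitions` are invented names, and the uniformly random action choice is an assumption, not the rollout policy described in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    state: str
    action: str
    next_state: str


class ToyTextEnv:
    """Hypothetical one-object text environment, used only for illustration."""

    ACTIONS = ("open drawer", "close drawer")

    def __init__(self) -> None:
        self.drawer_open = False

    def observe(self) -> str:
        return "The drawer is open." if self.drawer_open else "The drawer is closed."

    def step(self, action: str) -> str:
        if action == "open drawer":
            self.drawer_open = True
        elif action == "close drawer":
            self.drawer_open = False
        return self.observe()


def collect_transitions(env: ToyTextEnv, num_steps: int, seed: int = 0) -> list[Transition]:
    """Roll out random actions and record each observed transition."""
    rng = random.Random(seed)
    data = []
    state = env.observe()
    for _ in range(num_steps):
        action = rng.choice(env.ACTIONS)
        next_state = env.step(action)
        data.append(Transition(state, action, next_state))
        state = next_state
    return data


transitions = collect_transitions(ToyTextEnv(), num_steps=5)
```

Each recorded `next_state` is the realized outcome that a model's simulated next state would later be compared against, so no expert labels are needed beyond the interactions themselves.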

Keywords

world-modeling, reinforcement learning, action-conditioned world models, sim-to-real gap rewards, next-state token prediction, reward hacking, task-success rewards, embedding space, agent-based systems
