Latest Reinforcement Learning Research Papers

Research on learning through interaction, reward optimization, policy learning, and decision-making AI systems.

64 Papers

On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

Shumin Wang, Yuexiang Xie, Wenhao Zhang +4 more

Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insights into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploit...

entropy, large language models, reinforcement fine-tuning, RFT, logit update, +4 more
Feb 3, 2026 · 42
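The abstract above treats entropy as a measure of output diversity in LLMs. The paper's exact formulation is not shown in the snippet, but the standard quantity is the Shannon entropy of the softmax distribution over each token's logits, averaged over a sampled response. A minimal sketch (generic, not the paper's definition):

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_sequence_entropy(per_token_logits):
    """Average per-token entropy over a sampled response — a common proxy
    for the policy entropy monitored during reinforcement fine-tuning."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)
```

A uniform distribution over k logits attains the maximum entropy log(k); a sharply peaked distribution (low diversity, weak exploration) scores near zero.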

CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

Zhiyuan Yao, Yi-Kai Zhang, Yuxin Chen +7 more

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLM reasoning. However, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods...

Reinforcement Learning with Verifiable Rewards, GRPO, rollout budget, adaptive methods, Capability-Oriented Value function, +3 more
Feb 3, 2026 · 32
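The snippet contrasts GRPO's uniform rollout budget with adaptive allocation. CoBA-RL's actual allocation rule is not shown here; as a toy illustration of the general idea, one can spend more rollouts on prompts whose outcome is most uncertain (a Bernoulli pass/fail outcome has variance p(1−p), peaking at p = 0.5):

```python
def allocate_rollouts(pass_rates, total_budget, floor=1):
    """Toy non-uniform rollout allocation: give more samples to prompts whose
    pass rate is most uncertain (Bernoulli variance p*(1-p)). Uniform GRPO
    would instead assign total_budget // len(pass_rates) to every prompt.
    This is an illustrative heuristic, not CoBA-RL's method."""
    weights = [p * (1 - p) + 1e-6 for p in pass_rates]  # epsilon keeps solved/unsolved prompts sampled
    s = sum(weights)
    return [max(floor, round(total_budget * w / s)) for w in weights]
```

With pass rates [0.5, 0.0, 1.0], nearly the whole budget flows to the uncertain prompt, while the already-solved and never-solved prompts keep only the floor allocation.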

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

Jie Xiao, Meng Chen, Qingnan Ren +16 more

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resourc...

reinforcement learning, post-training, large language models, distributed RL, rollout generation, +6 more
Feb 2, 2026 · 12

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Yinjie Wang, Tianbao Xie, Ke Shen +2 more

We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with in...

reinforcement learning, environment modeling, policy models, reward models, closed-loop optimization, +10 more
Feb 2, 2026 · 23

SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization

Jinyang Wu, Changpeng Yang, Yuhao Shen +9 more

Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards that fail to capture quality differences among trajectories achieving identical outcomes, thereby overlooking potential diversi...

reinforcement learning, verifiable rewards, trajectory optimization, policy optimization, gradient signal-to-noise ratio, +2 more
Jan 30, 2026 · 9

Beyond Imitation: Reinforcement Learning for Active Latent Planning

Zhi Zheng, Wee Sun Lee

Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens compared to the conventional language CoT reasoning and have the po...

chain-of-thought reasoning, latent reasoning, large language models, latent tokens, conditional variational auto-encoder, +4 more
Jan 29, 2026 · 6

Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Weidong Huang, Zhehan Li, Hangxin Liu +3 more

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy alg...

Proximal Policy Optimization, Soft Actor-Critic, on-policy methods, off-policy RL, model-based RL, +8 more
Jan 29, 2026 · 4

Language-based Trial and Error Falls Behind in the Era of Experience

Haoyu Wang, Guozheng Ma, Shugang Cui +7 more

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing di...

Large Language Models, agentic tasks, pretraining distribution, testing distribution, exploration cost, +10 more
Jan 29, 2026 · 15

Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Yiju Guo, Tianyi Hu, Zexu Sun +1 more

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem di...

Reinforcement Learning with Verifiable Rewards, exploration, rollout budget, sampling success, policy optimization, +4 more
Jan 29, 2026 · 12

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Yanqi Dai, Yuxiang Ji, Xiao Zhang +3 more

Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their import...

Reinforcement Learning with Verifiable Rewards, Group Relative Policy Optimization, Difficulty-Aware Group Policy Optimization, Multi-Aspect Question Reformulation, mathematical reasoning, +4 more
Jan 28, 2026 · 90
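The abstract above argues for emphasizing harder questions during RLVR training. The paper's Difficulty-Aware GRPO is not detailed in the snippet; a generic way to realize the idea is to scale each question's centered rollout rewards by a weight that grows as its empirical pass rate falls. A hedged sketch (illustrative only, with `gamma` a made-up sharpness knob):

```python
def difficulty_weight(pass_rate, gamma=1.0):
    """Toy difficulty weight: harder questions (lower empirical pass rate)
    receive a larger multiplier on their policy-gradient contribution."""
    return (1.0 - pass_rate) ** gamma

def weighted_advantages(rewards, pass_rate):
    """Center each rollout's reward against the group mean, then scale the
    whole group by the question's difficulty weight."""
    mean = sum(rewards) / len(rewards)
    w = difficulty_weight(pass_rate)
    return [w * (r - mean) for r in rewards]
```

Because the weight multiplies already-centered rewards, it changes the magnitude of each question's gradient signal without shifting its mean, so easy questions are de-emphasized rather than penalized.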

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric +8 more

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottlen...

reinforcement learning, verifiable rewards, rich feedback, Self-Distillation Policy Optimization, SDPO, +7 more
Jan 28, 2026 · 19

Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

Jinyang Wu, Shuo Yang, Changpeng Yang +4 more

Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and indiscrimi...

reinforcement learning, large language models, long-horizon tasks, trajectory scarcity, rollout size, +5 more
Jan 28, 2026 · 18

Endless Terminals: Scaling RL Environments for Terminal Agents

Kanishk Gandhi, Shivam Garg, Noah D. Goodman +1 more

Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-us...

reinforcement learning, PPO, terminal benchmarks, procedural generation, containerized environments, +3 more
Jan 23, 2026 · 6

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Haocheng Xi, Charlie Ruan, Peiyuan Liao +7 more

Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized ...

reinforcement learning, large language models, quantized RL training, FP8 precision, BF16 precision, +9 more
Jan 20, 2026 · 15
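The Jet-RL snippet motivates running the rollout phase in FP8 rather than BF16. The paper's precision flow is not shown here; to make the numerics concrete, the sketch below "fake-quantizes" values through an FP8 E4M3-like roundtrip (per-tensor scaling into E4M3's ~448 max, 3-bit mantissa rounding, scale back). Real FP8 training quantizes weights and activations inside matmul kernels; this is only an error model:

```python
import math

def fake_quantize_fp8_e4m3(x, amax):
    """Per-tensor fake-quantization roundtrip mimicking FP8 E4M3 scaling:
    scale into the representable range (max ~448 for E4M3), round the
    mantissa to 3 bits, saturate, and scale back. Illustrative only."""
    FP8_MAX = 448.0
    scale = FP8_MAX / amax
    out = []
    for v in x:
        s = max(-FP8_MAX, min(FP8_MAX, v * scale))  # saturate to range
        if s == 0.0:
            out.append(0.0)
            continue
        e = math.floor(math.log2(abs(s)))            # exponent of the value
        q = round(s / 2**e * 8) / 8 * 2**e           # keep 3 mantissa bits
        out.append(q / scale)
    return out
```

The 3-bit mantissa bounds the relative roundtrip error at roughly 1/16 per value, which is the kind of rollout/training precision mismatch an on-policy FP8 pipeline has to keep controlled.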

KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

Egor Cherepanov, Daniil Zelezetsky, Alexey K. Kovalev +1 more

Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorize...

pixel-based reinforcement learning, visual distribution shift, latent dynamics, reward function, JAX-native, +5 more
Jan 20, 2026 · 8

Behavior Knowledge Merge in Reinforced Agentic Models

Xiangchi Yuan, Dachuan Shi, Chunhui Zhang +4 more

Reinforcement learning (RL) is central to post-training, particularly for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for integrating multiple RL-trained agents from different tasks into a single generalist model. However, ...

reinforcement learning, model merging, agentic models, task vectors, supervised fine-tuning, +5 more
Jan 20, 2026 · 22

Your Group-Relative Advantage Is Biased

Fengkai Yang, Zherui Chen, Xiaohan Wang +10 more

Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid lear...

Reinforcement Learning from Verifier Rewards, group-based methods, GRPO, advantage estimation, bias correction, +4 more
Jan 13, 2026 · 128
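The entry above critiques the group-relative advantage used by GRPO-style methods. The specific bias the paper identifies is not visible in the snippet, but the estimator in question is standard: z-score each rollout's reward against its own group's statistics, in place of a learned value baseline. A minimal sketch of that baseline estimator:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: normalize each rollout's reward
    by the mean and (population) standard deviation of its own group.
    The group itself serves as the baseline, so no value network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the mean and std are estimated from the same small group being scored, the resulting advantages are correlated with the noise in those statistics, which is the kind of estimation artifact a bias-correction analysis targets.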
Page 3 of 4