Latest Reinforcement Learning Research Papers

Research on learning through interaction, reward optimization, policy learning, and decision-making AI systems.

64 Papers
Showing 20 of 64 papers

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie +3 more

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this wo...

on-policy distillation · logit distribution · dense KL-constrained RL · reward scaling factor · reward extrapolation · +4 more
Feb 12, 2026 · 55

Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Shuo He, Lang Feng, Xin Cheng +2 more

Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust eac...

importance sampling · policy optimization · reinforcement learning · large language models · Kalman filter · +5 more
Feb 11, 2026 · 10
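For readers unfamiliar with the token-level vs. sequence-level IS distinction this abstract refers to, here is a minimal illustrative sketch (not the paper's method): per-token ratios are exponentiated log-prob differences, and the single sequence-level ratio is their product.

```python
import math

def token_is_ratios(logp_new, logp_old):
    """Per-token importance sampling ratios pi_new / pi_old,
    computed from token log-probs under each policy."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_is_ratio(logp_new, logp_old):
    """One sequence-level ratio: the product of the token ratios,
    i.e. exp(sum of log-prob differences)."""
    return math.exp(sum(logp_new) - sum(logp_old))

# Toy log-probs for a 3-token completion (made-up numbers).
logp_old = [-1.2, -0.8, -2.5]
logp_new = [-1.0, -0.9, -2.0]
print(token_is_ratios(logp_new, logp_old))
print(sequence_is_ratio(logp_new, logp_old))
```

The variance problem the abstract mentions comes from the per-token ratios multiplying up: a few outlier tokens can blow up the product, which is why a single clipped or smoothed sequence-level ratio is a common stabilizer.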

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Yicheng Chen, Zerun Ma, Xinchen Xie +2 more

In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing u...

Large Language Models · data recipe · reinforcement learning · proxy reward · downstream performance · +3 more
Feb 11, 2026 · 12

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng +2 more

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. ...

reinforcement learning · large language models · policy staleness · asynchronous training · importance sampling · +6 more
Feb 11, 2026 · 167

Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

Kirill Pavlenko, Alexander Golubev, Simon Karasik +1 more

Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose B...

Group Relative Policy Optimization · advantage estimation · reward interference · structured generations · text blocks · +3 more
Feb 10, 2026 · 5
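The "single scalar advantage" this abstract criticizes can be sketched in a few lines (an illustration of standard GRPO-style normalization, not the proposed blockwise method): each completion's reward is normalized against its group's mean and standard deviation, and that one number is then broadcast to every token.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each completion's reward
    against the group mean/std. In GRPO the resulting scalar is then
    assigned to every token of that completion, which is the coupling
    a blockwise estimator would break up per segment."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

# A group of 4 sampled completions with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # roughly [1, -1, -1, 1]
```

Under a blockwise scheme, each explicit segment of a structured completion would instead receive an advantage computed from its own objective's reward, rather than sharing this one group-level scalar.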

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Zhaoyang Wang, Canwen Xu, Boyi Liu +5 more

Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent ...

large language model · autonomous agents · multi-turn interactions · tool-use agents · reinforcement learning · +4 more
Feb 10, 2026 · 34

Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Lang Feng, Longtao Zheng, Shuo He +2 more

Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agen...

multi-agent LLM systems · reinforcement learning · GRPO-style optimization · gradient-norm instability · agent-wise remedy · +3 more
Feb 9, 2026 · 11

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang +10 more

Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting...

large language model agents · reinforcement learning · skill discovery · recursive evolution · skill library · +5 more
Feb 9, 2026 · 53

WorldCompass: Reinforcement Learning for Long-Horizon World Models

Zehan Wang, Tengfei Wang, Haiyu Zhang +9 more

This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's explorat...

Reinforcement Learning · world models · video generation · rollout strategy · reward functions · +3 more
Feb 9, 2026 · 15

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov +3 more

Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller group...

reinforcement learning · verifiable rewards · group sampling · advantage estimation · policy updates · +5 more
Feb 6, 2026 · 28

ECO: Energy-Constrained Optimization with Reinforcement Learning for Humanoid Walking

Weidong Huang, Jingwen Zhang, Jiongye Li +6 more

Achieving stable and energy-efficient locomotion is essential for humanoid robots to operate continuously in real-world applications. Existing MPC and RL approaches often rely on energy-related metrics embedded within a multi-objective optimization framework, which require extensive hyperparameter t...

model predictive control · reinforcement learning · constrained optimization · Lagrangian method · energy-constrained optimization · +3 more
Feb 6, 2026 · 3

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Yuchen Yan, Liang Jiang, Jin Jiang +7 more

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing interme...

chain-of-thought · iterative reasoning · reinforcement learning · trajectory-level reinforcement learning · summarization · +2 more
Feb 6, 2026 · 5

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

Zhiqi Yu, Zhangquan Chen, Mengting Liu +2 more

Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage sym...

Reinforcement Learning with Verifiable Rewards · GRPO · Group Relative Advantage Estimation · GRAE · asymmetric suppression · +5 more
Feb 5, 2026 · 11

Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Wei Liu, Jiawei Xu, Yingru Li +4 more

High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these ca...

reinforcement learning · kernel generation · reward hacking · policy gradient · GRPO · +6 more
Feb 5, 2026 · 22

Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov +1 more

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to...

Reinforcement Learning with Verifiable Rewards · policy optimization · Group Relative Policy Optimization · advantage estimation · Prompt Perplexity · +9 more
Feb 5, 2026 · 12

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Fanfan Liu, Youyang Yin, Peng Shi +3 more

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often re...

Reinforcement Learning with Verifiable Rewards · LLMs · Vision-Language Models · response length · sequence policy optimization · +4 more
Feb 5, 2026 · 45

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Zhenghao Xu, Qin Lu, Changlong Yu +1 more

Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable pa...

policy mirror descent · KL-regularized policy improvement · log-partition term · mean reward · log-policy space · +5 more
Feb 5, 2026 · 5

Self-Hinting Language Models Enhance Reinforcement Learning

Baohao Liao, Hanze Dong, Xinxing Xu +2 more

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advanta...

Group Relative Policy Optimization · reinforcement learning · privileged supervision · self-hint · rollout distribution · +5 more
Feb 3, 2026 · 17
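The stall this abstract describes is easy to see numerically (an illustrative sketch of standard group-relative normalization, not the paper's self-hinting fix): when every rollout in a group receives the same sparse terminal reward, all relative advantages are zero and the policy gets no learning signal.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standard group-relative normalization of rewards
    (reward minus group mean, divided by group std)."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

# Under sparse terminal rewards, an entire group can fail together:
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros -> no gradient
```

Injecting hints into some rollouts is one way to reshape the rollout distribution so that within-group rewards differ and the advantages become informative again.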

Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning

Jiayao Mai, Bangyan Liao, Zhenjun Zhao +6 more

The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on...

predictor-corrector · reinforcement learning · sequential decision-making · amortized training · neural solver · +3 more
Feb 3, 2026 · 12

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Ziru Chen, Dongdong Chen, Ruinan Jin +3 more

Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide a...

large language models · reinforcement learning · multi-turn code generation · Markov decision process · contextual bandit learning · +6 more
Feb 3, 2026 · 3
Page 2 of 4