Latest Reinforcement Learning Research Papers

Research on learning through interaction, reward optimization, policy learning, and decision-making AI systems.

64 Papers
Showing 20 of 64 papers

TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

Yihong Luo, Tianyang Hu, Weijian Luo +1 more

While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through ...

reinforcement learning · few-step models · diffusion models · Trajectory Distribution Matching · surrogate reward learning (+5 more)
Mar 8, 2026 · 12

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Yuan Li, Bo Wang, Yufei Gao +4 more

Proximal constraints are fundamental to the stability of large language model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of l...

PPO · trust regions · f-divergences · convex optimization · entropy collapse (+3 more)
Mar 5, 2026 · 51
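For context on the bottleneck BandPO's abstract describes, here is a minimal sketch of the canonical PPO clipped surrogate; the function name and shapes are illustrative, not BandPO's method:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Canonical PPO clipped surrogate (illustrative sketch).

    ratio: pi_new(a|s) / pi_old(a|s) for sampled actions.
    The fixed bounds [1 - eps, 1 + eps] cap how far the new policy may
    move from the old one regardless of the action's probability -- the
    symmetric, probability-agnostic constraint that the abstract
    identifies as limiting the upward update margin.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# A positive-advantage action: the update is cut off once the ratio
# exceeds 1 + eps, even if a larger step would be safe.
print(ppo_clip_objective(np.array([1.5]), np.array([2.0])))  # -> [2.4]
```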

Specificity-aware reinforcement learning for fine-grained open-world classification

Samuele Angheben, Davide Berasi, Alessandro Conti +2 more

Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions whe...

Large Multimodal Models · reinforcement learning · specificity · open-world setting · fine-grained image classification (+2 more)
Mar 3, 2026 · 11

Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun +8 more

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective t...

diffusion models · reinforcement learning · 3D editing · multi-view consistency · supervised fine-tuning (+3 more)
Mar 3, 2026 · 122

Next Embedding Prediction Makes World Models Stronger

George Bredis, Nikita Balagansky, Daniil Gavrilov +1 more

Capturing temporal dependencies is critical for model-based reinforcement learning (MBRL) in partially observable, high-dimensional domains. We introduce NE-Dreamer, a decoder-free MBRL agent that leverages a temporal transformer to predict next-step encoder embeddings from latent state sequences, d...

temporal transformer · next-step encoder embeddings · temporal predictive alignment · representation space · model-based reinforcement learning (+4 more)
Mar 3, 2026 · 16

Heterogeneous Agent Collaborative Reinforcement Learning

Zhixia Zhang, Zixuan Huang, Xin Xia +7 more

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during traini...

heterogeneous agents · collaborative optimization · on-policy optimization · multi-agent reinforcement learning · rollout sharing (+4 more)
Mar 3, 2026 · 146

Efficient RLVR Training via Weighted Mutual Information Data Selection

Xinyu Zhou, Boyu Zhu, Haotian Zhang +2 more

Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints wi...

reinforcement learning · difficulty-based heuristics · epistemic uncertainty · Bayesian latent success rates · weighted mutual information (+6 more)
Mar 2, 2026 · 12

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Qiyuan Zhang, Yufei Wang, Tianhe Wu +5 more

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of diffe...

Generative Reward Models · Chain-of-Thought · Breadth-CoT · Depth-CoT · modular synthesis pipeline (+4 more)
Mar 2, 2026 · 29

Learn Hard Problems During RL with Reference Guided Fine-tuning

Yangzhen Wu, Shanda Li, Zixin Wen +5 more

Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, the LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with ...

reinforcement learning · mathematical reasoning · reward sparsity · supervised accuracy · DAPO training (+1 more)
Mar 1, 2026 · 11
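The reward-sparsity failure mode this abstract describes can be seen in a group-normalized advantage estimate of the kind used in GRPO/DAPO-style training; this is an illustrative sketch, not the paper's formulation:

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """Group-normalized advantage over G rollouts of one prompt
    (GRPO/DAPO-style sketch; names and normalization are illustrative).
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A hard problem where every sampled trajectory is wrong: all rewards
# are 0, so every advantage is 0 and the policy receives no gradient
# signal -- the sparsity that reference-guided fine-tuning targets.
print(group_relative_advantage([0.0, 0.0, 0.0, 0.0]))  # -> [0. 0. 0. 0.]
```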

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Weinan Dai, Hanlin Wu, Qiying Yu +13 more

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kern...

CUDA kernel optimization · large language models · torch.compile · reinforcement learning · kernel generation (+5 more)
Feb 27, 2026 · 41

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang +11 more

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to lar...

agentic reinforcement learning · policy gradient · training stability · policy optimization · ARLArena (+1 more)
Feb 25, 2026 · 17

Reinforced Fast Weights with Next-Sequence Prediction

Hee Seung Hwang, Xindi Wu, Sanghyuk Chun +1 more

Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token...

fast weight architectures · attention-based transformers · next-token prediction · next-sequence prediction · reinforcement learning (+7 more)
Feb 18, 2026 · 12

Multi-agent cooperation through in-context co-player inference

Marissa A. Weis, Maciej Wołczyk, Rajai Nasser +4 more

Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing ...

multi-agent reinforcement learning · sequence models · in-context learning · cooperative behavior · learning-aware agents (+5 more)
Feb 18, 2026 · 15

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Shiqi Liu, Zeyu He, Guojian Zhan +10 more

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, le...

reinforcement learning · policy gradients · token probability · policy entropy · training instability (+8 more)
Feb 17, 2026 · 3

GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

Modi Jin, Yiming Zhang, Boyuan Sun +3 more

This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought...

chain-of-thought · geolocation dataset · geo-similarity reward · consistency reward · consistency agent (+2 more)
Feb 13, 2026 · 5

RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Liangzhi Shi, Shuaihang Chen, Feng Gao +7 more

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and d...

vision-language-action · sim-real co-training · reinforcement learning · supervised fine-tuning · policy optimization (+2 more)
Feb 13, 2026 · 9

FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching

Lei Lv, Yunfei Li, Yu Luo +2 more

Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Cri...

diffusion models · flow matching · Maximum Entropy Reinforcement Learning · velocity field · Generalized Schrödinger Bridge (+5 more)
Feb 13, 2026 · 3
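The difficulty FLAC's abstract points to shows up in the standard maximum-entropy Bellman target, which needs the action log-density; a minimal SAC-style sketch (names and defaults are illustrative, not FLAC's method):

```python
def soft_value_target(reward, next_q, log_prob, alpha=0.2, gamma=0.99):
    """Soft Bellman target from maximum-entropy RL (SAC-style sketch).

    The entropy bonus -alpha * log pi(a|s) requires the action
    log-density. For a Gaussian policy this is cheap to evaluate; for
    iterative generative policies (diffusion, flow matching) log pi is
    not directly accessible, which is the obstacle FLAC addresses.
    """
    return reward + gamma * (next_q - alpha * log_prob)
```

With a lower-probability action (more negative `log_prob`), the entropy bonus raises the target, rewarding stochastic policies.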

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Futing Wang, Jianhao Yan, Yun Luo +6 more

Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to e...

In-Context Exploration · State Coverage theory · Shallow Exploration Trap · Length-Incentivized Exploration · autoregressive generation (+2 more)
Feb 12, 2026 · 28

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie +3 more

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this wo...

on-policy distillation · logit distribution · dense KL-constrained RL · reward scaling factor · reward extrapolation (+4 more)
Feb 12, 2026 · 55
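The core OPD signal the abstract describes, matching the student to the teacher's logit distribution on student-generated tokens, can be sketched as a per-token KL over the vocabulary; function and variable names here are illustrative, not the paper's API:

```python
import numpy as np

def distill_kl(student_logits, teacher_logits):
    """Per-token KL(student || teacher) over the vocabulary, computed
    on a student-generated token position -- a minimal sketch of the
    on-policy distillation loss (illustrative, not the paper's exact
    objective).
    """
    s = student_logits - student_logits.max()   # stabilize softmax
    t = teacher_logits - teacher_logits.max()
    p_s = np.exp(s) / np.exp(s).sum()
    p_t = np.exp(t) / np.exp(t).sum()
    return float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))

# Identical logits -> zero KL: the student already matches the teacher.
print(distill_kl(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # -> 0.0
```

Summed over a student-sampled trajectory, this term acts as the dense KL-constrained RL reward that the paper's reward-extrapolation view generalizes.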
Page 1 of 4