Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets

Published: February 5, 2026
Authors: 4
Word Count: 7,203
Code: Includes code

Enhances LLM exploration via probabilistic policy optimization.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths while redistributing probability mass toward under-explored correct solutions. Our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that the resulting method, ProGRPO, significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability to generate diverse correct reasoning paths.
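The mechanism described above can be sketched in code: compute standard GRPO group-relative advantages, then re-weight the advantages of correct responses inversely to their answer confidence so that over-confident reasoning paths receive attenuated gradient signal. This is a minimal illustration under stated assumptions, not the paper's actual formula; the function names, the `(1 - confidence)` weighting form, and the use of prompt perplexity to soften the attenuation on harder prompts are all hypothetical choices made for the sketch.

```python
import math

def group_relative_advantages(rewards):
    """Standard GRPO-style advantage: reward normalized within the sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def arm_reweight(advantages, rewards, answer_confidences,
                 prompt_perplexity, alpha=1.0):
    """Hypothetical Advantage Re-weighting Mechanism (ARM) sketch.

    Correct responses with high answer confidence get a smaller weight, so
    probability mass can flow toward under-explored correct solutions.
    `prompt_perplexity` softens the attenuation on harder prompts (an
    assumption, not the paper's exact functional form).
    """
    reweighted = []
    for adv, r, conf in zip(advantages, rewards, answer_confidences):
        if r > 0 and adv > 0:
            # higher confidence -> smaller weight; high perplexity -> exponent
            # closer to 0 -> weight closer to 1 (less attenuation)
            weight = (1.0 - conf) ** (alpha / prompt_perplexity)
            reweighted.append(adv * weight)
        else:
            # incorrect (or below-average) responses keep their raw advantage
            reweighted.append(adv)
    return reweighted
```

For example, in a group of four samples with rewards `[1, 1, 0, 0]`, both correct responses start with the same advantage; after re-weighting, the one generated with confidence 0.9 contributes a smaller update than the one generated with confidence 0.3, while the incorrect responses are untouched.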

Key Takeaways

  1. Proposes ProGRPO to enhance LLM reasoning diversity.

  2. Introduces ARM to reshape reward distribution dynamically.

  3. Mitigates entropy collapse in reinforcement learning for LLMs.

Limitations

  • Requires fine-tuning of confidence-aware signals.

  • Benchmarked primarily on specific models and tasks.

Keywords

Reinforcement Learning with Verifiable Rewards, policy optimization, Group Relative Policy Optimization, advantage estimation, Prompt Perplexity, Answer Confidence, entropy collapse, generative diversity, response entropy, exploration, exploitation, reasoning tasks, Pass@1, Pass@32
