Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
Published: January 28, 2026
Authors: 11
Word Count: 3,031
Code: Included

SDPO turns rich textual feedback into a dense self-distillation signal for reinforcement learning, without an external teacher or reward model.

Abstract

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
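The abstract describes the core mechanism: the same model, conditioned on the textual feedback, acts as a self-teacher whose next-token predictions are distilled back into the policy. The sketch below illustrates one plausible reading of that objective as a per-token KL divergence between the feedback-conditioned and unconditioned distributions over the attempt tokens. The context layout, the KL direction, and the HuggingFace-style model interface are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sdpo_self_distillation_loss(model, prompt_ids, feedback_ids, attempt_ids):
    """Minimal sketch of the self-distillation idea described in the abstract.

    The same model is evaluated twice on the tokens of a (failed) attempt:
      - teacher: conditioned on the prompt *and* the textual feedback,
      - student: conditioned on the prompt alone.
    The feedback-informed next-token distributions are treated as targets and
    distilled back into the policy via a per-token KL divergence. Tensor
    layouts, context ordering, and loss weighting are assumptions.
    """
    n = attempt_ids.size(-1)

    # Teacher context: prompt + feedback + attempt; gradients are stopped,
    # since the feedback-conditioned model only provides targets.
    with torch.no_grad():
        teacher_ctx = torch.cat([prompt_ids, feedback_ids, attempt_ids], dim=-1)
        # Logits at position i predict token i+1, so the predictions for the
        # attempt tokens sit in the last n+1 positions, excluding the final one.
        teacher_logits = model(teacher_ctx).logits[:, -n - 1:-1, :]

    # Student context: prompt + attempt; gradients flow into the policy.
    student_ctx = torch.cat([prompt_ids, attempt_ids], dim=-1)
    student_logits = model(student_ctx).logits[:, -n - 1:-1, :]

    # Per-token KL(teacher || student), averaged over the batch.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_logp, teacher_logp,
                    log_target=True, reduction="batchmean")
```

In a training loop, a loss of this form would presumably be applied to failed rollouts alongside (or in place of) the standard RLVR policy-gradient term; how SDPO weights and schedules the two signals is not specified here.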

Key Takeaways

  1. SDPO converts rich textual feedback into a dense learning signal for self-improvement, without an external teacher or reward model.

  2. It outperforms strong RLVR baselines in sample efficiency and final accuracy, even in environments that return only scalar rewards.

  3. Applied at test time, it accelerates discovery on very hard binary-reward questions, matching best-of-k sampling with about 3x fewer attempts.

Limitations

  • Requires rich feedback for effective learning.

  • May struggle when the initial model's feedback-conditioned predictions are not strong enough to serve as a useful self-teacher.

Keywords

reinforcement learning, verifiable rewards, rich feedback, Self-Distillation Policy Optimization, SDPO, tokenized feedback, dense learning signal, reward model, policy distillation, in-context learning, sample efficiency, accuracy
