Reinforcement Learning

Learn Hard Problems During RL with Reference Guided Fine-tuning

Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai

Published: March 1, 2026 · Authors: 8 · Word count: 7,207 · Includes code

ReGFT uses partial reference guidance to generate model-aligned solutions for hard problems before RL training.

Abstract

Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: on challenging problems, the LLM fails to sample any correct trajectory, so RL receives no meaningful positive feedback. At the same time, human-written reference solutions often accompany such problems (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit, because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that uses human-written reference solutions to synthesize positive trajectories on hard problems and trains on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
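The core idea of reference guidance can be made concrete with a minimal sketch. The function and prompt names below are assumptions for illustration, not the authors' code: the model is shown only a prefix of the human-written reference solution as a hint and must produce its own complete reasoning trace, which keeps the trajectory inside the model's own distribution.

```python
# Hypothetical sketch of ReGFT-style prompt construction.
# `truncate_reference` and `build_guided_prompt` are illustrative names,
# not part of the paper's released code.

def truncate_reference(reference: str, fraction: float) -> str:
    """Keep roughly the first `fraction` of the reference, split by lines/steps."""
    steps = [s for s in reference.split("\n") if s.strip()]
    keep = max(1, int(len(steps) * fraction))
    return "\n".join(steps[:keep])

def build_guided_prompt(problem: str, reference: str, fraction: float = 0.5) -> str:
    """Format a problem plus a partial reference solution as a guidance prompt."""
    hint = truncate_reference(reference, fraction)
    return (
        f"Problem:\n{problem}\n\n"
        f"Partial reference solution (use as a hint):\n{hint}\n\n"
        "Continue with your own full reasoning and give the final answer."
    )
```

In the full pipeline one would sample completions from these prompts, keep only the trajectories whose final answers are verified correct, and fine-tune on them before starting RL.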

Key Takeaways

  1. ReGFT uses partial reference solutions as hints to generate model-aligned correct trajectories for hard problems.

  2. Reference-guided trajectories overcome reward sparsity by providing dense learning signals during RL training.

  3. ReGFT consistently outperforms direct reference fine-tuning and ReFT across multiple mathematical reasoning benchmarks.

Limitations

  • ReGFT only targets problems that the model solves less than 25% of the time, leaving easier problems unaddressed.

  • Method requires human-written reference solutions to exist for each training problem in the dataset.
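The 25% solve-rate cutoff mentioned above amounts to a simple empirical pass-rate filter. A minimal sketch, assuming a precomputed count of correct samples per problem (the function name and interface are hypothetical):

```python
# Hedged sketch of the hard-problem filter: keep only problems the model
# solves on fewer than `threshold` of its sampled attempts, since those
# are where RL reward is sparse. Interface is an assumption for illustration.

def select_hard_problems(pass_counts: dict[str, int], n_samples: int,
                         threshold: float = 0.25) -> list[str]:
    """Return problem ids whose empirical pass rate is below `threshold`."""
    return [pid for pid, correct in pass_counts.items()
            if correct / n_samples < threshold]
```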

Keywords

reinforcement learning, mathematical reasoning, reward sparsity, supervised accuracy, DAPO training, reinforcement learning-based mathematical reasoning
