Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin

Published: February 12, 2026
Authors: 9
Word Count: 9,847

Train language models to think longer by incentivizing exploration of diverse reasoning paths.

Abstract

Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration, a simple yet effective recipe that explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that Length-Incentivized Exploration effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
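To make the recipe concrete, here is a minimal sketch of a length-based reward combined with an n-gram redundancy penalty. The function name, the n-gram abstraction, and the coefficients `alpha` and `beta` are all illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def length_incentivized_reward(tokens, base_reward, n=4, alpha=0.001, beta=0.01):
    """Sketch: reward longer trajectories, penalize repeated n-grams.

    All names and coefficients are illustrative assumptions, not the
    paper's exact reward definition.
    """
    # Length bonus: counteracts the exponentially decaying probability
    # of sampling long reasoning trajectories.
    length_bonus = alpha * len(tokens)

    # Redundancy penalty: count repeated n-grams so the model cannot
    # inflate length by looping over the same reasoning pattern.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    repeats = sum(c - 1 for c in ngrams.values() if c > 1)
    redundancy_penalty = beta * repeats

    return base_reward + length_bonus - redundancy_penalty
```

Under this sketch, two trajectories of equal length receive the same length bonus, but the one that cycles through the same n-grams is penalized, so diverse reasoning paths score higher.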

Key Takeaways

  1. Language models get stuck in shallow reasoning patterns despite having more tokens available for exploration.

  2. Length-incentivized training enables models to explore diverse reasoning paths within a single response trajectory.

  3. Count-based exploration using last-n-grams effectively measures and maximizes distinct reasoning patterns in model outputs.
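The third takeaway can be sketched as a simple coverage counter: abstract each generation step to its last n tokens and count distinct abstract states. The function name and the exact abstraction are assumptions for illustration, not the paper's implementation.

```python
def distinct_state_coverage(tokens, n=4):
    """Approximate state coverage by abstracting each step of a
    trajectory to its last n tokens (an assumed simplification of a
    last-n-gram state abstraction) and counting distinct states."""
    states = {tuple(tokens[max(0, i - n):i]) for i in range(1, len(tokens) + 1)}
    return len(states)
```

A trajectory that keeps revisiting the same reasoning pattern collapses to a handful of abstract states, while one with diverse reasoning paths covers nearly as many states as it has tokens, which is exactly the quantity a count-based exploration bonus would maximize.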

Limitations

  • Standard RL training doesn't naturally incentivize longer responses due to exponential decay in sampling probability.

  • State abstraction using last-n-grams may miss important distinctions in reasoning patterns beyond token sequences.

Keywords

In-Context Exploration, State Coverage theory, Shallow Exploration Trap, Length-Incentivized Exploration, autoregressive generation, redundancy penalty, state coverage
