VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Guobin Shen, Chenxiao Zhao +3
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference e...