
ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

Jie Xiao, Meng Chen, Qingnan Ren, Jingwei Song, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Ween Yang, Lynn Ai, Eric Yang, Bill Shi, Song Jingwei
Published: February 2, 2026
Authors: 19
Word Count: 8,534
Code: included

ECHO-2 cuts RL training costs 33-36% by distributing cheap rollout generation while keeping expensive policy learning centralized.

Abstract

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

Key Takeaways

  1. ECHO-2 achieves a 33-36% cost reduction by moving rollout generation to cheaper distributed resources instead of premium GPU clusters.

  2. Policy staleness is treated as a control parameter, allowing rollouts from policies up to S steps old to reduce synchronization overhead.

  3. The system balances rollout generation and policy learning through an overlap condition that keeps the learner continuously busy without idle time.
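The exact form of the overlap-based capacity model is not given in this summary. A minimal sketch of one plausible reading, assuming the learner consumes `batch_size` rollouts per training step of duration `t_train` seconds, each of the distributed workers generates `rollouts_per_sec` rollouts, a new policy takes `t_diss` seconds to disseminate, and a staleness bound of `staleness` steps lets dissemination overlap that many training steps (all names here are hypothetical, not from the paper):

```python
import math


def min_workers(batch_size: int, rollouts_per_sec: float,
                t_train: float, t_diss: float, staleness: int):
    """Hypothetical provisioning rule: smallest worker count that keeps
    the learner busy, given that dissemination must hide inside the
    staleness budget. Returns None if overlap is infeasible."""
    # Dissemination can be overlapped with at most `staleness` training
    # steps; beyond that, workers would exceed the allowed policy lag.
    if t_diss > staleness * t_train:
        return None
    # While one step trains, workers must generate a full next batch.
    return math.ceil(batch_size / (rollouts_per_sec * t_train))


# Example: 512-rollout batches, 0.5 rollouts/s per worker, 60 s steps,
# 90 s dissemination, staleness bound of 2 steps.
print(min_workers(512, 0.5, 60.0, 90.0, 2))   # 18 workers suffice
print(min_workers(512, 0.5, 60.0, 200.0, 2))  # None: lag budget exceeded
```

Under this reading, raising the staleness bound S relaxes the dissemination constraint without changing the worker count, which is why the paper can trade a small amount of policy lag for cheaper, slower links.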

Limitations

  • Wide-area distributed systems introduce latency and heterogeneous throughput that can create training inefficiencies if not carefully managed.

  • The approach assumes modern RL objectives like GRPO are robust to moderate policy lag, which may not apply to all RL settings.

Keywords

reinforcement learning, post-training, large language models, distributed RL, rollout generation, reward evaluation, centralized learning, policy staleness, peer-assisted pipelined broadcast, cost-aware activation, GRPO
