
GigaBrain-0.5M*: A VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
Published: February 12, 2026
Authors: 25
Word count: 8,295

GigaBrain-0.5M* enhances robot learning by integrating world model predictions into vision-language-action policies.

Abstract

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).
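
As a rough illustration of the world-model-conditioned policy idea described above, the sketch below rolls out imagined future latents from a video world model and conditions an action-chunk head on them. All module names, tensor shapes, and the fusion scheme (VideoWorldModel, WorldModelConditionedPolicy, mean-pooled futures) are hypothetical placeholders for exposition, not the actual GigaBrain-0.5M* architecture.

```python
# Minimal sketch: condition an action-chunk policy on futures imagined by a
# video world model. Names and shapes are illustrative assumptions only.
import torch
import torch.nn as nn


class VideoWorldModel(nn.Module):
    """Stand-in for a pretrained video world model that predicts future latents."""

    def __init__(self, latent_dim: int = 256, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        self.dynamics = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def rollout(self, obs_latent: torch.Tensor) -> torch.Tensor:
        # Autoregressively imagine `horizon` future latent states from the
        # current observation latent: (B, D) -> (B, horizon, D).
        futures, state, hidden = [], obs_latent.unsqueeze(1), None
        for _ in range(self.horizon):
            state, hidden = self.dynamics(state, hidden)
            futures.append(state)
        return torch.cat(futures, dim=1)


class WorldModelConditionedPolicy(nn.Module):
    """Action-chunk head that sees both the current latent and imagined futures."""

    def __init__(self, latent_dim: int = 256, action_dim: int = 7, chunk_len: int = 16):
        super().__init__()
        self.world_model = VideoWorldModel(latent_dim)
        self.fuse = nn.Linear(latent_dim * 2, latent_dim)
        self.action_head = nn.Linear(latent_dim, action_dim * chunk_len)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, obs_latent: torch.Tensor) -> torch.Tensor:
        futures = self.world_model.rollout(obs_latent)      # (B, H, D) imagined futures
        future_summary = futures.mean(dim=1)                # pool the imagined rollout
        fused = torch.relu(self.fuse(torch.cat([obs_latent, future_summary], dim=-1)))
        actions = self.action_head(fused)                    # (B, chunk_len * action_dim)
        return actions.view(-1, self.chunk_len, self.action_dim)


policy = WorldModelConditionedPolicy()
action_chunk = policy(torch.randn(2, 256))  # future-conditioned multi-step action chunk
print(action_chunk.shape)                   # torch.Size([2, 16, 7])
```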

Key Takeaways

  1. GigaBrain-0.5M* combines vision-language-action models with world model predictions for better robot decision-making.

  2. World models provide dense future state predictions that enable robots to anticipate consequences before acting.

  3. The RAMP framework supplies policies with rich signals, including future predictions and value estimates, instead of sparse binary advantage signals (see the sketch after this list).
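
To make the dense-versus-sparse contrast concrete, the sketch below compares a binary-advantage update (the kind of filtering attributed to RECAP-style baselines) with a value-weighted update that lets graded outcome information reach the policy. The function names, the softmax weighting, and the temperature parameter are assumptions for exposition, not the exact RAMP objective.

```python
# Illustrative contrast: sparse binary-advantage filtering vs. a dense
# value-weighted policy update. Both functions are hypothetical sketches.
import torch


def sparse_binary_advantage_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Keep only transitions with positive advantage; every other sample
    # contributes nothing, so graded value information is discarded.
    mask = (advantages > 0).float()
    return -(mask * log_probs).sum() / mask.sum().clamp(min=1.0)


def dense_value_weighted_loss(log_probs: torch.Tensor, values: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    # Weight every transition by how good the value estimate says it is,
    # so the policy receives a dense, graded learning signal.
    weights = torch.softmax(values / temperature, dim=0)
    return -(weights * log_probs).sum()


log_probs = torch.randn(32)   # per-sample action log-likelihoods
values = torch.randn(32)      # dense value estimates for the same samples
print(sparse_binary_advantage_loss(log_probs, torch.sign(values)))
print(dense_value_weighted_loss(log_probs, values))
```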

Limitations

  • Traditional VLAs suffer from distribution shift as errors compound over multi-step tasks, owing to their reactive architecture.

  • Sparse binary advantage signals in prior approaches like RECAP discard valuable information for policy learning.

Keywords

Vision-language-action models, world models, reinforcement learning, cross-task adaptation, RAMP, RoboChallenge benchmark, robotic manipulation
