
GigaBrain-0.5M*: A VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
Published: February 12, 2026
Authors: 25
Word count: 8,295

GigaBrain-0.5M* enhances robot learning by integrating world model predictions into vision-language-action policies.

Abstract

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).
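
As a rough illustration of the world-model-conditioned policy idea described above, the sketch below rolls out imagined future latents from a video world model and conditions an action-chunk head on them. All module names, tensor shapes, and the fusion scheme (VideoWorldModel, WorldModelConditionedPolicy, mean-pooled futures) are hypothetical placeholders for exposition, not the actual GigaBrain-0.5M* architecture.

```python
# Minimal sketch: condition an action-chunk policy on futures imagined by a
# video world model. Names and shapes are illustrative assumptions only.
import torch
import torch.nn as nn


class VideoWorldModel(nn.Module):
    """Stand-in for a pretrained video world model that predicts future latents."""

    def __init__(self, latent_dim: int = 256, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        self.dynamics = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def rollout(self, obs_latent: torch.Tensor) -> torch.Tensor:
        # Autoregressively imagine `horizon` future latent states from the
        # current observation latent: (B, D) -> (B, horizon, D).
        futures, state, hidden = [], obs_latent.unsqueeze(1), None
        for _ in range(self.horizon):
            state, hidden = self.dynamics(state, hidden)
            futures.append(state)
        return torch.cat(futures, dim=1)


class WorldModelConditionedPolicy(nn.Module):
    """Action-chunk head that sees both the current latent and imagined futures."""

    def __init__(self, latent_dim: int = 256, action_dim: int = 7, chunk_len: int = 16):
        super().__init__()
        self.world_model = VideoWorldModel(latent_dim)
        self.fuse = nn.Linear(latent_dim * 2, latent_dim)
        self.action_head = nn.Linear(latent_dim, action_dim * chunk_len)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, obs_latent: torch.Tensor) -> torch.Tensor:
        futures = self.world_model.rollout(obs_latent)      # (B, H, D) imagined futures
        future_summary = futures.mean(dim=1)                # pool the imagined rollout
        fused = torch.relu(self.fuse(torch.cat([obs_latent, future_summary], dim=-1)))
        actions = self.action_head(fused)                    # (B, chunk_len * action_dim)
        return actions.view(-1, self.chunk_len, self.action_dim)


policy = WorldModelConditionedPolicy()
action_chunk = policy(torch.randn(2, 256))  # future-conditioned multi-step action chunk
print(action_chunk.shape)                   # torch.Size([2, 16, 7])
```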

Key Takeaways

  1. GigaBrain-0.5M* combines vision-language-action models with world model predictions for better robot decision-making.

  2. World models provide dense future state predictions that enable robots to anticipate consequences before acting.

  3. The RAMP framework supplies policies with rich signals, including future predictions and value estimates, instead of sparse binary advantage signals (see the sketch after this list).
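
To make the dense-versus-sparse contrast concrete, the sketch below compares a binary-advantage update (the kind of filtering attributed to RECAP-style baselines) with a value-weighted update that lets graded outcome information reach the policy. The function names, the softmax weighting, and the temperature parameter are assumptions for exposition, not the exact RAMP objective.

```python
# Illustrative contrast: sparse binary-advantage filtering vs. a dense
# value-weighted policy update. Both functions are hypothetical sketches.
import torch


def sparse_binary_advantage_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Keep only transitions with positive advantage; every other sample
    # contributes nothing, so graded value information is discarded.
    mask = (advantages > 0).float()
    return -(mask * log_probs).sum() / mask.sum().clamp(min=1.0)


def dense_value_weighted_loss(log_probs: torch.Tensor, values: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    # Weight every transition by how good the value estimate says it is,
    # so the policy receives a dense, graded learning signal.
    weights = torch.softmax(values / temperature, dim=0)
    return -(weights * log_probs).sum()


log_probs = torch.randn(32)   # per-sample action log-likelihoods
values = torch.randn(32)      # dense value estimates for the same samples
print(sparse_binary_advantage_loss(log_probs, torch.sign(values)))
print(dense_value_weighted_loss(log_probs, values))
```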

Limitations

  • Traditional VLAs suffer from distribution shift as errors compound over multi-step tasks, owing to their reactive architecture.

  • Sparse binary advantage signals in prior approaches like RECAP discard valuable information for policy learning.

Keywords

Vision-language-action models, world models, reinforcement learning, cross-task adaptation, RAMP, RoboChallenge benchmark, robotic manipulation
