Multimodal AI

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin, A. Letkin, E. Rusakov, A. Silchenko, V. Vorobyov, S. Sobolnikov, A. Postnikov
Published: January 31, 2026
Authors: 28
Word count: 11,523

Green-VLA: Advancing adaptable, intelligent robotic systems.

Abstract

We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework built for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that lets a single policy control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and gains from RL alignment in success rate, robustness, and long-horizon efficiency.
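The unified, embodiment-aware action interface described above can be illustrated with a minimal sketch: actions from robots with different degrees of freedom are padded into one fixed-width vector, and a per-embodiment mask tells the shared policy which dimensions are live. The names, dimensions, and `MAX_ACTION_DIM` bound here are illustrative assumptions, not details from the Green-VLA paper.

```python
from dataclasses import dataclass

MAX_ACTION_DIM = 32  # assumed upper bound on action width across embodiments


@dataclass
class EmbodimentSpec:
    """Hypothetical per-embodiment adapter into a shared action space."""

    name: str
    action_dim: int  # native joint + gripper channels for this robot

    def encode(self, action: list[float]) -> tuple[list[float], list[int]]:
        """Pad a native action to the unified width; mask marks live dims."""
        assert len(action) == self.action_dim
        pad = MAX_ACTION_DIM - self.action_dim
        return action + [0.0] * pad, [1] * self.action_dim + [0] * pad

    def decode(self, padded: list[float]) -> list[float]:
        """Recover the native action from the unified vector."""
        return padded[: self.action_dim]


# Two embodiments share one policy output width despite different DoF.
widowx = EmbodimentSpec("widowx_arm", action_dim=7)
humanoid = EmbodimentSpec("green_humanoid", action_dim=26)

unified_action, mask = widowx.encode([0.1] * 7)
native_action = widowx.decode(unified_action)
```

Under this scheme the policy always emits `MAX_ACTION_DIM` values; the mask can also gate the loss during multi-embodiment pretraining so padded dimensions contribute no gradient.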

Key Takeaways

  1. Green-VLA enhances robotic adaptability and efficiency.

  2. A unified action space allows versatile control across robot embodiments.

  3. Reinforcement-learning alignment significantly improves task performance.

Limitations

  • Requires high-quality, curated data for training.

  • Performance may drop in truly novel scenarios.

Keywords

Vision-Language-Action, multimodal grounding, multi-embodiment pretraining, embodiment-specific adaptation, reinforcement learning, episode-progress prediction, out-of-distribution detection, joint-prediction-based guidance
