Multimodal AI

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin, A. Letkin, E. Rusakov, A. Silchenko, V. Vorobyov, S. Sobolnikov, A. Postnikov
Published: January 31, 2026
Authors: 28
Word count: 11,523

Green-VLA: Advancing adaptable, intelligent robotic systems.

Abstract

We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework built for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that lets a single policy control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and gains from RL alignment in success rate, robustness, and long-horizon efficiency.
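The unified, embodiment-aware action interface described above can be illustrated with a minimal sketch: actions from robots with different degrees of freedom are padded into one fixed-width vector, and a per-embodiment mask tells the shared policy which dimensions are live. The names, dimensions, and `MAX_ACTION_DIM` bound here are illustrative assumptions, not details from the Green-VLA paper.

```python
from dataclasses import dataclass

MAX_ACTION_DIM = 32  # assumed upper bound on action width across embodiments


@dataclass
class EmbodimentSpec:
    """Hypothetical per-embodiment adapter into a shared action space."""

    name: str
    action_dim: int  # native joint + gripper channels for this robot

    def encode(self, action: list[float]) -> tuple[list[float], list[int]]:
        """Pad a native action to the unified width; mask marks live dims."""
        assert len(action) == self.action_dim
        pad = MAX_ACTION_DIM - self.action_dim
        return action + [0.0] * pad, [1] * self.action_dim + [0] * pad

    def decode(self, padded: list[float]) -> list[float]:
        """Recover the native action from the unified vector."""
        return padded[: self.action_dim]


# Two embodiments share one policy output width despite different DoF.
widowx = EmbodimentSpec("widowx_arm", action_dim=7)
humanoid = EmbodimentSpec("green_humanoid", action_dim=26)

unified_action, mask = widowx.encode([0.1] * 7)
native_action = widowx.decode(unified_action)
```

Under this scheme the policy always emits `MAX_ACTION_DIM` values; the mask can also gate the loss during multi-embodiment pretraining so padded dimensions contribute no gradient.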

Key Takeaways

  1. Green-VLA enhances robotic adaptability and efficiency.

  2. A unified action space allows versatile control across robot embodiments.

  3. Reinforcement-learning alignment significantly improves task performance.

Limitations

  • Requires high-quality, curated data for training.

  • Performance may drop in truly novel scenarios.

Keywords

Vision-Language-Action, multimodal grounding, multi-embodiment pretraining, embodiment-specific adaptation, reinforcement learning, episode-progress prediction, out-of-distribution detection, joint-prediction-based guidance
