MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu

Published: March 10, 2026
Authors: 11
Word Count: 8,720
Code: Includes code


MM-Zero enables vision-language models to self-evolve without real data using synthetic image generation.

Abstract

Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
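The abstract states that all three roles are trained with GRPO. As a rough illustration of the "group relative" part, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group, which is the standard GRPO advantage estimate. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; the paper's actual training code may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Hypothetical helper for illustration only, not taken from the MM-Zero code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's mean receive a positive advantage and the rest a negative one, so no separate value network is needed. When every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.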

Key Takeaways

1. MM-Zero trains vision-language models without any real images by generating synthetic visual content programmatically.
2. The framework uses three specialized roles (Proposer, Coder, and Solver), all initialized from the same base model.
3. Carefully designed reward mechanisms balance task difficulty using the Goldilocks principle to optimize model self-improvement.

Limitations

Approach still requires executable code generation capability, limiting applicability to domains without clear programmatic representations.
Performance gains demonstrated only on specific multimodal benchmarks; generalization to diverse real-world tasks unclear.

Keywords

self-evolving, Large Language Models, Vision Language Models, reinforcement learning, multimodal reasoning, Group Relative Policy Optimization, visual concepts, executable code, visual verification, difficulty balancing
