Multimodal AI

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
Published: March 3, 2026
Authors: 14
Word Count: 27,178
Code: Included

Unified multimodal models generally underperform on understanding; generation helps only in tasks such as spatial reasoning, visual illusions, and multi-round reasoning.

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks that require varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent gains emerge in spatial intelligence, visual illusion, and multi-round reasoning subtasks, where improved spatial and shape perception, together with multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases across tasks, pretraining data, and model architectures. These findings highlight the need for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
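
To make the contrast between the two inference modes concrete, here is a minimal sketch of direct inference versus Generate-then-Answer (GtA). The UnifiedModel interface, method names, and prompt below are hypothetical illustrations of the protocol, not the paper's API.

```python
from typing import Protocol


class UnifiedModel(Protocol):
    """Hypothetical interface for a unified multimodal model (illustrative only)."""

    def answer(self, image: bytes, question: str) -> str:
        """Answer a question about an image."""
        ...

    def generate_image(self, image: bytes, instruction: str) -> bytes:
        """Produce a new image from an input image and a text instruction."""
        ...


def direct_inference(model: UnifiedModel, image: bytes, question: str) -> str:
    # Baseline: answer directly from the input image.
    return model.answer(image, question)


def generate_then_answer(model: UnifiedModel, image: bytes, question: str) -> str:
    # GtA: first render an intermediate image that makes the required visual
    # transformation explicit, then answer conditioned on that image. Any error
    # in the generated image propagates into the final answer, which is why
    # GtA often degrades performance outside spatial and multi-step tasks.
    intermediate = model.generate_image(
        image, f"Render the visual state needed to answer: {question}"
    )
    return model.answer(intermediate, question)
```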

Key Takeaways

  1. Unified models combining understanding and generation generally underperform base vision-language models on standard understanding tasks.

  2. Generation helps understanding specifically in spatial reasoning, visual illusions, and multi-round reasoning tasks requiring visual transformations.

  3. Model behaviors cluster predictably by shared pretraining data and architecture, suggesting generation-understanding coupling induces consistent inductive biases (see the clustering sketch after this list).
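
One way the class-consistency claim in takeaway 3 could be probed is sketched below under my own assumptions: correlate each model's per-subtask score profile with every other model's, then cluster the correlation matrix; if the finding holds, models sharing architecture or pretraining data should co-cluster. The data here is a random placeholder, not the paper's results.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Rows: models evaluated on UniG2U-Bench; columns: its 30 subtasks.
# Entries: per-subtask accuracy (random placeholder, not real results).
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(30, 30))

# Pearson correlation between models' behavior profiles across subtasks.
corr = np.corrcoef(scores)

# Hierarchical clustering on correlation distance.
dist = 1.0 - corr
condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed form for linkage
labels = fcluster(linkage(condensed, method="average"), t=0.5, criterion="distance")
print(labels)  # cluster id per model
```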

Limitations

  • Generate-then-Answer inference typically degrades performance by propagating visual errors in structurally constrained domains.

  • Existing benchmarks lack systematic evaluation of when generation actively aids understanding versus when it interferes with core tasks.

Keywords

Unified multimodal models, Vision-Language Models, generation-to-understanding, G2U evaluation, spatial intelligence, visual illusions, multi-round reasoning, Generate-then-Answer inference, multi-step intermediate image states, inductive biases
