Multimodal AI

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
Published: March 3, 2026
Authors: 14
Word Count: 27,178
Code: Included

Unified multimodal models generally underperform on understanding; generation helps only in tasks such as spatial reasoning, visual illusions, and multi-round reasoning.

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks that require varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent gains emerge in spatial intelligence, visual illusion, and multi-round reasoning subtasks, where improved spatial and shape perception, together with multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases across tasks, pretraining data, and model architectures. These findings highlight the need for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
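
To make the contrast between the two inference modes concrete, here is a minimal sketch of direct inference versus Generate-then-Answer (GtA). The UnifiedModel interface, method names, and prompt below are hypothetical illustrations of the protocol, not the paper's API.

```python
from typing import Protocol


class UnifiedModel(Protocol):
    """Hypothetical interface for a unified multimodal model (illustrative only)."""

    def answer(self, image: bytes, question: str) -> str:
        """Answer a question about an image."""
        ...

    def generate_image(self, image: bytes, instruction: str) -> bytes:
        """Produce a new image from an input image and a text instruction."""
        ...


def direct_inference(model: UnifiedModel, image: bytes, question: str) -> str:
    # Baseline: answer directly from the input image.
    return model.answer(image, question)


def generate_then_answer(model: UnifiedModel, image: bytes, question: str) -> str:
    # GtA: first render an intermediate image that makes the required visual
    # transformation explicit, then answer conditioned on that image. Any error
    # in the generated image propagates into the final answer, which is why
    # GtA often degrades performance outside spatial and multi-step tasks.
    intermediate = model.generate_image(
        image, f"Render the visual state needed to answer: {question}"
    )
    return model.answer(intermediate, question)
```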

Key Takeaways

  1. Unified models combining understanding and generation generally underperform base vision-language models on standard understanding tasks.

  2. Generation helps understanding specifically in spatial reasoning, visual illusions, and multi-round reasoning tasks requiring visual transformations.

  3. Model behaviors cluster predictably by shared pretraining data and architecture, suggesting generation-understanding coupling induces consistent inductive biases (see the clustering sketch after this list).
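
One way the class-consistency claim in takeaway 3 could be probed is sketched below under my own assumptions: correlate each model's per-subtask score profile with every other model's, then cluster the correlation matrix; if the finding holds, models sharing architecture or pretraining data should co-cluster. The data here is a random placeholder, not the paper's results.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Rows: models evaluated on UniG2U-Bench; columns: its 30 subtasks.
# Entries: per-subtask accuracy (random placeholder, not real results).
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(30, 30))

# Pearson correlation between models' behavior profiles across subtasks.
corr = np.corrcoef(scores)

# Hierarchical clustering on correlation distance.
dist = 1.0 - corr
condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed form for linkage
labels = fcluster(linkage(condensed, method="average"), t=0.5, criterion="distance")
print(labels)  # cluster id per model
```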

Limitations

  • Generate-then-Answer inference typically degrades performance by propagating visual errors in structurally constrained domains.

  • Existing benchmarks lack systematic evaluation of when generation actively aids understanding versus when it interferes with core tasks.

Keywords

Unified multimodal models, Vision-Language Models, generation-to-understanding, G2U evaluation, spatial intelligence, visual illusions, multi-round reasoning, Generate-then-Answer inference, multi-step intermediate image states, inductive biases
