
WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

Runjie Zhou, Youbo Shao, Haoyu Lu, Bowei Xing, Tongtong Bai, Yujie Chen, Jie Zhao, Lin Sui, Haotian Yao, Zijia Zhao, Hao Yang, Haoning Wu, Zaida Zhou, Jinguo Zhu, Zhiqi Huang, Yiping Bao, Yangyang Liu, Y. Charles, Xinyu Zhou
Published January 28, 2026

Abstract

We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
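
Since the abstract frames WorldVQA as a pure retrieval probe of entity naming, decoupled from reasoning, a minimal scoring sketch can make that protocol concrete. Everything below is an assumption rather than the authors' released format: the item schema (image_path, question, answer, aliases, tier), the abstention markers, and the tier labels are illustrative only.

    import string
    from dataclasses import dataclass, field

    # Hypothetical abstention markers, compared after normalization
    # (apostrophes and other punctuation are stripped).
    ABSTAIN = {"", "unknown", "unsure", "i dont know", "dont know"}

    @dataclass
    class WorldVQAItem:
        # Hypothetical schema for one atomic knowledge probe; the real
        # benchmark format is not specified in the abstract.
        image_path: str
        question: str                 # e.g. "What species of bird is shown?"
        answer: str                   # canonical entity name
        aliases: list[str] = field(default_factory=list)  # accepted synonyms
        tier: str = "head"            # taxonomy stratum, "head" ... "long_tail"

    def normalize(text: str) -> str:
        # Lowercase, drop punctuation, and collapse whitespace before matching.
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())

    def score(items, predict):
        # `predict` is any callable mapping (image_path, question) -> str.
        # Abstentions are tallied separately from wrong names so that a
        # hallucination rate can be reported independently of coverage.
        stats = {}
        for item in items:
            tier = stats.setdefault(
                item.tier, {"correct": 0, "hallucinated": 0, "abstained": 0}
            )
            response = normalize(predict(item.image_path, item.question))
            gold = {normalize(a) for a in [item.answer, *item.aliases]}
            if response in ABSTAIN:
                tier["abstained"] += 1
            elif response in gold:
                tier["correct"] += 1
            else:
                tier["hallucinated"] += 1
        return stats

Counting abstentions separately from confident wrong names is one plausible way a benchmark of this kind could report hallucination rate per taxonomy stratum, rather than collapsing head-class and long-tail performance into a single accuracy figure.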

Keywords

Multimodal Large Language Models, visual world knowledge, visual knowledge retrieval, reasoning, atomic visual world knowledge, visual factuality, encyclopedic breadth, hallucination rates
