Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

202 Papers
Showing 20 of 202 papers

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team, Boyuan Wang, Chaojun Ni +22 more

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robu...

Vision-language-action models · world models · reinforcement learning · cross-task adaptation · RAMP · +2 more

Feb 12, 2026 · 46
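
The abstract contrasts world-model-based learning with VLAs that predict multi-step action chunks from the current observation. A minimal sketch of such an action-chunking head is below; all sizes and names are illustrative assumptions, not GigaBrain-0.5M*'s architecture.

```python
import torch
import torch.nn as nn

class ChunkedActionHead(nn.Module):
    """Toy action-chunking head: maps a fused observation embedding to a
    chunk of H future actions, as in common VLA designs. Hypothetical
    dimensions; not the paper's model."""
    def __init__(self, obs_dim=512, action_dim=7, horizon=8):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, 1024), nn.GELU(),
            nn.Linear(1024, horizon * action_dim),
        )

    def forward(self, obs_emb):                      # (B, obs_dim)
        chunk = self.mlp(obs_emb)                    # (B, H * action_dim)
        return chunk.view(-1, self.horizon, self.action_dim)

obs = torch.randn(2, 512)                            # fused vision-language features
actions = ChunkedActionHead()(obs)                   # (2, 8, 7): 8 steps of 7-DoF actions
```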

MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

Chenhao Zhang, Yazhe Niu, Hongsheng Li

Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visua...

Multimodal Large Language Models · Visual Question Answering · Theory of Mind · visual reinforcement learning · image implication tasks · +7 more

Feb 11, 2026 · 4

PhyCritic: Multimodal Critic Models for Physical AI

Tianyi Xiong, Shihao Wang, Guilin Liu +5 more

With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existin...

multimodal models · physical AI · RLVR pipeline · physical skill warmup stage · self-referential critic finetuning · +3 more

Feb 11, 2026 · 29
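
The abstract describes critic models that emit pairwise preferences alongside numerical scores. A common way to connect the two is the Bradley-Terry model, sketched below; this is a generic formulation, not PhyCritic's published objective.

```python
import torch

def pairwise_preference(score_a: torch.Tensor, score_b: torch.Tensor):
    """Turn two scalar critic scores into a preference probability via the
    Bradley-Terry model, a standard choice for judge/critic training."""
    return torch.sigmoid(score_a - score_b)  # P(response A preferred over B)

# Hypothetical critic outputs for two model responses to the same prompt.
p = pairwise_preference(torch.tensor(2.1), torch.tensor(0.4))
print(f"P(A > B) = {p.item():.3f}")   # higher score -> preferred with high probability
```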

Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Heejeong Nam, Quentin Le Lidec, Lucas Maes +2 more

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric ...

object-centric representations · masked joint embedding prediction · counterfactual reasoning · latent interventions · causal inductive bias · +2 more

Feb 11, 2026 · 4
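
The tags name object-level masked joint-embedding prediction and latent interventions. The sketch below shows one generic step of that recipe: predict a masked object slot's embedding from the other slots, then perturb a latent slot and re-predict to probe a counterfactual. Dimensions and the intervention scheme are assumptions, not C-JEPA's specification.

```python
import torch
import torch.nn as nn

num_slots, dim = 4, 64
slots = torch.randn(1, num_slots, dim)           # object-centric latents
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=1,
)

context = slots.clone()
context[:, 2] = 0.0                              # mask slot 2
pred = predictor(context)[:, 2]                  # predicted embedding of slot 2
loss = nn.functional.mse_loss(pred, slots[:, 2].detach())

intervened = context.clone()
intervened[:, 0] += torch.randn(dim)             # latent intervention on slot 0
counterfactual = predictor(intervened)[:, 2]     # how slot 2's prediction shifts
```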

GENIUS: Generative Fluid Intelligence Evaluation Suite

Ruichuan An, Sihan Yang, Ziyu Guo +8 more

Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity t...

Unified Multimodal Models · Generative Fluid Intelligence · Inducing Implicit Patterns · Executing Ad-hoc Constraints · Adapting to Contextual Knowledge · +1 more

Feb 11, 2026 · 37

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Chenlong Deng, Mengjie Deng, Junjie Wu +10 more

Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than c...

multimodal retrieval systems · visual streams · temporal sequences · agentic paradigm · image retrieval · +8 more

Feb 11, 2026 · 42
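
The abstract argues that query-image relevance should not be measured in isolation from the surrounding visual stream. A toy scoring rule that mixes direct similarity with evidence from temporal neighbors is sketched below; the weighting and functions are illustrative, not the benchmark's metric.

```python
import numpy as np

def context_aware_score(query, image, history, alpha=0.7):
    """Toy context-aware relevance: combine isolated query-image similarity
    with the best match among the image's temporal neighbors."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    direct = cos(query, image)                       # isolated query-image match
    context = max(cos(query, h) for h in history)    # evidence from neighbors
    return alpha * direct + (1 - alpha) * context

rng = np.random.default_rng(0)
q, img = rng.normal(size=16), rng.normal(size=16)
hist = [rng.normal(size=16) for _ in range(3)]       # surrounding frames
print(context_aware_score(q, img, hist))
```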

Olaf-World: Orienting Latent Actions for Video World Modeling

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang +1 more

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate syste...

action-controllable world models · latent action learning · temporal feature differences · self-supervised video encoder · sequence-level control-effect alignment · +3 more

Feb 10, 2026 · 21
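
The tags mention deriving latent actions from temporal feature differences of a self-supervised video encoder. The sketch below stubs that pipeline with a frozen linear "encoder"; the projection that orients differences into a shared action space is an assumption of this sketch, not Olaf-World's method.

```python
import torch
import torch.nn as nn

frames = torch.randn(2, 3, 128)                  # (batch, time, feature dim)
encoder = nn.Linear(128, 64)                     # stand-in for a frozen SSL encoder
to_action = nn.Linear(64, 8)                     # shared latent-action coordinates

with torch.no_grad():
    feats = encoder(frames)                      # (2, 3, 64) per-frame features
deltas = feats[:, 1:] - feats[:, :-1]            # temporal feature differences
latent_actions = to_action(deltas)               # (2, 2, 8): one latent action per step
```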

Code2World: A GUI World Model via Renderable Code Generation

Yuhao Zheng, Li'an Zhong, Yi Wang +6 more

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneousl...

vision-language coder · GUI World model · action-conditioned prediction · AndroidCode · HTML generation · +7 more

Feb 10, 2026 · 143
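
As the abstract frames it, the world model's state is renderable code and prediction is action-conditioned. A skeletal step function in that spirit is below; the prompt format and the `coder` callable are invented for illustration.

```python
def gui_world_step(html_state: str, action: str, coder) -> str:
    """One step of a code-based GUI world model: given the current interface
    as renderable code and an agent action, generate the next interface's
    code. `coder` stands in for any text-generation callable."""
    prompt = (f"Current UI (HTML):\n{html_state}\n\n"
              f"Action: {action}\n\n"
              "Return the HTML of the UI after this action:")
    return coder(prompt)                          # next renderable state

# Hypothetical usage with any text-generation callable as `coder`:
# next_html = gui_world_step("<button id='ok'>OK</button>", "tap #ok", coder)
```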

BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation

Yucheng Hu, Jianke Zhang, Yuanfei Luo +9 more

Equipping embodied agents with the ability to reason about tasks, foresee physical outcomes, and generate precise actions is essential for general-purpose manipulation. While recent Vision-Language-Action (VLA) models have leveraged pre-trained foundation models, they typically focus on either lingu...

Vision-Language-Action models · linguistic planning · visual forecasting · action generation · pretrained unified understanding · +3 more

Feb 10, 2026 · 13

VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

Zhongwei Ren, Yunchao Wei, Xiao Yu +5 more

Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-wo...

latent dynamics model · video diffusion model · action dynamics · visual appearance · latent codes · +5 more

Feb 10, 2026 · 10

EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

Modi Shi, Shijia Peng, Jin Chen +6 more

Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely...

vision-language-action policy · egocentric human demonstrations · humanoid loco-manipulation · embodiment gap · view alignment · +2 more

Feb 10, 2026 · 20

VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Jingwen Sun, Wenyao Zhang, Zekun Qi +6 more

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion,...

Vision-Language-Action · JEPA · latent-action objectives · pixel variation · action-relevant state transitions · +16 more

Feb 10, 2026 · 11
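
The abstract's core complaint is that latent-action objectives anchor to pixel variation rather than action-relevant state transitions. The generic latent world-model objective below compares predictions in embedding space instead of pixel space; sizes and the stop-gradient target are sketch choices, not VLA-JEPA's published recipe.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 256, 7
transition = nn.Sequential(nn.Linear(state_dim + action_dim, 512),
                           nn.GELU(), nn.Linear(512, state_dim))

z_t = torch.randn(4, state_dim)                  # encoder output at time t
a_t = torch.randn(4, action_dim)                 # action taken at time t
z_next_target = torch.randn(4, state_dim)        # encoder output at t+1 (frozen)

z_next_pred = transition(torch.cat([z_t, a_t], dim=-1))
loss = nn.functional.mse_loss(z_next_pred, z_next_target.detach())
```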

SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Hongchi Xia, Xuan Li, Zhaoshuo Li +9 more

Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present...

embodied agents · simulation-ready environments · scene-generation systems · agentic framework · layout generation · +8 more

Feb 10, 2026 · 4

P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Yun Luo, Futing Wang, Qianjia Cheng +28 more

The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the la...

vision-language models · curriculum reinforcement learning · agentic augmentation · multimodal perception · scientific reasoning · +4 more

Feb 10, 2026 · 51

MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team, Donghua Yu, Mingshu Chen +37 more

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sor...

Mixture-of-Experts · MoE · audio-visual content · lip-synced speech · sound effects · +5 more

Feb 9, 2026 · 111
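
The tags point to a Mixture-of-Experts backbone. Below is a minimal top-k MoE layer of the kind such models build on; expert count, k, and sizes are illustrative only, not MOVA's configuration.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer: a router picks k experts per
    token and mixes their outputs by the routing weights."""
    def __init__(self, dim=64, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim)) for _ in range(num_experts)])

    def forward(self, x):                         # (tokens, dim)
        gates = self.router(x).softmax(dim=-1)    # routing probabilities
        weights, idx = gates.topk(self.k, dim=-1) # keep k experts per token
        out = torch.zeros_like(x)
        for rank in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, rank] == e           # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, rank, None] * expert(x[sel])
        return out

tokens = torch.randn(10, 64)                      # toy interleaved A/V tokens
print(TopKMoE()(tokens).shape)                    # torch.Size([10, 64])
```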

Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Yuhao Dong, Shulin Tian, Shuai Liu +6 more

Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. ...

Multimodal Large Language Models · video understanding · in-context learning · video benchmarks · Demo-ICL-Bench · +2 more

Feb 9, 2026 · 28

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Shoubin Yu, Yue Zhang, Zun Wang +4 more

Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination,...

Multimodal Large Language Models · visual spatial reasoning · world models · visual imagination · test-time adaptation · +4 more

Feb 9, 2026 · 5
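
The title asks "when and how much to imagine," suggesting a gated loop over world-model calls. A toy control loop of that shape is below; the entropy gate, the callables, and the budget are hypothetical, not the paper's algorithm.

```python
import math

def answer_with_adaptive_imagination(question, view, model, world_model,
                                     tau=0.5, max_views=4):
    """Only invoke the world model (and only for as many imagined views as
    needed) while the answer distribution remains uncertain."""
    views = [view]
    while True:
        probs = model(question, views)            # answer -> probability dict
        entropy = -sum(p * math.log(p + 1e-9) for p in probs.values())
        if entropy < tau or len(views) >= max_views:
            return max(probs, key=probs.get)      # confident (or budget hit)
        views.append(world_model(views[-1]))      # imagine one more viewpoint

# Toy stand-ins: a "model" that grows confident as imagined views accumulate.
toy_model = lambda q, vs: {"left": 0.5 + 0.2 * len(vs), "right": 0.5 - 0.2 * len(vs)}
toy_world = lambda v: v + "'"
print(answer_with_adaptive_imagination("Which side?", "view0", toy_model, toy_world))
```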

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Linli Yao, Yuancheng Wei, Yaojie Zhang +12 more

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling reade...

Omni Dense Captioning · structured schema · script-like captions · TimeAware · SFT · +5 more

Feb 9, 2026 · 19
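
The abstract introduces a six-dimensional structural schema for timestamped, "script-like" captions. A data-structure sketch in that spirit is below; the six field names are hypothetical placeholders, since the snippet does not enumerate the paper's actual dimensions.

```python
from dataclasses import dataclass

@dataclass
class SceneCaption:
    """Sketch of a time-aware, structured caption record with six content
    dimensions (names invented for illustration)."""
    start_s: float
    end_s: float
    visual_content: str      # what is on screen
    actions: str             # who does what
    camera: str              # shot / camera language
    speech: str              # transcribed or described speech
    sound: str               # non-speech audio events
    scene_context: str       # setting and narrative role

seg = SceneCaption(0.0, 4.2, "A chef plates pasta", "chef garnishes the dish",
                   "close-up, slow pan", "\"And a little basil...\"",
                   "kitchen clatter", "cooking-show finale")
```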

OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang, Xiang An, Yunyao Yan +16 more

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have ...

artificial general intelligence · compression problem · resonance · deep learning · visual signals · +16 more

Feb 9, 2026 · 41

Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye, Jiayi Zhang, Zhuoxin Liu +10 more

We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evalu...

agentic planner-executor framework · context folding · image layer decomposition · multi-turn editing · high-fidelity editing · +7 more

Feb 9, 2026 · 18
Page 4 of 11