Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

202 Papers

OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin +8 more

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified c...

multi-modal LLMs, omni-modal perception, cross-modal reasoning, tool-integrated reasoning, hindsight-guided tree exploration, +1 more
Feb 26, 2026 · 51

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Hongrui Jia, Chaoya Jiang, Shikun Zhang +1 more

As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, ...

Large Multimodal Models, reinforcement learning, diagnostic-driven progressive evolution, continual learning, multimodal data, +2 more
Feb 26, 2026 · 148

The Trinity of Consistency as a Defining Principle for General World Models

Jingxuan Wei, Siyuan Li, Yuhang Xu +21 more

The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential o...

World Models, video generation models, Unified Multimodal Model, multimodal learning, multi-frame reasoning, +4 more
Feb 26, 2026 · 190

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

You Li, Chi Chen, Yanghao Li +4 more

Latent visual reasoning aims to mimic the human imagination process by mediating it through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the tru...

Multimodal Large Language Models, causal mediation analysis, latent tokens, visual reasoning, input-latent disconnect, +2 more
Feb 26, 2026 · 38
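
Causal mediation analysis over hidden states, the tool named in the tags above, amounts to intervening on an intermediate activation and measuring how the output shifts. A minimal PyTorch-hook sketch of that intervention, with the layer choice and the source of the counterfactual states left as illustrative assumptions:

```python
import torch

def patched_forward(model, inputs, layer, patch_hidden):
    """Rerun `model` with one layer's hidden states replaced by
    `patch_hidden`; the resulting logit shift estimates that state's
    causal (mediation) effect. `layer` is any nn.Module inside `model`."""
    def hook(module, args, output):
        # Intervene: swap in counterfactual hidden states, keeping any
        # extra values the layer returns alongside them.
        if isinstance(output, tuple):
            return (patch_hidden,) + output[1:]
        return patch_hidden

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched = model(**inputs)
    finally:
        handle.remove()
    return patched  # compare against the clean forward pass
```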

Large Multimodal Models as General In-Context Classifiers

Marco Garosi, Matteo Farina, Alessandro Conti +2 more

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tas...

Vision-Language Models, Large Multimodal Models, zero-shot classification, in-context learning, open-world classification, +4 more
Feb 26, 2026 · 20
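
The contrast the abstract draws is concrete in code: a CLIP-like contrastive VLM classifies by scoring an image against text prompts in a shared embedding space, while an LMM would instead be prompted with labeled exemplars. A minimal zero-shot sketch using the Hugging Face transformers CLIP API; the checkpoint, image path, and labels are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                  # placeholder image
labels = ["a photo of a cat", "a photo of a dog"]  # prompted class names

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # similarity -> class probabilities
print(dict(zip(labels, probs[0].tolist())))
```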

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao, Hangyu Guo +10 more

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under ro...

multimodal agents, visual reasoning, tool use, benchmark, long-horizon tasks, +3 more
Feb 26, 2026 · 36

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Guibin Chen, Dixuan Lin, Jiangping Yang +46 more

SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while shar...

Multimodal Diffusion Transformer, MMDiT, Multimodal Large Language Models, MMLM, video-audio generation, +6 more
Feb 25, 2026 · 30
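
The dual-stream layout described above, two branches that exchange information while generating aligned modalities, can be sketched generically. The dimensions and joint-attention wiring below are illustrative assumptions, not the SkyReels V4 implementation:

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Illustrative dual-stream diffusion-transformer block: video and audio
    tokens keep separate norms/MLPs but attend jointly for temporal alignment."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.video_norm = nn.LayerNorm(dim)
        self.audio_norm = nn.LayerNorm(dim)
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.audio_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video, audio):
        # Concatenate both streams so attention can align them in time.
        x = torch.cat([self.video_norm(video), self.audio_norm(audio)], dim=1)
        attn_out, _ = self.joint_attn(x, x, x)
        v, a = attn_out.split([video.size(1), audio.size(1)], dim=1)
        return video + v + self.video_mlp(video + v), audio + a + self.audio_mlp(audio + a)

block = DualStreamBlock()
video_tokens = torch.randn(1, 64, 512)  # (batch, video tokens, dim)
audio_tokens = torch.randn(1, 32, 512)  # (batch, audio tokens, dim)
video_out, audio_out = block(video_tokens, audio_tokens)
```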

VecGlypher: Unified Vector Glyph Generation with Language Models

Xiaoke Huang, Bhavul Gauri, Kam Woh Ng +12 more

Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates hi...

multimodal language model, vector glyphs, SVG path tokens, autoregressive generation, typography-aware data, +4 more
Feb 25, 2026 · 11
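
"SVG path tokens" means treating vector drawing commands and coordinates as a discrete sequence a language model can emit autoregressively. A toy tokenizer makes the idea concrete; the vocabulary and coordinate quantization are assumptions for illustration, not VecGlypher's scheme:

```python
# Minimal sketch: flatten an SVG path into discrete tokens an autoregressive
# LM could predict one at a time. Vocabulary/quantization are illustrative.
def tokenize_path(path: str, grid: int = 256) -> list[str]:
    tokens = []
    for part in path.replace(",", " ").split():
        if part.isalpha():  # drawing command: M, L, C, Q, Z, ...
            tokens.append(f"<cmd:{part}>")
        else:               # coordinate: clamp and quantize to a grid
            q = max(0, min(grid - 1, round(float(part))))
            tokens.append(f"<xy:{q}>")
    return tokens

# A simple glyph contour: move, two lines, close.
print(tokenize_path("M 10,20 L 120,20 L 65,200 Z"))
# ['<cmd:M>', '<xy:10>', '<xy:20>', '<cmd:L>', '<xy:120>', ...]
```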

Solaris: Building a Multiplayer Video World Model in Minecraft

Georgy Savva, Oscar Michel, Daohan Lu +6 more

Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To e...

video world models, multiplayer, multi-agent interactions, data collection, staged pipeline, +4 more
Feb 25, 2026 · 13

PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao, Shaoheng Lin, Ming Li +4 more

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models th...

reinforcement learning, agentic multimodal models, interaction collapse, oversampling-filtering-ranking, accumulative tool reward, +4 more
Feb 24, 2026 · 23
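
An "accumulative tool reward" plausibly works against interaction collapse by crediting each useful tool call on top of the task reward, with a cap so tool spam cannot dominate. The coefficients below are illustrative guesses at the general shape, not PyVision-RL's actual reward:

```python
# Hedged sketch: reward shaping that counters interaction collapse by
# accumulating a small bonus per tool call, saturated at a cap.
def shaped_reward(task_reward: float, tool_calls: int,
                  bonus: float = 0.05, cap: int = 6) -> float:
    tool_bonus = bonus * min(tool_calls, cap)  # accumulate, then saturate
    return task_reward + tool_bonus

# A correct answer reached without tools earns less than one reached via
# tools, nudging the policy toward multi-turn, tool-integrated rollouts.
print(shaped_reward(1.0, tool_calls=0))  # 1.0
print(shaped_reward(1.0, tool_calls=3))  # 1.15
```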

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Yuhao Wu, Maojia Song, Yihuai Lan +8 more

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess a...

Vision-Language Model, diffusion-based models, physical constraints, causal constraints, interactive 3D, +2 more
Feb 24, 2026 · 21

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Jaehyun Park, Minyoung Ahn, Minkyu Kim +3 more

Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and larger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a...

diffusion transformer, artifact injection tools, patch-wise embedding manipulation, artifact-aware methodologies, artifact-annotated datasets, +3 more
Feb 24, 2026 · 12
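
Patch-wise embedding manipulation as an artifact-injection tool can be sketched generically: perturb the embeddings of selected image patches before decoding, which yields localized artifacts with known positions to train on. The shapes and noise model below are illustrative assumptions:

```python
import torch

def inject_patch_artifacts(patch_embeds: torch.Tensor,
                           num_patches: int = 4,
                           scale: float = 2.0) -> tuple[torch.Tensor, torch.Tensor]:
    """Perturb a random subset of patch embeddings; return the corrupted
    embeddings plus a mask usable as artifact-location supervision."""
    b, n, d = patch_embeds.shape
    idx = torch.randint(0, n, (b, num_patches))
    mask = torch.zeros(b, n, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    noise = scale * torch.randn_like(patch_embeds)
    corrupted = torch.where(mask.unsqueeze(-1), patch_embeds + noise, patch_embeds)
    return corrupted, mask

embeds = torch.randn(2, 196, 768)  # e.g. 14x14 ViT patches, dim 768
corrupted, mask = inject_patch_artifacts(embeds)
```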

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad +8 more

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligenc...

vision-language-diffusion model, Mobile Conditioning Projector, depthwise-separable convolutions, layerwise alignment, cross-modal conditioning, +6 more
Feb 23, 2026 · 18
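
The depthwise-separable convolutions named in the tags are a standard mobile-efficiency primitive: a per-channel spatial filter followed by a 1x1 channel mixer, costing roughly 1/C_out + 1/k^2 of a dense kxk convolution's multiply-adds. A generic PyTorch sketch of the primitive, not Mobile-O's Mobile Conditioning Projector:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one spatial filter per channel) followed by a
    pointwise 1x1 conv (channel mixing)."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 32, 32)
y = DepthwiseSeparableConv(64, 128)(x)  # -> (1, 128, 32, 32)
```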

Learning Situated Awareness in the Real World

Chuhan Li, Ruilin Han, Joy Hsu +5 more

A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (rel...

multimodal foundation models, egocentric videos, observer-centric relationships, situated awareness, spatial reasoning, +3 more
Feb 18, 2026 · 5

CADEvolve: Creating Realistic CAD via Program Evolution

Maksim Elistratov, Marina Barannikov, Gregory Ivanov +4 more

Computer-Aided Design (CAD) delivers rapid, editable modeling for engineering and manufacturing. Recent AI progress now makes full automation feasible for various CAD tasks. However, progress is bottlenecked by data: public corpora mostly contain sketch-extrude sequences, lack complex operations, mu...

CAD, VLM, evolution-based pipeline, parametric generators, CadQuery, +4 more
Feb 18, 2026 · 26
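
The sketch-extrude sequences the abstract says dominate public corpora are the basic CAD program pattern, and CadQuery (named in the tags) expresses them directly. A minimal example of such a program:

```python
import cadquery as cq

# Classic sketch-extrude program: draw a 2D profile, extrude it into a
# solid, then cut a hole -- the kind of simple sequence public CAD
# corpora are dominated by.
plate = (
    cq.Workplane("XY")
    .rect(40, 30)   # 2D sketch on the XY plane
    .extrude(5)     # lift the sketch into a 40x30x5 solid
    .faces(">Z")    # select the top face
    .workplane()
    .hole(8)        # drill an 8 mm through-hole
)
cq.exporters.export(plate, "plate.step")  # save as a STEP file
```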

Visual Persuasion: What Influences Decisions of Vision-Language Models?

Manuel Cherep, Pranav M R, Pattie Maes +1 more

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferen...

vision-language models, visual utility, revealed preference, visual prompt optimization, image generation model, +4 more
Feb 17, 2026 · 3

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng +33 more

State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dyn...

World Action Model, video diffusion, video backbone, physical dynamics, autoregressive video diffusion model, +3 more
Feb 17, 2026 · 9

Visual Memory Injection Attacks for Multi-Turn Conversations

Christian Schlarmann, Matthias Hein

Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario...

generative large vision-language models, Visual Memory Injection, multi-turn conversation, adversarial marketing, political persuasion, +2 more
Feb 17, 2026 · 3
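
A generic way to craft the kind of adversarial image such an attack relies on is projected gradient descent against a differentiable model objective. The sketch below is vanilla PGD with placeholder `model` and `target_loss_fn`; it illustrates the general technique only, not the paper's attack:

```python
import torch

def pgd_attack(model, image, target_loss_fn, eps=8 / 255, step=1 / 255, iters=40):
    """Vanilla PGD: iteratively nudge pixels within an L-inf ball so the
    model's behavior shifts toward an attacker objective. `model` and
    `target_loss_fn` stand in for a differentiable LVLM pipeline."""
    adv = image.clone().detach()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = target_loss_fn(model(adv))
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - step * grad.sign()                # descend attacker loss
            adv = image + (adv - image).clamp(-eps, eps)  # project into eps-ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()
```
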
Page 2 of 11