Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

202 Papers

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Linli Yao, Yuancheng Wei, Yaojie Zhang +12 more

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling reade...

Omni Dense Captioning · structured schema · script-like captions · TimeAwareSFT · +5 more
Feb 9, 2026 · 19

Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye, Jiayi Zhang, Zhuoxin Liu +10 more

We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evalu...

agentic planner-executor framework · context folding · image layer decomposition · multi-turn editing · high-fidelity editing · +7 more
Feb 9, 2026 · 18

MIND: Benchmarking Memory Consistency and Action Control in World Models

Yixuan Ye, Xuanyu Lu, Yuxin Jiang +7 more

World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and a...

world models · memory consistency · action control · closed-loop benchmark · interactive Video-to-World baseline · +3 more
Feb 8, 2026 · 8

Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Yalcin Tur, Jalal Naghiyev, Haoquan Fang +4 more

Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments and complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuo...

Vision-Language-Action models · Chain-of-Thought prompting · recurrent architecture · weight-tied action head · truncated backpropagation through time · +4 more
Feb 8, 2026 · 40
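The abstract describes scaling test-time compute by iterating a weight-tied recurrent head in latent space instead of generating chain-of-thought tokens. A minimal toy sketch of that idea, assuming nothing about the paper's actual architecture (the scalar update rule, function names, and readout below are all illustrative):

```python
# Toy sketch of latent iterative reasoning with a weight-tied head: the SAME
# parameters are reused every step, so extra test-time compute adds iterations,
# not memory for new weights (unlike token-by-token CoT, which grows linearly).
# The update rule and all names here are illustrative assumptions.

def refine(latent, weight=0.5, bias=1.0, steps=1):
    """Apply one weight-tied update repeatedly: latent <- weight*latent + bias."""
    for _ in range(steps):
        latent = weight * latent + bias
    return latent

def decode_action(latent):
    """Toy readout head mapping the refined latent to an 'action' value."""
    return round(latent, 3)

# Simple inputs can stop early; hard ones iterate longer, converging toward
# the fixed point latent* = bias / (1 - weight) = 2.0 in this toy setting.
shallow = decode_action(refine(0.0, steps=2))   # cheap: 2 iterations
deep = decode_action(refine(0.0, steps=20))     # expensive: 20 iterations
```

The variable `steps` plays the role of implicit test-time compute: the caller chooses depth per input without changing the (shared) parameters.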

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Shenyuan Gao, William Liang, Kaiyuan Zheng +27 more

Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels....

world model · egocentric videos · continuous latent actions · action labels · distillation pipeline · +5 more
Feb 6, 2026 · 18

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang, Zhipeng Li, Yiwen Guo +1 more

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-...

omni-modal large language models · 3D facial animation · speech units · token-as-query gated fusion · temporal scaffolding · +2 more
Feb 6, 2026 · 9

PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

Junxian Li, Kai Liu, Leyang Chen +7 more

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to everyday life, remains underexplored. Image generation and editing in comp...

unified multimodal models · image generation · image editing · spatial reasoning · procedural understanding · +6 more
Feb 6, 2026 · 5

Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Hai Zhang, Siqi Liang, Li Chen +5 more

Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely ...

vision-language navigation · large language models · beyond-the-view navigation · video generation models · sparse future planning · +2 more
Feb 5, 2026 · 18

Self-Improving World Modelling with Latent Actions

Yifu Qiu, Zheng Zhao, Waylon Li +4 more

Internal modelling of the world -- predicting transitions between previous states X and next states Y under actions Z -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framewo...

Forward World Modelling · Inverse Dynamics Modelling · variational information maximisation · ELBO maximisation · reinforcement learning · +6 more
Feb 5, 2026 · 11
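The SWIRL abstract pairs a forward world model with an inverse dynamics model so that unlabelled state trajectories can supervise world modelling. A toy sketch of that bootstrapping loop, assuming nothing about SWIRL's actual models (the integer states, delta-as-action rule, and lookup table are illustrative stand-ins):

```python
# Toy sketch of forward/inverse bootstrapping: from UNLABELLED transitions
# (state, next_state), an inverse dynamics model proposes pseudo-action
# labels, which then supervise a forward world model. Everything here is an
# illustrative assumption, not SWIRL's actual formulation.

def inverse_model(state, next_state):
    """Infer the latent action as the state delta (toy inverse dynamics)."""
    return next_state - state

def train_forward_model(transitions):
    """Fit a lookup: (state, pseudo_action) -> next_state from pseudo-labels."""
    table = {}
    for s, s_next in transitions:
        a = inverse_model(s, s_next)  # pseudo-label the missing action
        table[(s, a)] = s_next
    return table

# Unlabelled trajectory: states only, no action annotations.
transitions = [(0, 1), (1, 3), (3, 4)]
world_model = train_forward_model(transitions)
prediction = world_model[(1, 2)]  # forward prediction: state 1 under action +2
```

In the paper's setting the lookup table would be a learned predictor and the deltas learned latent actions; the point of the sketch is only the direction of supervision.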

SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Jintao Tong, Shilin Yan, Hongwei Xue +5 more

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inje...

Multimodal Large Language Models · visual thoughts · continuous hidden states · reasoning-switchable · autoregressive formulation · +5 more
Feb 5, 2026 · 10

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Sirui Xu, Samuel Schulter, Morteza Ziyadi +4 more

Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors ...

variational policy · imitation learning · reinforcement learning · motion prior · latent skills · +5 more
Feb 5, 2026 · 20

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Dongyang Chen, Chaoyang Wang, Dezhao SU +6 more

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to activ...

Multimodal Large Language Models · Chain-of-Thought reasoning · multimodal retrieval · visual encodings · evidence-driven retrieval · +8 more
Feb 5, 2026 · 7

Reinforced Attention Learning

Bangzheng Li, Jianmo Ni, Chen Qu +5 more

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We...

Reinforcement Learning · Large Language Models · Multimodal LLMs · policy-gradient framework · attention distributions · +5 more
Feb 4, 2026 · 14
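The abstract's keywords point to a policy-gradient framework applied to attention distributions rather than to verbose rationales. A minimal REINFORCE-style toy of that idea, assuming nothing about the paper's actual objective (the region setup, reward, and learning rate are invented for illustration):

```python
import math
import random

# Toy sketch: treat an attention distribution over image regions as a
# stochastic policy, reward attending to the correct region, and nudge the
# logits with the REINFORCE gradient. The task and reward are made up; only
# the "policy gradient over attention" pattern mirrors the abstract.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, target, lr=1.0, rng=random):
    probs = softmax(logits)
    region = rng.choices(range(len(logits)), weights=probs)[0]
    reward = 1.0 if region == target else 0.0
    # d/d logit_i of log pi(region) is (1[i == region] - probs[i])
    return [w + lr * reward * ((i == region) - p)
            for i, (w, p) in enumerate(zip(logits, probs))]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]           # start with uniform attention
for _ in range(200):
    logits = reinforce_step(logits, target=2, rng=rng)
best = max(range(3), key=lambda i: softmax(logits)[i])
```

Because the reward is zero for wrong regions, only the target logit ever increases, so attention mass concentrates on region 2 without any textual rationale being generated.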

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

Hyeonbeom Choi, Daechul Ahn, Youhan Lee +3 more

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward pass...

Vision-Language-Action models · test-time scaling · active inference · self-uncertainty · visual perception · +5 more
Feb 4, 2026 · 17

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Yu Bai, MingMing Yu, Chaojie Li +3 more

Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments. As well as transitioning robustly between sub-tasks of different ty...

vision-language model · locomotion primitives · head movements · manipulation commands · human-robot interactions · +6 more
Feb 4, 2026 · 29

ERNIE 5.0 Technical Report

Haifeng Wang, Hua Wu, Tian Wu +435 more

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-s...

autoregressive foundation model · unified multimodal understanding · unified next-group-of-tokens prediction objective · mixture-of-experts · modality-agnostic expert routing · +3 more
Feb 4, 2026 · 190

Training Data Efficiency in Multimodal Process Reward Models

Jinyuan Li, Chengsong Huang, Langlin Huang +4 more

Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM training. Our preliminar...

Multimodal Process Reward Models · Monte Carlo-annotated corpora · VisualProcessBench · Balanced-Information Score · label mixtures · +3 more
Feb 4, 2026 · 60

Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Jinlong Ma, Yu Zhang, Xuefeng Bai +5 more

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, mov...

Multimodal Large Language Models · GMNER · modality bias · cross-modal reasoning · Multi-style Reasoning Schema Injection · +2 more
Feb 4, 2026 · 5

Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Pingyue Zhang, Zihan Huang, Yue Wang +11 more

Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's abilit...

spatial embodied intelligence · multimodal foundation models · active exploration · spatial belief · cognitive mapping · +5 more
Feb 4, 2026 · 15
Page 5 of 11