Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

205 Papers

SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren, Junjie Wang, Yuxin Meng +11 more

Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose th...

multimodal large language models · scientific papers · evidence chains · SIN-Data · SIN-Bench (+6 more)
Jan 15, 2026 · 6

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Yu Wang, Yi Wang, Rui Dai +4 more

As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., bui...

vision-language model · cross-modal recognition · multi-stage reasoning · reinforcement learning · socio-semantic segmentation (+2 more)
Jan 15, 2026 · 151

Future Optical Flow Prediction Improves Robot Control & Video Generation

Kanchana Ranasinghe, Honglu Zhou, Yu Fang +7 more

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We in...

Vision-Language Model · Diffusion architecture · optical flow forecasting · multimodal reasoning · pixel-level generative fidelity (+5 more)
Jan 15, 2026 · 16

HeartMuLa: A Family of Open Sourced Music Foundation Models

Dongchao Yang, Yuxin Xie, Yuguo Yin +25 more

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric reco...

Music Foundation Models · audio-text alignment model · lyric recognition model · music codec tokenizer · LLM-based song generation model (+4 more)
Jan 15, 2026 · 23

World Craft: Agentic Framework to Create Visualizable Worlds via Text

Jianwen Sun, Yukang Feng, Kaining Ying +8 more

Large Language Models (LLMs) motivate generative agent simulation (e.g., AI Town) to create a "dynamic world", holding immense value across entertainment and research. However, for non-experts, especially those without programming skills, it isn't easy to customize a visualizable environment by th...

generative agent simulation · AI Town · large language models · agentic world creation framework · world scaffold (+4 more)
Jan 14, 2026 · 16

CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

Ralf Römer, Yi Zhang, Angela P. Schoellig

To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must cont...

vision-language-action models · continual learning · parameter-efficient fine-tuning · modular adapters · feature similarity (+3 more)
Jan 14, 2026 · 3

More Images, More Problems? A Controlled Analysis of VLM Failure Modes

Anurag Das, Adrian Bulat, Alberto Baldrati +4 more

Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core ...

Large Vision Language Models · multi-image capabilities · benchmark · diagnostic experiments · cross-image aggregation (+2 more)
Jan 12, 2026 · 5