Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

205 Papers

SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren, Junjie Wang, Yuxin Meng +11 more

Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose th...

multimodal large language models · scientific papers · evidence chains · SIN-Data · SIN-Bench (+6 more)
Jan 15, 2026 · 6

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Yu Wang, Yi Wang, Rui Dai +4 more

As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., bui...

vision-language model · cross-modal recognition · multi-stage reasoning · reinforcement learning · socio-semantic segmentation (+2 more)
Jan 15, 2026 · 151

Future Optical Flow Prediction Improves Robot Control & Video Generation

Kanchana Ranasinghe, Honglu Zhou, Yu Fang +7 more

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We in...

Vision-Language Model · Diffusion architecture · optical flow forecasting · multimodal reasoning · pixel-level generative fidelity (+5 more)
Jan 15, 2026 · 16

HeartMuLa: A Family of Open Sourced Music Foundation Models

Dongchao Yang, Yuxin Xie, Yuguo Yin +25 more

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric reco...

Music Foundation Models · audio-text alignment model · lyric recognition model · music codec tokenizer · LLM-based song generation model (+4 more)
Jan 15, 2026 · 23

World Craft: Agentic Framework to Create Visualizable Worlds via Text

Jianwen Sun, Yukang Feng, Kaining Ying +8 more

Large Language Models (LLMs) motivate generative agent simulation (e.g., AI Town) to create a "dynamic world", holding immense value across entertainment and research. However, for non-experts, especially those without programming skills, it isn't easy to customize a visualizable environment by th...

generative agent simulation · AI Town · large language models · agentic world creation framework · world scaffold (+4 more)
Jan 14, 2026 · 16

CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

Ralf Römer, Yi Zhang, Angela P. Schoellig

To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must cont...

vision-language-action models · continual learning · parameter-efficient fine-tuning · modular adapters · feature similarity (+3 more)
Jan 14, 2026 · 3

More Images, More Problems? A Controlled Analysis of VLM Failure Modes

Anurag Das, Adrian Bulat, Alberto Baldrati +4 more

Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core ...

Large Vision Language Models · multi-image capabilities · benchmark · diagnostic experiments · cross-image aggregation (+2 more)
Jan 12, 2026 · 5