Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data, including work on vision-language models and cross-modal understanding.

48 Papers

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Qiyuan Zhang, Biao Gong, Shuai Tan +7 more

Physical principles are fundamental to realistic visual simulation, but they remain largely overlooked in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core element of classical mechanics. While computer graphics and physics-based simul...

reinforcement learning, video generation, physical collision rules, high-dimensional spaces, physics-aware, +2 more
Jan 16, 2026 · 8

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Yawar Siddiqui, Duncan Frost, Samir Aroudj +9 more

Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casuall...

3D shape generation, visual-inertial SLAM, 3D detection algorithms, vision-language models, rectified flow transformer, +1 more
Jan 16, 2026 · 14

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Linqing Zhong, Yi Liu, Yifei Wei +4 more

Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary ...

Vision-Language-Model, action space, action chain-of-thought, ACoT, Explicit Action Reasoner, +4 more
Jan 16, 2026 · 22

SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren, Junjie Wang, Yuxin Meng +11 more

Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose th...

multimodal large language models, scientific papers, evidence chains, SIN-Data, SIN-Bench, +6 more
Jan 15, 2026 · 6

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Yu Wang, Yi Wang, Rui Dai +4 more

As hubs of human activity, urban surfaces contain a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., bui...

vision-language model, cross-modal recognition, multi-stage reasoning, reinforcement learning, socio-semantic segmentation, +2 more
Jan 15, 2026 · 151

HeartMuLa: A Family of Open Sourced Music Foundation Models

Dongchao Yang, Yuxin Xie, Yuguo Yin +25 more

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric reco...

Music Foundation Models, audio-text alignment model, lyric recognition model, music codec tokenizer, LLM-based song generation model, +4 more
Jan 15, 2026 · 23

Future Optical Flow Prediction Improves Robot Control & Video Generation

Kanchana Ranasinghe, Honglu Zhou, Yu Fang +7 more

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We in...

Vision-Language Model, Diffusion architecture, optical flow forecasting, multimodal reasoning, pixel-level generative fidelity, +5 more
Jan 15, 2026 · 16

CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

Ralf Römer, Yi Zhang, Angela P. Schoellig

To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must cont...

vision-language-action models, continual learning, parameter-efficient fine-tuning, modular adapters, feature similarity, +3 more
Jan 14, 2026 · 3
Page 2 of 3