Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

205 Papers

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

Zhixiang Wei, Yi Li, Zhehan Kan +38 more

Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent...

Vision-Language Models, autoregressive supervision, vision-as-target, vision-as-input, visual tokens, +3 more
Jan 27, 2026 · 29

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

Zichen Wen, Boxue Yang, Shuang Chen +31 more

We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque ...

multimodal large language model, scientific multimodal large language model, end-to-end reproducible training pipeline, supervised fine-tuning, reinforcement learning, +5 more
Jan 27, 2026 · 67

Towards Pixel-Level VLM Perception via Simple Points Prediction

Tianhui Song, Haoyu Lu, Hao Yang +8 more

We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates...

Multimodal Large Language Models, segmentation, sequence generation, point prediction, Reinforcement Learning, +4 more
Jan 27, 2026 · 12
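The SimpleSeg abstract frames segmentation as plain sequence generation: the model emits point coordinates as text. A minimal sketch of that idea is below — the quantization grid, coordinate format, and function names are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch: serialize a segmentation contour as a text sequence
# of quantized (x, y) points that an autoregressive model could emit.
# Grid size, image dimensions, and the <x,y> token format are assumptions.

def points_to_text(points, grid=1000, width=640, height=480):
    """Quantize pixel coordinates to a [0, grid) integer grid and
    serialize them as a text sequence."""
    tokens = []
    for x, y in points:
        qx = min(grid - 1, round(x / width * grid))
        qy = min(grid - 1, round(y / height * grid))
        tokens.append(f"<{qx},{qy}>")
    return "".join(tokens)

def text_to_points(text, grid=1000, width=640, height=480):
    """Inverse mapping: parse a generated text sequence back to
    pixel coordinates (exact up to quantization error)."""
    points = []
    for chunk in text.strip("<>").split("><"):
        qx, qy = map(int, chunk.split(","))
        points.append((qx / grid * width, qy / grid * height))
    return points

contour = [(64.0, 48.0), (320.0, 48.0), (320.0, 240.0)]
seq = points_to_text(contour)  # "<100,100><500,100><500,500>"
```

The appeal of such a formulation is that segmentation reuses the model's ordinary text decoder, so no dedicated mask head is needed.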

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Jialong Wu, Xiaoying Zhang, Hongyi Yuan +7 more

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-le...

chain-of-thought reasoning, large language models, multimodal models, visual generation, world models, +4 more
Jan 27, 2026 · 23

TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

Fangxu Yu, Xingang Guo, Lingzhi Yuan +6 more

Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve practical problems. However, this dimension is no...

time series data, generalist models, multi-modal benchmark, time series reasoning, perception, +12 more
Jan 26, 2026 · 9

Agentic Very Long Video Understanding

Aniket Rege, Arka Sadhu, Yuliang Li +5 more

The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires...

entity scene graphs, agentic framework, long-horizon video understanding, structured search, temporal reasoning, +3 more
Jan 26, 2026 · 7

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Mingyang Song, Haoyu Sun, Jiawei Gu +4 more

When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose ...

multimodal large language models, tool use, reinforcement learning, end-task success, adaptive learning mechanism, +5 more
Jan 26, 2026 · 45

AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

Xilin Jiang, Qiaolin Wang, Junkai Wu +30 more

Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic In...

multimodal large language models, audio-visual clips, human-curated benchmark, iconic Internet sounds, cultural context, +2 more
Jan 25, 2026 · 21

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Chenyu Mu, Xin He, Qu Yang +13 more

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a ``semantic gap'' between a creative idea an...

video generation, dialogue-to-cinematic-video, ScripterAgent, DirectorAgent, cross-scene continuous generation, +3 more
Jan 25, 2026 · 48

AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Dongjie Cheng, Ruifeng Yuan, Yongqi Li +5 more

Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of "Omni" MLLMs that support both multimodal inputs and multimodal outputs. While a sequence of omni MLLMs has emerged, most existing systems st...

Omni MLLMs, autoregressive modeling, Transformer decoder, task-aware loss reweighting, token-level perceptual alignment loss, +1 more
Jan 25, 2026 · 6

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Zirui Wang, Junyi Zhang, Jiaxin Ge +9 more

Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puz...

vision-language models, multi-step visual interactions, perception, memory, action, +10 more
Jan 23, 2026 · 29

VIOLA: Towards Video In-Context Learning with Minimal Annotations

Ryo Fujii, Hideo Saito, Ryo Hachiuma

Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, whi...

In-Context Learning, multimodal large language models, label-efficient framework, density-uncertainty-weighted sampling, confidence-aware retrieval, +5 more
Jan 22, 2026 · 4

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Jongwoo Park, Kanchana Ranasinghe, Jinhyeok Jang +3 more

Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the mod...

Vision-Language-Action models, vision encoder, language-model layer, visual-token interactions, geometric structure, +3 more
Jan 22, 2026 · 3
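The IVRA abstract notes that flattening image patches into a 1D token sequence weakens 2D spatial cues. A small sketch of why, under the usual row-major flattening assumption (grid size and helper names here are illustrative, not from the paper):

```python
# Illustrative sketch: with row-major flattening, a patch and the patch
# directly below it sit `cols` positions apart in the token sequence,
# so vertical adjacency is no longer local. Recovering 4-connected
# neighbors from the flat index is the kind of structural hint a
# training-free method could feed back into attention.

def flatten_index(row, col, cols):
    """Row-major position of patch (row, col) in the 1D token sequence."""
    return row * cols + col

def vertical_neighbor_distance(cols):
    """1D distance between a patch and the patch directly below it."""
    return flatten_index(1, 0, cols) - flatten_index(0, 0, cols)

def neighbors_2d(idx, rows, cols):
    """Recover the 4-connected 2D neighbors of a flattened token index."""
    r, c = divmod(idx, cols)
    out = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            out.append(flatten_index(nr, nc, cols))
    return out

# For a 16x16 patch grid, vertically adjacent patches are 16 tokens apart.
```

Because this mapping is deterministic, such adjacency hints can be injected without any fine-tuning, which matches the "training-free" framing in the abstract.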

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin +8 more

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-tr...

video models, robot policy, latent diffusion process, action generation, world model, +4 more
Jan 22, 2026 · 13

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Tingyu Song, Yanzhao Zhang, Mingxin Li +6 more

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precis...

composed image retrieval, multimodal embedding models, image editing, fine-grained benchmark, modality biases, +1 more
Jan 22, 2026 · 13

Rethinking Video Generation Model for the Embodied World

Yufan Deng, Zilin Pan, Hongyu Zhang +6 more

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interact...

video generation models, embodied intelligence, robotics benchmark, robot-oriented video generation, task domains, +9 more
Jan 21, 2026 · 42

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun +4 more

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To ...

Vision-Language Models, progress reasoning, Progress-Bench, ProgressLM-45K, ProgressLM-3B, +2 more
Jan 21, 2026 · 12

BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Shijie Lian, Bin Yu, Xiaopeng Lin +6 more

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets...

Vision-Language-Action models, Information Collapse, Bayesian decomposition, latent action queries, conditional Pointwise Mutual Information, +4 more
Jan 21, 2026 · 54
Page 8 of 11