Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

205 papers · showing 20 on this page

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

Hyeonbeom Choi, Daechul Ahn, Youhan Lee +3 more

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward pass...

Vision-Language-Action models · test-time scaling · active inference · self-uncertainty · visual perception · +5 more
Feb 4, 2026 · 17

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Yu Bai, MingMing Yu, Chaojie Li +3 more

Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments. As well as transitioning robustly between sub-tasks of different ty...

vision-language model · locomotion primitives · head movements · manipulation commands · human-robot interactions · +6 more
Feb 4, 2026 · 29

Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Jinlong Ma, Yu Zhang, Xuefeng Bai +5 more

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, mov...

Multimodal Large Language Models · GMNER · modality bias · cross-modal reasoning · Multi-style Reasoning Schema Injection · +2 more
Feb 4, 2026 · 5

VLS: Steering Pretrained Robot Policies via Vision-Language Models

Shuo Liu, Ishneet Sukhvinder Singh, Yiqing Xu +2 more

Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where a...

diffusion policies · flow-matching policies · imitation learning · train-test shifts · action generation · +6 more
Feb 3, 2026 · 14

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman +12 more

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world comp...

vision-language models · spatial reasoning · benchmark · visual question-answer pairs · real-world complexity · +4 more
Feb 3, 2026 · 7

AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

Minjun Zhu, Zhen Lin, Yixuan Weng +6 more

High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific i...

scientific illustrations · long-form scientific text · FigureBench · AutoFigure · agentic framework · +7 more
Feb 3, 2026 · 19

Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

Yu Zhang, Mufan Xu, Xuefeng Bai +4 more

Modality following serves as the capacity of multimodal large language models (MLLMs) to selectively utilize multimodal contexts based on user instructions. It is fundamental to ensuring safety and reliability in real-world deployments. However, the underlying mechanisms governing this decision-maki...

multimodal large language models · instruction tokens · modality arbitration · attention layers · multimodal cues · +4 more
Feb 3, 2026 · 5

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng, Wenxuan Huang, Zhen Fang +13 more

Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations...

Multimodal Large Language Models · Vision-DeepResearch · visual-textual fact-finding · VQA · vision search · +3 more
Feb 2, 2026 · 103

ObjEmbed: Towards Universal Multimodal Object Embeddings

Shenghao Fu, Yukun Su, Fengyun Rao +3 more

Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and s...

multimodal embedding models · visual grounding · local image retrieval · global image retrieval · object embedding · +4 more
Feb 2, 2026 · 3

AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

Xintong Zhang, Xiaowen Zhang, Jongrong Wu +8 more

Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels a...

Vision-Language Models · adaptive multimodal reasoning · Matthews Correlation Coefficient · task difficulty · model capacity · +5 more
Feb 2, 2026 · 8

Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

Ruiqi Wu, Xuanhua He, Meng Cheng +8 more

We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradi...

world model · visual memory · hierarchical pose-free memory compressor · latent distillation · generative backbone · +4 more
Feb 2, 2026 · 10

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Yuling Shi, Chaoxiang Xie, Zhensu Sun +7 more

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, ...

Multimodal LLMs · source code understanding · token compression · visual cues · syntax highlighting · +5 more
Feb 2, 2026 · 69

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai +323 more

We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vis...

multimodal agentic model · joint text-vision pre-training · zero-vision SFT · joint text-vision reinforcement learning · Agent Swarm · +2 more
Feb 2, 2026 · 130

How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Huanyu Zhang, Xuehai Bai, Chengzu Li +9 more

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To...

generative models · image editing · visual instruction following · deictic grounding · morphological manipulation · +4 more
Feb 2, 2026 · 13

UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Dianyi Wang, Chaofan Ma, Feng Han +8 more

Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmon...

text-to-image generation · image editing · dual reasoning paradigm · world knowledge-enhanced planning · visual self-correction · +5 more
Feb 2, 2026 · 66

Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

Jun He, Junyan Ye, Zilong Huang +6 more

While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved i...

text-to-image generation · unified understanding-generation models · implicit user intentions · multimodal evidence · reasoning tools · +5 more
Feb 2, 2026 · 21

Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

Chenlong Wang, Yuhang Chen, Zhihan Hu +4 more

Recent advances in unified multimodal models (UMM) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval,...

unified multimodal models · bidirectional benchmark · cognitive coherence · bidirectional inference capability · cross-modal consistency · +2 more
Feb 2, 2026 · 5

Toward Cognitive Supersensing in Multimodal Large Language Model

Boyi Li, Yifan Shen, Yuanzhe Liu +12 more

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Tho...

Multimodal Large Language Models · Chain-of-Thought reasoning · visual reasoning · Latent Visual Imagery Prediction · reinforcement learning · +4 more
Feb 2, 2026 · 14

Unified Personalized Reward Model for Vision Generation

Yibin Wang, Yuhang Zang, Feng Han +4 more

Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforce...

multimodal reward models · visual generation · Bradley-Terry-style preference modeling · generative VLMs · reinforcement learning · +11 more
Feb 2, 2026 · 15

RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval

Tyler Skow, Alexander Martin, Benjamin Van Durme +2 more

Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval rema...

reranking · video retrieval · reasoning-based reranker · curriculum training · supervised fine-tuning · +5 more
Feb 2, 2026 · 18
Page 6 of 11