Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data, covering topics such as vision-language models and cross-modal understanding.

205 Papers

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin, Wenjie Zhang +12 more

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this ...

Tags: multimodal contrastive learning, modality gap, geometric anomaly, isotropic assumptions, Fixed-frame Modality Gap Theory, +10 more
Feb 2, 2026 · 108
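
The abstract's "systematically offset regions" has a standard quantitative reading in the contrastive-learning literature: the distance between the centroids of the two modalities' normalized embeddings. The sketch below illustrates only that generic measurement; it is not the paper's Fixed-frame theory or training paradigm, and the dimensions are placeholders.

```python
# Generic modality-gap measurement (not this paper's method): the offset between
# the centroids of L2-normalized image and text embeddings of paired data.
import torch
import torch.nn.functional as F

def modality_gap(image_embs: torch.Tensor, text_embs: torch.Tensor) -> float:
    """image_embs, text_embs: (N, D) embeddings of N image-caption pairs."""
    img = F.normalize(image_embs, dim=-1)    # project onto the unit hypersphere
    txt = F.normalize(text_embs, dim=-1)
    gap = img.mean(dim=0) - txt.mean(dim=0)  # vector between the modality centroids
    return gap.norm().item()

# Placeholder tensors standing in for CLIP-style paired embeddings
print(modality_gap(torch.randn(512, 768), torch.randn(512, 768)))
```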

PromptRL: Prompt Matters in RL for Flow-Based Image Generation

Fu-Yun Wang, Han Zhang, Michael Gharbi +2 more

Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet importan...

Tags: flow matching models, text-to-image generation, reinforcement learning, prompt overfitting, language models, +8 more
Feb 1, 2026 · 5

Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Yu Xu, Yuxin Zhang, Juan Cao +5 more

A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment an...

Tags: visual metaphor transfer, conceptual blending theory, schema grammar, multi-agent framework, cross-domain semantic fusion, +6 more
Feb 1, 2026 · 12

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

I. Apanasevich, M. Artemyev, R. Babakyan +25 more

We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodime...

Tags: Vision-Language-Action, multimodal grounding, multi-embodiment pretraining, embodiment-specific adaptation, reinforcement-learning, +3 more
Jan 31, 2026 · 138

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Benno Krojer, Shravan Nayak, Oscar Mañas +4 more

Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily proce...

Tags: Vision-Language Model, visual tokens, embedding space, LLM, MLP transformation, +4 more
Jan 31, 2026 · 15
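
The "shallow MLP transformation" the abstract mentions is the standard connector pattern in LLaVA-style VLMs. The sketch below shows that generic pattern, not the paper's probing method; dimensions and layer widths are illustrative.

```python
# Generic LLaVA-style connector (not the paper's analysis tooling): a shallow MLP
# that maps vision-encoder patch tokens into the LLM's token-embedding space.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim) from a ViT-style encoder;
        # the output is concatenated with text embeddings before the LLM's first layer.
        return self.mlp(patch_tokens)

visual_tokens = VisualProjector()(torch.randn(2, 576, 1024))  # -> (2, 576, 4096)
```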

ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought

Fanmeng Wang, Haotian Liu, Guojiang Zhao +2 more

While Chain-of-Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suff...

Tags: Chain-of-Thought, Large Language Models, latent reasoning, Variational Auto-Encoding, posterior distribution, +2 more
Jan 30, 2026 · 8
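
For readers unfamiliar with the variational machinery named in the tags, the sketch below is the standard diagonal-Gaussian posterior with the reparameterization trick and its KL regularizer; it is background only, not the paper's latent-reasoning architecture.

```python
# Standard VAE building block (background for the tags, not the paper's model):
# reparameterized sampling from a diagonal-Gaussian posterior plus its KL term.
import torch

def sample_latent(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    eps = torch.randn_like(mu)                 # z = mu + sigma * eps stays differentiable
    return mu + torch.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # KL(q(z|x) || N(0, I)) per example, summed over latent dimensions
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

mu, logvar = torch.zeros(4, 64), torch.zeros(4, 64)  # a 64-dim latent "reasoning" code
z = sample_latent(mu, logvar)
print(kl_to_standard_normal(mu, logvar))             # zeros: posterior matches the prior
```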

PaperBanana: Automating Academic Illustration for AI Scientists

Dawei Zhu, Rui Meng, Yale Song +4 more

Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready a...

Tags: VLMs, image generation models, agentic framework, publication-ready illustrations, methodology diagrams, +3 more
Jan 30, 2026 · 8

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

Anthony Chen, Naomi Ken Korem, Tavi Halperin +5 more

Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from suc...

Tags: audio-video diffusion model, LoRA, video-to-video dubbing, generative model, multilingual videos, +3 more
Jan 29, 2026 · 4

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Haozhe Xie, Beichen Wen, Jiarui Zheng +4 more

Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for...

Tags: Vision-Language-Action models, temporal reasoning, closed-loop adaptation, convolutional vision encoder, multimodal inference, +5 more
Jan 29, 2026 · 58

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang, Yu Zeng, Qiuchen Wang +12 more

Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by "reasoning-then-tool-call" for visual and textual search engines to ob...

Tags: multimodal large language models, visual and textual search engines, reasoning-then-tool-call, multimodal deep-research, multi-turn search, +5 more
Jan 29, 2026 · 116
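
The "reasoning-then-tool-call" pattern in the abstract is, at its core, a multi-turn loop in which the MLLM either emits a tool request or a final answer. The sketch below shows that generic loop with hypothetical callables (`mllm_generate`, the entries of `tools`); none of these names come from the paper.

```python
# Generic reasoning-then-tool-call loop; `mllm_generate` and the entries of `tools`
# are hypothetical stand-ins, not APIs from the paper or any specific library.
import json

def deep_research(question, image, mllm_generate, tools, max_turns=5):
    history = [{"role": "user", "content": question, "image": image}]
    for _ in range(max_turns):
        reply = mllm_generate(history)          # model reasons, then emits JSON such as
        action = json.loads(reply)              # {"type": "tool", "name": "text_search",
        if action["type"] == "final_answer":    #  "query": "..."} or a final answer
            return action["content"]
        observation = tools[action["name"]](action["query"])
        history.append({"role": "tool", "name": action["name"], "content": observation})
    return "no answer within the turn budget"

# tools = {"text_search": ..., "image_search": ...}
# answer = deep_research("Who designed this building?", photo, mllm_generate, tools)
```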

VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Yibo Wang, Yongcheng Jing, Shunyu Liu +5 more

Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to its computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, wh...

Tags: vision-language models, optical memory, token compression, long-context reasoning, vision-text compression, +7 more
Jan 29, 2026 · 7
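
The "optical memory" and "vision-text compression" tags point to a family of methods that render long textual context as images, so the vision encoder consumes it as a few hundred visual tokens instead of thousands of text tokens. The sketch below only illustrates that general rendering step with Pillow; layout, font, and sizes are arbitrary and not taken from the paper.

```python
# Generic text-to-image rendering step behind vision-text compression; all layout
# choices here are illustrative, not the paper's pipeline.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_context_as_image(text: str, width: int = 896, line_chars: int = 100) -> Image.Image:
    lines = []
    for paragraph in text.splitlines():
        lines.extend(textwrap.wrap(paragraph, width=line_chars) or [""])
    font = ImageFont.load_default()
    line_height = 14
    img = Image.new("RGB", (width, line_height * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    return img  # handed to the VLM as one image rather than as raw text tokens

render_context_as_image("A very long retrieved document ... " * 50).save("context_page.png")
```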

MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models

Sangyun Chung, Se Yeon Kim, Youngchae Chee +1 more

Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adap...

Tags: Multimodal Large Language Models, cross-modal hallucinations, modality-interaction control, Modality-Adaptive Decoding, contrastive decoding, +3 more
Jan 29, 2026 · 7
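
One of the tags, contrastive decoding, is the general family MAD belongs to: suppress tokens the model would produce even without the visual evidence. The sketch below shows only that generic family, with an illustrative mixing weight; it is not the paper's modality-adaptive rule.

```python
# Generic contrastive decoding for hallucination mitigation (the tag, not the
# paper's specific modality-adaptive rule); `alpha` is an illustrative weight.
import torch

def contrastive_next_token_logits(logits_with_image: torch.Tensor,
                                  logits_without_image: torch.Tensor,
                                  alpha: float = 1.0) -> torch.Tensor:
    # Both inputs are next-token logits from the same MLLM, computed with and
    # without the image; tokens driven purely by the language prior are penalized.
    return (1 + alpha) * logits_with_image - alpha * logits_without_image

with_img, without_img = torch.randn(32000), torch.randn(32000)
next_token_id = contrastive_next_token_logits(with_img, without_img).argmax()
```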

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Zihan Su, Hongyang Wei, Kangrui Cen +4 more

Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding...

Tags: Unified Multimodal Models, post-training methods, visual understanding, visual generation, auxiliary generation tasks, +7 more
Jan 29, 2026 · 3

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong, Lei Chen, Xuanle Zhao +7 more

The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from...

Tags: OCR, vision-centric OCR, text-centric OCR, end-to-end OCR, data engineering, +5 more
Jan 29, 2026 · 42

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Honglin Lin, Zheng Liu, Yun Zhu +6 more

Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM ...

Tags: Vision Language Models, Chain-of-Thought, multimodal reasoning, Qwen3-VL, CoT rationale generation, +4 more
Jan 29, 2026 · 45

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Yaorui Shi, Shugui Liu, Yu Yang +7 more

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value detail...

Tags: memory systems, context window, visual layout, structured rich-text memory, reinforcement learning, +2 more
Jan 29, 2026 · 7

WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

Runjie Zhou, Youbo Shao, Haoyu Lu +16 more

We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model...

Tags: Multimodal Large Language Models, visual world knowledge, visual knowledge retrieval, reasoning, atomic visual world knowledge, +3 more
Jan 28, 2026 · 5

Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

Chengzu Li, Zanyi Wang, Jiaang Li +9 more

Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation ...

Tags: vision-language models, video generation models, visual reasoning, maze navigation, tangram puzzle, +3 more
Jan 28, 2026 · 15