Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data, covering topics such as vision-language models and cross-modal understanding.

202 Papers

Visual Memory Injection Attacks for Multi-Turn Conversations

Christian Schlarmann, Matthias Hein

Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario...

generative large vision-language models · Visual Memory Injection · multi-turn conversation · adversarial marketing · political persuasion · +2 more
Feb 17, 2026 · 3

Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Aryan Das, Tanishq Rachamalla, Koushik Biswas +2 more

We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-m...

Modality Decoding Attention Block · State Space Mixer · cross-modal fusion · long-range dependency modelling · Spectral-Entropic Uncertainty Loss · +3 more
Feb 16, 2026 · 3
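
The excerpt names a Spectral-Entropic Uncertainty Loss but does not define it. The generic entropic ingredient in uncertainty-aware segmentation is a per-pixel predictive entropy map; the minimal sketch below illustrates only that standard ingredient, not the paper's actual loss, and all names in it are hypothetical.

```python
# Minimal sketch, assuming a standard entropy-based uncertainty map; the
# paper's Spectral-Entropic Uncertainty Loss is not specified in the
# excerpt and may differ substantially.
import numpy as np

def predictive_entropy(logits: np.ndarray) -> np.ndarray:
    """logits: (num_classes, H, W) -> per-pixel entropy map (H, W)."""
    z = logits - logits.max(axis=0, keepdims=True)  # stable softmax
    p = np.exp(z)
    p /= p.sum(axis=0, keepdims=True)
    # High entropy marks pixels where the class posterior is flat,
    # i.e. regions the model is unsure how to segment.
    return -(p * np.log(p + 1e-12)).sum(axis=0)

# Toy usage: 3 classes over a 4x4 image.
rng = np.random.default_rng(0)
print(predictive_entropy(rng.normal(size=(3, 4, 4))).shape)  # (4, 4)
```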

UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^{128} for Unified Multimodal Large Language Model

Shaobin Zhuang, Yuang Ai, Jiaming Han +12 more

Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a...

visual tokenizers · discrete tokenizer · binary codebook · Pre-Post Distillation · Generative-Aware Prior · +9 more
Feb 15, 2026 · 4
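
For intuition on how a codebook of size 2^128 can exist without an explicit table: binarizing a 128-dimensional latent by sign yields an implicit codebook with 2^128 entries, as in lookup-free/binary quantization schemes. The sketch below illustrates that general idea only; whether UniWeTok's tokenizer works exactly this way is not stated in the excerpt.

```python
# Hedged sketch: an implicit binary codebook of size 2**128 via sign
# quantization of a 128-d latent, in the spirit of lookup-free / binary
# quantizers. UniWeTok's actual design may differ; the excerpt only
# states "binary codebook" and the 2^128 size.
import numpy as np

DIM = 128  # 128 binary dims -> 2**128 possible codes, no table stored

def binarize(latent: np.ndarray) -> np.ndarray:
    """Quantize each dimension to {-1, +1} by sign."""
    return np.where(latent >= 0, 1.0, -1.0)

def code_index(bits: np.ndarray) -> int:
    """Pack the {-1,+1} code into a single 128-bit integer token id."""
    return int("".join("1" if b > 0 else "0" for b in bits), 2)

latent = np.random.default_rng(0).normal(size=DIM)
token = code_index(binarize(latent))
print(token.bit_length() <= DIM)  # True
```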

LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Shufan Li, Yuchen Zhu, Jiuxiang Gu +6 more

Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs, and the latest works further extend them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that b...

diffusion language models · multimodal understanding · multimodal generation · unified post-training framework · supervised fine-tuning · +7 more
Feb 15, 2026 · 3

Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

Haonan Jiang, Yuji Wang, Yongjie Zhu +5 more

Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific repres...

Multimodal Large Language Models · Universal Multimodal Embeddings · Chain-of-Thought reasoning · Embedder-Guided Reinforcement Learning · Reasoner · +5 more
Feb 14, 2026 · 5
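
The pattern the excerpt describes, generating Chain-of-Thought reasoning before encoding, can be pictured as a two-stage pipeline. Below is a minimal, hypothetical sketch of that reason-then-embed flow; the function names and interfaces are stand-ins, not Embed-RL's actual API.

```python
# Hedged sketch of the reason-then-embed pattern: a generative model
# first writes a chain-of-thought about the input, and the embedder
# encodes the input together with that reasoning. All names here are
# hypothetical stand-ins.
from typing import Callable

def reason_then_embed(
    query: str,
    reasoner: Callable[[str], str],          # MLLM producing a CoT trace
    embedder: Callable[[str], list[float]],  # encoder producing a vector
) -> list[float]:
    cot = reasoner(query)                    # e.g. "The image shows ..."
    return embedder(f"{query}\n[reasoning] {cot}")

# Toy stand-ins so the sketch runs end-to-end.
toy_reasoner = lambda q: f"step-by-step analysis of: {q}"
toy_embedder = lambda text: [float(len(text) % 7), float(len(text) % 11)]
print(reason_then_embed("match caption to image", toy_reasoner, toy_embedder))
```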

BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Huanyao Zhang, Jiepeng Zhou, Bo Li +22 more

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain ...

multimodal large language models · multimodal browsing · deep search · web browsing · multimodal information integration · +4 more
Feb 13, 2026 · 4

Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li, Hengrui Zhang, Meng-Hao Guo +5 more

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descri...

audiovisual instruction annotations · supervised fine-tuning · audiovisual captioning · attribute-wise captioning · caption-based QA · +1 more
Feb 13, 2026 · 6

Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

Aadarsh Sahoo, Georgia Gkioxari

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knif...

conversational image segmentation · referring image grounding · language-guided segmentation · segmentation priors · language understanding · +2 more
Feb 13, 2026 · 3

Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Rui Cai, Jun Guo, Xinze He +20 more

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on...

vision-language-action · cross-embodiment robot trajectories · pre-trained VLM · catastrophic forgetting · asynchronous execution · +5 more
Feb 13, 2026 · 3

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang +17 more

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimod...

vision-language foundation model · entity-aware continual pretraining · heterogeneous medical corpora · long-tail gaps · reinforcement learning · +5 more
Feb 13, 2026 · 44

DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Xu Guo, Fulong Ye, Qichao Sun +7 more

Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. F...

conditional diffusion transformer · symmetric conditional injection scheme · dual-level disentanglement · synchronized RoPE · structured captions · +8 more
Feb 12, 2026 · 37

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Dianyi Wang, Ruihang Li, Feng Han +17 more

Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities co...

unified multimodal models · image generation · image editing · parameter scale · VLM layers · +11 more
Feb 12, 2026 · 75

Code2Worlds: Empowering Coding LLMs for 4D World Generation

Yi Zhang, Yunshuang Wang, Zeyu Zhang +1 more

Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges...

language-to-simulation code generation · dual-stream architecture · retrieval-augmented object generation · hierarchical environmental orchestration · physics-aware closed-loop mechanism · +4 more
Feb 12, 2026 · 3
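
The "physics-aware closed-loop mechanism" the tags mention suggests a generate-check-repair cycle between a coding LLM and a physics validator. The sketch below shows one plausible shape for such a loop, with toy stand-ins; Code2Worlds' real interfaces are not documented in this excerpt.

```python
# Hedged sketch of a closed-loop language-to-simulation pipeline: an LLM
# emits scene code, a checker validates physical constraints, and any
# violations are fed back for a retry. All names are hypothetical.
from typing import Callable

def generate_world(
    prompt: str,
    llm_to_code: Callable[[str], str],          # LLM: description -> scene code
    physics_check: Callable[[str], list[str]],  # returns violations, [] if ok
    max_retries: int = 3,
) -> str:
    code = llm_to_code(prompt)
    for _ in range(max_retries):
        violations = physics_check(code)
        if not violations:
            return code
        code = llm_to_code(f"{prompt}\nfix these issues: {violations}")
    return code

# Toy stand-ins so the loop is executable.
to_code = lambda p: "ball.mass = 1.0" if "fix" in p else "ball.mass = -1.0"
check = lambda c: ["mass must be positive"] if "-1.0" in c else []
print(generate_world("a bouncing ball", to_code, check))  # ball.mass = 1.0
```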

Thinking with Drafting: Optical Decompression via Logical Reconstruction

Jingxuan Wei, Honghao He, Caijun Jia +9 more

Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative ...

multimodal large language models · visual perception · visual generation · optical decompression · visual tokens · +5 more
Feb 12, 2026 · 31

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan +9 more

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out regions of i...

Multimodal Large Language Models · visual question answering · fine-grained perception · Thinking-with-Images · region-to-image distillation · +5 more
Feb 12, 2026 · 52
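
For context on what is being distilled away: "Thinking-with-Images" style pipelines answer fine-grained questions by iteratively cropping and re-inspecting regions, roughly as sketched below. This is a hypothetical illustration of the multi-step baseline the abstract references, not of the paper's single-pass distilled model.

```python
# Hedged sketch of the iterative zoom loop used by Thinking-with-Images
# style methods. All callables are hypothetical stand-ins.
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def zoom_and_answer(
    image_size: Tuple[int, int],
    propose_region: Callable[[Box], Box],         # model picks a sub-box
    answer_or_none: Callable[[Box], str | None],  # answer once evidence suffices
    max_steps: int = 3,
) -> str:
    box: Box = (0, 0, *image_size)  # start from the full image
    for _ in range(max_steps):
        ans = answer_or_none(box)
        if ans is not None:
            return ans
        box = propose_region(box)   # crop tighter around the evidence
    return "unanswered"

# Toy usage: shrink toward the center until the box is small enough.
shrink = lambda b: (b[0] + (b[2]-b[0])//4, b[1] + (b[3]-b[1])//4,
                    b[2] - (b[2]-b[0])//4, b[3] - (b[3]-b[1])//4)
done = lambda b: "found it" if (b[2]-b[0]) < 300 else None
print(zoom_and_answer((1024, 1024), shrink, done))  # found it
```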

Adapting Vision-Language Models for E-commerce Understanding at Scale

Matteo Nulli, Vladimir Orshulevich, Tala Bazazo +9 more

E-commerce product understanding by nature demands strong multimodal comprehension across text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the a...

Vision-Language Models · multimodal comprehension · e-commerce data · attribute-centric · multi-image · +6 more
Feb 12, 2026 · 11

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan +11 more

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, o...

unified models · multimodal understanding · multimodal generation · test-time scaling · chain-of-thought reasoning · +5 more
Feb 12, 2026 · 19
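
Test-time scaling for a unified model, as the excerpt frames it, means spending extra inference compute on iterative refinement rather than a single pass. A minimal sketch of such a refine loop follows; the callables are hypothetical stand-ins for the unified model's generate and self-critique steps, not UniT's documented procedure.

```python
# Hedged sketch of chain-of-thought test-time scaling: critique and
# refine the output over several rounds instead of emitting one pass.
from typing import Callable

def refine_at_test_time(
    prompt: str,
    generate: Callable[[str], str],       # unified model: prompt -> draft
    critique: Callable[[str, str], str],  # (prompt, draft) -> feedback
    accept: Callable[[str], bool],        # stop once the draft passes
    rounds: int = 4,
) -> str:
    draft = generate(prompt)
    for _ in range(rounds):
        if accept(draft):
            break
        feedback = critique(prompt, draft)
        draft = generate(f"{prompt}\n[previous draft] {draft}\n[fix] {feedback}")
    return draft

# Toy usage with string stand-ins for the model calls.
gen = lambda p: p[-20:]
crit = lambda p, d: "add missing object"
ok = lambda d: "sofa" in d
print(refine_at_test_time("draw two cats on a red sofa", gen, crit, ok))
```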

Multimodal Fact-Level Attribution for Verifiable Reasoning

David Wan, Han Wang, Ziyang Wang +3 more

Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal groundi...

Multimodal large language models · multimodal grounding · multimodal reasoning · fact-level attribution · automatic evaluation framework · +2 more
Feb 12, 2026 · 4
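
Fact-level attribution, as described, attaches evidence pointers to individual claims rather than to the answer as a whole. The dataclass sketch below illustrates that granularity; the schema is a hypothetical illustration, not the paper's format.

```python
# Hedged sketch: a long-form answer decomposed into atomic claims, each
# carrying pointers to the input sources (image regions, text spans)
# that support it. The schema is hypothetical.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source_id: str  # e.g. "image_1" or "doc_2"
    locator: str    # e.g. a bounding box or character span

@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)

    def is_grounded(self) -> bool:
        return len(self.evidence) > 0

answer = [
    Claim("The chart shows revenue rising in Q3.",
          [Evidence("image_1", "bbox=(40,10,320,200)")]),
    Claim("The report attributes the rise to new markets.",
          [Evidence("doc_2", "chars 1045-1180")]),
    Claim("Growth will continue next year."),  # unsupported claim
]
print([c.is_grounded() for c in answer])  # [True, True, False]
```
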
Page 3 of 11