Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

48 Papers

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Tingyu Song, Yanzhao Zhang, Mingxin Li +6 more

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precis...

composed image retrieval · multimodal embedding models · image editing · fine-grained benchmark · modality biases · +1 more
Jan 22, 2026 · 13 (sketch below)
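As context for the benchmark above: in composed image retrieval, a query pairs a reference image with a textual modification, and systems rank a gallery of candidates against the fused query. Below is a minimal sketch of that setup; the encoders, dimensions, and fusion layer are untrained, illustrative stand-ins, not the models the paper evaluates.

```python
# Minimal sketch of the CIR setup: a query fuses a reference image with
# an edit instruction, then candidates are ranked by cosine similarity.
# All modules and sizes here are illustrative stand-ins.
import torch
import torch.nn.functional as F

class ToyCIRModel(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.image_proj = torch.nn.Linear(512, dim)  # stands in for an image encoder
        self.text_proj = torch.nn.Linear(300, dim)   # stands in for a text encoder
        self.fuse = torch.nn.Linear(2 * dim, dim)    # combines image + edit text

    def query_embedding(self, ref_image_feat, edit_text_feat):
        img = self.image_proj(ref_image_feat)
        txt = self.text_proj(edit_text_feat)
        return F.normalize(self.fuse(torch.cat([img, txt], dim=-1)), dim=-1)

    def gallery_embedding(self, image_feats):
        return F.normalize(self.image_proj(image_feats), dim=-1)

model = ToyCIRModel()
query = model.query_embedding(torch.randn(1, 512), torch.randn(1, 300))
gallery = model.gallery_embedding(torch.randn(100, 512))
scores = query @ gallery.T        # cosine similarity, since both sides are normalized
print(scores.topk(5).indices)     # indices of the top-5 retrieved candidates
```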

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin +8 more

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-tr...

video models · robot policy · latent diffusion process · action generation · world model · +4 more
Jan 22, 2026 · 11

VIOLA: Towards Video In-Context Learning with Minimal Annotations

Ryo Fujii, Hideo Saito, Ryo Hachiuma

Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, whi...

In-Context Learning · multimodal large language models · label-efficient framework · density-uncertainty-weighted sampling · confidence-aware retrieval · +5 more
Jan 22, 2026 · 4 (sketch below)
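The tag list above mentions density-uncertainty-weighted sampling for choosing which examples to annotate. The truncated abstract does not show VIOLA's exact weighting, so the score below, k-NN density multiplied by predictive uncertainty, is only a generic illustration of the idea.

```python
# Generic sketch of density-uncertainty-weighted sampling for picking
# which unlabeled examples to annotate; VIOLA's actual score is not
# visible in the snippet, so this weighting is an assumption.
import numpy as np

def select_for_annotation(embeddings, uncertainties, k=10, budget=5):
    # Density: mean cosine similarity to the k nearest neighbors,
    # so representative (dense-region) examples score higher.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)           # exclude self-similarity
    knn_sims = np.sort(sims, axis=1)[:, -k:]
    density = knn_sims.mean(axis=1)
    # Weight density by model uncertainty (e.g., predictive entropy),
    # favoring examples that are both typical and hard for the model.
    score = density * uncertainties
    return np.argsort(score)[-budget:]

pool = np.random.randn(200, 64)
entropy = np.random.rand(200)
print(select_for_annotation(pool, entropy))   # indices to send for labeling
```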

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun +4 more

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To ...

Vision-Language Models · progress reasoning · Progress-Bench · ProgressLM-45K · ProgressLM-3B · +2 more
Jan 21, 2026 · 11

Rethinking Video Generation Model for the Embodied World

Yufan Deng, Zilin Pan, Hongyu Zhang +6 more

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interact...

video generation models · embodied intelligence · robotics benchmark · robot-oriented video generation · task domains · +9 more
Jan 21, 2026 · 42

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Yifan Wang, Shiyu Li, Peiming Li +3 more

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and...

Chain-of-Thought prompting · Large Language Models · vision encoders · Vision Language Models · token compression · +5 more
Jan 21, 2026 · 14 (sketch below)
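The title describes the core move: a textual chain of thought is rendered into an image so a vision encoder can consume it as a small number of visual tokens. A minimal rendering step might look like the PIL sketch below; the font, wrapping, and resolution are assumptions, not the paper's scheme.

```python
# Sketch of the rendering step the title describes: draw a textual chain
# of thought onto an image for a vision encoder to consume. Font, layout,
# and resolution are assumptions.
from PIL import Image, ImageDraw

def render_cot_as_image(cot_text, width=448, height=448, margin=8):
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # Naive greedy line wrapping; the paper's exact scheme is not shown.
    words, lines, line = cot_text.split(), [], ""
    for w in words:
        candidate = (line + " " + w).strip()
        if draw.textlength(candidate) > width - 2 * margin:
            lines.append(line)
            line = w
        else:
            line = candidate
    lines.append(line)
    draw.multiline_text((margin, margin), "\n".join(lines), fill="black")
    return img  # would then be fed to the VLM's vision encoder

render_cot_as_image("Step 1: extract the quantities. Step 2: ...").save("cot.png")
```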

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Haowei Zhang, Shudong Yang, Jinlan Fu +2 more

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding ...

Multimodal Large Language Models · video understanding · streaming video inputs · real-time responses · KV cache · +4 more
Jan 21, 2026 · 66 (sketch below)
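The title frames the KV cache as a hierarchical memory for streaming video. One plausible reading, sketched below, keeps recent frames' KV entries at full resolution and mean-pools older ones into a compact long-term tier; HERMES's actual tiering and merge rules are not visible in the snippet, so treat everything here as an assumption-laden illustration.

```python
# Loose sketch of a KV cache as hierarchical memory for streaming video:
# recent entries stay at full resolution, older ones are mean-pooled into
# a long-term tier. The tiering and merge rules are assumptions, not
# HERMES's published design.
import torch

class HierarchicalKVCache:
    def __init__(self, recent_window=64, merge_every=4):
        self.recent_k, self.recent_v = [], []      # short-term, full resolution
        self.longterm_k, self.longterm_v = [], []  # long-term, compressed
        self.recent_window = recent_window
        self.merge_every = merge_every

    def append(self, k, v):
        self.recent_k.append(k)
        self.recent_v.append(v)
        if len(self.recent_k) > self.recent_window:
            # Demote the oldest entries: pool each group into one slot.
            old_k = torch.stack(self.recent_k[: self.merge_every])
            old_v = torch.stack(self.recent_v[: self.merge_every])
            self.longterm_k.append(old_k.mean(dim=0))
            self.longterm_v.append(old_v.mean(dim=0))
            del self.recent_k[: self.merge_every], self.recent_v[: self.merge_every]

    def keys_values(self):
        # Attention would run over long-term + recent memory together.
        return (torch.stack(self.longterm_k + self.recent_k),
                torch.stack(self.longterm_v + self.recent_v))

cache = HierarchicalKVCache(recent_window=8, merge_every=4)
for _ in range(20):
    cache.append(torch.randn(128), torch.randn(128))
k, v = cache.keys_values()
print(k.shape)  # fewer memory slots than the 20 raw entries
```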

Typhoon OCR: Open Vision-Language Model For Thai Document Extraction

Surapon Nonesung, Natapong Nitarach, Teetouch Jaknamon +2 more

Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai presents additional challenges due to script complexity from non-Latin letters, the absence of explicit word boundaries, and the prevalence of hi...

vision-language models · document extraction · OCR · layout reconstruction · structural consistency · +4 more
Jan 21, 2026 · 13

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Letian Zhang, Sucheng Ren, Yanqing Liu +9 more

This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output t...

vision encoder · VAE-compressed image latents · ViT encoder · ViT-VAE decoder · contrastive learning · +6 more
Jan 21, 2026 · 16 (sketch below)
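The abstract spells out the front half of the architecture: VAE-compressed image latents, rather than raw pixels, are fed to a ViT encoder. A minimal sketch follows; the channel counts, patch size, and depth are illustrative, not OpenVision 3's configuration.

```python
# Sketch of the architecture the abstract outlines: VAE-compressed image
# latents are patchified and fed to a ViT encoder so a single
# representation can serve understanding and generation. Sizes are
# illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class LatentViTEncoder(nn.Module):
    def __init__(self, latent_channels=4, patch=2, dim=384, depth=6):
        super().__init__()
        # Patchify the latent grid with a strided conv, as ViTs do for pixels.
        self.patchify = nn.Conv2d(latent_channels, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vae_latents):                 # (B, C, H, W) latent grid
        tokens = self.patchify(vae_latents)         # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim) token sequence
        return self.encoder(tokens)                 # unified visual representation

# A 256x256 image compressed 8x by a VAE gives a 32x32 latent grid.
latents = torch.randn(2, 4, 32, 32)
features = LatentViTEncoder()(latents)
print(features.shape)  # torch.Size([2, 256, 384])
```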

BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Shijie Lian, Bin Yu, Xiaopeng Lin +6 more

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets...

Vision-Language-Action models · Information Collapse · Bayesian decomposition · latent action queries · conditional Pointwise Mutual Information · +4 more
Jan 21, 2026 · 53 (sketch below)
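The tags mention conditional pointwise mutual information. As a reminder of the quantity itself, pmi(a; l | o) = log p(a | l, o) - log p(a | o) measures how much the language instruction l shifts the action distribution beyond what the observation o already implies; how BayesianVLA uses it in training is not shown in the snippet.

```python
# Worked definition of conditional PMI between an action a and an
# instruction l, given an observation o. Purely the textbook quantity;
# its role in BayesianVLA's training is not shown in the snippet.
import math

def conditional_pmi(logp_action_given_instr_obs, logp_action_given_obs):
    return logp_action_given_instr_obs - logp_action_given_obs

# If the instruction makes the action 4x more likely, PMI is log 4.
print(conditional_pmi(math.log(0.4), math.log(0.1)))  # ~1.386
```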

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li +2 more

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the ...

Multimodal Large Language Models · audio-visual cues · future forecasting · cross-modal causal reasoning · temporal reasoning · +4 more
Jan 20, 2026 · 28

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documen...

vision-language model · OCR pipeline · distillation mix · pretraining · checkpoint averaging · +5 more
Jan 20, 2026 · 17 (sketch below)
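Among the tags above, checkpoint averaging is a self-contained trick: average the weights of the last few training checkpoints before evaluation. The sketch below is the generic recipe, not necessarily LightOnOCR's exact procedure.

```python
# Generic checkpoint averaging: load several saved state dicts and
# average them parameter-wise before evaluation. A common recipe, not
# necessarily LightOnOCR's exact procedure.
import torch

def average_checkpoints(paths):
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical usage with the last three training checkpoints:
# model.load_state_dict(average_checkpoints(["ckpt_1.pt", "ckpt_2.pt", "ckpt_3.pt"]))
```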

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Jing Zuo, Lingzhou Mu, Fan Jiang +3 more

Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-T...

Chain-of-Thought · Vision-and-Language Navigation · Visual AutoRegressor · implicit reasoning · multimodal CoT · +4 more
Jan 20, 2026 · 11

GutenOCR: A Grounded Vision-Language Front-End for Documents

Hunter Heidenreich, Ben Elliott, Olivia Dinica +1 more

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and ...

vision-language models · fine-tuning · grounded OCR · prompt-based interface · document understanding · +3 more
Jan 20, 2026 · 28

XR: Cross-Modal Agents for Composed Image Retrieval

Zhongyu Yang, Wei Pang, Yingfang Yuan

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalitie...

compositional image retrieval · multi-agent framework · cross-modal generation · hybrid matching · targeted reasoning
Jan 20, 2026 · 9

Think3D: Thinking with Space for Spatial Reasoning

Zaibin Zhang, Yuhan Wu, Lianjie Jia +9 more

Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle ...

vision large models · 3D reconstruction models · point clouds · camera poses · spatial reasoning · +4 more
Jan 19, 2026 · 41 (sketch below)
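The tags pair point clouds with camera poses; the standard operation linking the two is pinhole projection, p ~ K(RX + t), which maps a 3D point into a chosen view. The sketch below is that textbook operation, independent of Think3D's specific pipeline.

```python
# Textbook pinhole projection: map world-space 3D points into pixel
# coordinates of a camera with intrinsics K and pose (R, t). Standard
# geometry, not Think3D's specific pipeline.
import numpy as np

def project_points(points_3d, K, R, t):
    cam = points_3d @ R.T + t       # world -> camera coordinates
    uv = cam @ K.T                  # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]   # perspective divide -> pixel coords

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 4.0])   # camera 4m in front of the points
points = np.random.randn(8, 3) * 0.5
print(project_points(points, K, R, t))
```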

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Hao Luo, Ye Wang, Wanpeng Zhang +9 more

We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. While existing VLAs often struggle with morphological heterogeneity and data scarcity, we propose a human-centric learning paradigm that tr...

Vision-Language-Action · cross-embodiment generalization · human-centric learning · multimodal data · Unified Action Space · +4 more
Jan 19, 2026 · 68

MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Peizhou Huang, Zixuan Zhong, Zhongwei Wan +12 more

Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-...

multimodal evidence use · citation-grounded report generation · multimodal understanding · deep research agents · Formula-LLM Adaptive Evaluation · +2 more
Jan 18, 2026 · 46

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Yawar Siddiqui, Duncan Frost, Samir Aroudj +9 more

Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casuall...

3D shape generation · visual-inertial SLAM · 3D detection algorithms · vision-language models · rectified flow transformer · +1 more
Jan 16, 2026 · 14 (sketch below)
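The tags mention a rectified flow transformer. Sampling from a rectified flow amounts to integrating dx/dt = v(x, t) from noise at t = 0 to data at t = 1; the Euler sampler below is the generic procedure, with a toy velocity field standing in for the paper's shape model.

```python
# Generic Euler sampler for a rectified flow: integrate dx/dt = v(x, t)
# from Gaussian noise (t=0) to data (t=1). The toy velocity field stands
# in for the paper's trained shape transformer.
import torch

def rectified_flow_sample(velocity_fn, shape, steps=50):
    x = torch.randn(shape)                  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full(shape[:1], i * dt)
        x = x + velocity_fn(x, t) * dt      # Euler step along the flow
    return x

toy_velocity = lambda x, t: -x              # placeholder for the trained model
sample = rectified_flow_sample(toy_velocity, (4, 128))
print(sample.shape)
```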
Page 1 of 3