Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data including vision-language models and cross-modal understanding.

205 Papers

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Letian Zhang, Sucheng Ren, Yanqing Liu +9 more

This paper presents a family of advanced vision encoders, named OpenVision 3, that learn a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output t...
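The pipeline the excerpt describes (VAE-compressed latents fed to a ViT encoder that produces one unified representation) can be sketched roughly as follows. All module internals, sizes, and the random projections are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def vae_encode(image, downsample=8, channels=4):
    """Stand-in VAE encoder: average-pool the image into a latent grid,
    then project to `channels` latent channels with a fixed random map."""
    h, w, _ = image.shape
    lh, lw = h // downsample, w // downsample
    pooled = image.reshape(lh, downsample, lw, downsample, -1).mean(axis=(1, 3))
    proj = np.random.default_rng(0).standard_normal((pooled.shape[-1], channels))
    return pooled @ proj  # (lh, lw, channels)

def patchify(latents, patch=2):
    """Group the latent grid into non-overlapping patches (tokens)."""
    lh, lw, c = latents.shape
    tokens = latents.reshape(lh // patch, patch, lw // patch, patch, c)
    return tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def vit_encode(tokens, dim=64):
    """Stand-in ViT: a single linear projection plus mean pooling, in place
    of real transformer blocks."""
    w = np.random.default_rng(1).standard_normal((tokens.shape[-1], dim))
    embedded = tokens @ (w / np.sqrt(tokens.shape[-1]))
    return embedded, embedded.mean(axis=0)  # per-token and pooled features

image = np.random.default_rng(2).random((256, 256, 3))
latents = vae_encode(image)             # (32, 32, 4) latent grid
tokens = patchify(latents)              # (256, 16) latent tokens
per_token, unified = vit_encode(tokens)
print(per_token.shape, unified.shape)   # (256, 64) (64,)
```

The point of the sketch is the dataflow: understanding tasks would read the pooled/per-token features, while a generation head (the ViT-VAE decoder named in the tags) would map them back to latents.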

vision encoder · VAE-compressed image latents · ViT encoder · ViT-VAE decoder · contrastive learning · +6 more
Jan 21, 2026 · 16

Typhoon OCR: Open Vision-Language Model For Thai Document Extraction

Surapon Nonesung, Natapong Nitarach, Teetouch Jaknamon +2 more

Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai presents additional challenges due to script complexity from non-Latin letters, the absence of explicit word boundaries, and the prevalence of hi...

vision-language models · document extraction · OCR · layout reconstruction · structural consistency · +4 more
Jan 21, 2026 · 13

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Yifan Wang, Shiyu Li, Peiming Li +3 more

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and...
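The core idea named in the title, rendering a textual chain of thought as an image so a vision encoder can consume it in far fewer visual tokens, can be illustrated with a toy serializer. The byte-to-pixel scheme below is purely an assumption for illustration, not the paper's renderer.

```python
import numpy as np

def render_cot(cot_text, height=32, width=32):
    """Toy 'renderer': pack the text's UTF-8 bytes into a fixed-size
    (height, width) grayscale image in [0, 1]."""
    data = np.frombuffer(cot_text.encode("utf-8"), dtype=np.uint8)
    canvas = np.zeros(height * width, dtype=np.uint8)
    canvas[: min(len(data), canvas.size)] = data[: canvas.size]
    return (canvas / 255.0).reshape(height, width)

cot = "Step 1: compute 12 * 4 = 48. Step 2: add 2, giving 50."
image = render_cot(cot)

# A patch-based vision encoder would see a small, fixed number of visual
# tokens (here 8x8 patches) regardless of the textual CoT's length.
num_patches = (image.shape[0] // 8) * (image.shape[1] // 8)
print(image.shape, num_patches)  # (32, 32) 16
```

This is the compression argument in miniature: the visual token count is fixed by the canvas and patch size, while the textual CoT's token count grows with its verbosity.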

Chain-of-Thought prompting · Large Language Models · vision encoders · Vision Language Models · token compression · +5 more
Jan 21, 2026 · 14

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Haowei Zhang, Shudong Yang, Jinlan Fu +2 more

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvements in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding ...
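The idea named in the title, treating the KV cache as a hierarchical memory for a video stream, can be sketched with a two-tier policy: recent frames kept exact, older frames compressed into a compact long-term store. The mean-pooling eviction rule below is an assumption for illustration, not the HERMES mechanism itself.

```python
from collections import deque

import numpy as np

class HierarchicalKVCache:
    """Toy two-tier KV memory: exact short-term frames, pooled long-term
    summaries. Sizes and the pooling rule are illustrative assumptions."""

    def __init__(self, short_term_size=4, pool_group=2):
        self.short_term = deque()   # exact per-frame KV tensors
        self.long_term = []         # mean-pooled summaries of older frames
        self.short_term_size = short_term_size
        self.pool_group = pool_group

    def append(self, frame_kv):
        self.short_term.append(frame_kv)
        # On overflow, evict the oldest frames into one pooled summary.
        while len(self.short_term) > self.short_term_size:
            group = [self.short_term.popleft() for _ in range(self.pool_group)]
            self.long_term.append(np.mean(group, axis=0))

    def context(self):
        # The model would attend over summaries first, then recent frames.
        return self.long_term + list(self.short_term)

cache = HierarchicalKVCache()
for t in range(10):
    cache.append(np.full((8, 16), float(t)))  # fake (tokens, dim) KV per frame
print(len(cache.long_term), len(cache.short_term), len(cache.context()))
```

After 10 streamed frames the cache holds 3 pooled summaries plus 4 exact frames, so attention cost grows sublinearly with stream length, which is the efficiency property a streaming model needs.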

Multimodal Large Language Models · video understanding · streaming video inputs · real-time responses · KV cache · +4 more
Jan 21, 2026 · 71

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Bin Yu, Shijie Lian, Xiaopeng Lin +8 more

Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sens...

Vision-Language-Action models · Vision-Language Models · robotic control · catastrophic forgetting · frozen Left Brain · +5 more
Jan 20, 2026 · 56

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Jing Zuo, Lingzhou Mu, Fan Jiang +3 more

Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-T...

Chain-of-Thought · Vision-and-Language Navigation · Visual AutoRegressor · implicit reasoning · multimodal CoT · +4 more
Jan 20, 2026 · 11

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li +2 more

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the ...

Multimodal Large Language Models · audio-visual cues · future forecasting · cross-modal causal reasoning · temporal reasoning · +4 more
Jan 20, 2026 · 28

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documen...

vision-language model · OCR pipeline · distillation mix · pretraining · checkpoint averaging · +5 more
Jan 20, 2026 · 17

ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

Zheng Liu, Honglin Lin, Chonghan Qin +13 more

Chart reasoning is a critical capability for Vision Language Models (VLMs). However, the development of open-source models is severely hindered by the lack of high-quality training data. Existing datasets suffer from a dual challenge: synthetic charts are often simplistic and repetitive, while the a...

ChartVerse · Rollout Posterior Entropy · complexity-aware chart coder · truth-anchored inverse QA synthesis · answer-first paradigm · +4 more
Jan 20, 2026 · 6

GutenOCR: A Grounded Vision-Language Front-End for Documents

Hunter Heidenreich, Ben Elliott, Olivia Dinica +1 more

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and ...

vision-language models · fine-tuning · grounded OCR · prompt-based interface · document understanding · +3 more
Jan 20, 2026 · 29

XR: Cross-Modal Agents for Composed Image Retrieval

Zhongyu Yang, Wei Pang, Yingfang Yuan

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalitie...

compositional image retrieval · multi-agent framework · cross-modal generation · hybrid matching · targeted reasoning
Jan 20, 2026 · 9

Think3D: Thinking with Space for Spatial Reasoning

Zaibin Zhang, Yuhan Wu, Lianjie Jia +9 more

Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle ...

vision large models · 3D reconstruction models · point clouds · camera poses · spatial reasoning · +4 more
Jan 19, 2026 · 41

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Hao Luo, Ye Wang, Wanpeng Zhang +9 more

We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. While existing VLAs often struggle with morphological heterogeneity and data scarcity, we propose a human-centric learning paradigm that tr...

Vision-Language-Action · cross-embodiment generalization · human-centric learning · multimodal data · Unified Action Space · +4 more
Jan 19, 2026 · 68

MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Peizhou Huang, Zixuan Zhong, Zhongwei Wan +12 more

Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-...

multimodal evidence use · citation-grounded report generation · multimodal understanding · deep research agents · Formula-LLM Adaptive Evaluation · +2 more
Jan 18, 2026 · 46

Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

Honglin Lin, Chonghan Qin, Zheng Liu +7 more

While synthetic data has proven effective for improving scientific reasoning in the text domain, multimodal reasoning remains constrained by the difficulty of synthesizing scientifically rigorous images. Existing Text-to-Image (T2I) models often produce outputs that are visually plausible yet scient...

Text-to-Image models · Large Multimodal Models · scientific correctness · visual-logic divergence · ImgCoder · +5 more
Jan 17, 2026 · 24

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Ruiheng Zhang, Jingfeng Yao, Huangxuan Zhao +9 more

Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectu...

medical foundation models · autoregressive architectures · diffusion models · cross-modal self-attention · data cleaning pipeline · +3 more
Jan 16, 2026 · 17

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Qiyuan Zhang, Biao Gong, Shuai Tan +7 more

Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core tenet of classical mechanics. While computer graphics and physics-based simul...

reinforcement learning · video generation · physical collision rules · high-dimensional spaces · physics-aware · +2 more
Jan 16, 2026 · 8

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Yawar Siddiqui, Duncan Frost, Samir Aroudj +9 more

Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casuall...

3D shape generation · visual-inertial SLAM · 3D detection algorithms · vision-language models · rectified flow transformer · +1 more
Jan 16, 2026 · 14

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Linqing Zhong, Yi Liu, Yifei Wei +4 more

Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary ...

Vision-Language-Model · action space · action chain-of-thought · ACoT · Explicit Action Reasoner · +4 more
Jan 16, 2026 · 22
Page 9 of 11