Latest Multimodal AI Research Papers

Research on AI systems that process multiple types of data, including vision-language models and cross-modal understanding.

205 Papers


Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

Jie Zhu, Yiyang Su, Xiaoming Liu

Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boo...

Jan 11, 2026

KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

Tingyu Wu, Zhisheng Chen, Ziyan Weng +8 more

Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, ...

Jan 8, 2026

Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar +2 more

Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed pos...

Jan 8, 2026

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou +1 more

Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring sign...

Jan 7, 2026

Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan +2 more

Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured a...

Jan 3, 2026
