Multimodal AI

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka
Published: February 5, 2026
Authors: 9
Word Count: 7,591
Code: Includes code

Revolutionizing image search with visual-driven reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven: they rely on static visual encodings and cannot actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.
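The alternation between hypothesis generation and targeted visual verification can be illustrated with a toy reranking loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the tools, the stopping rule, or how evidence updates a ranking, so `Candidate`, `VisualTool`, `rerank_with_evidence`, the `margin` threshold, and the additive score update are all hypothetical. In the actual framework the MLLM itself decides when to call a tool; here a simple score margin stands in for that decision.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    cand_id: str
    score: float  # initial, language-driven relevance score (assumed given)
    evidence: List[str] = field(default_factory=list)


# Hypothetical tool interface: (query, candidate id) -> score adjustment.
# It stands in for whatever external visual tool the agent would invoke.
VisualTool = Callable[[str, str], float]


def rerank_with_evidence(
    query: str,
    candidates: List[Candidate],
    tool: VisualTool,
    margin: float = 0.1,
    max_steps: int = 3,
) -> List[Candidate]:
    """Rerank candidates, inspecting the top two only while they are ambiguous.

    Note: candidate scores and evidence lists are updated in place.
    """
    # Hypothesis: the current best-first ordering.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    for step in range(max_steps):
        # Stop once the leading hypothesis is clearly ahead (assumed rule).
        if len(ranked) < 2 or ranked[0].score - ranked[1].score >= margin:
            break
        # Verification: gather visual evidence for the two contenders only.
        for cand in ranked[:2]:
            delta = tool(query, cand.cand_id)
            cand.score += delta
            cand.evidence.append(f"step {step}: tool adjusted score by {delta:+.2f}")
        # Revise the hypothesis with the new evidence.
        ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked
```

With a tool that favors one of two near-tied candidates, the loop runs a single verification step and then stops, while a clearly separated ranking triggers no tool calls at all. That selectivity is the point of the sketch: visual inspection is spent only where the language-driven scores are ambiguous.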

Key Takeaways

  1. V-Retrver uses visual inspection to improve image retrieval accuracy.

  2. Incorporates dynamic visual tools for targeted evidence acquisition.

  3. Enhances multimodal retrieval with agentic reasoning capabilities.

Limitations

  • Requires external visual tools for evidence verification.

  • May be computationally intensive due to dynamic visual processing.

Keywords

Multimodal Large Language Models, Chain-of-Thought reasoning, multimodal retrieval, visual encodings, evidence-driven retrieval, agentic reasoning, visual inspection, multimodal interleaved reasoning, curriculum-based learning, supervised reasoning activation, rejection-based refinement, reinforcement learning, evidence-aligned objective
