
AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He
Published: February 26, 2026 · Authors: 13 · Word Count: 11,386

The AgentVista benchmark exposes critical gaps in multimodal agents' ability to solve realistic, visually grounded multi-step tasks.

Abstract

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
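The hybrid tool use described in the abstract implies an agent loop that alternates model turns with tool executions until the model commits to an answer. Below is a minimal sketch of such a loop, assuming a generic `model.generate` client and a `tools` dict mapping names like `web_search` or `run_code` to callables; none of these names come from the paper itself.

```python
# A minimal sketch (not the paper's evaluation harness) of a long-horizon
# multimodal tool loop. The tool names and the `model.generate` client
# interface are assumptions made for illustration.

MAX_TURNS = 30  # hard AgentVista instances can need more than 25 turns

def solve_task(model, tools: dict, question: str, images: list) -> str:
    """Iterate model <-> tool turns until the model stops calling tools."""
    messages = [{"role": "user", "content": question, "images": images}]
    for _ in range(MAX_TURNS):
        reply = model.generate(messages, tools=list(tools))  # assumed client
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:               # no tool requested: final answer
            return reply["content"]
        # Dispatch to web_search / image_search / open_page / run_code, etc.
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": result})
    return "UNRESOLVED"  # ran out of the turn budget
```

The explicit turn budget mirrors the paper's observation that hard instances can require more than 25 tool-calling turns.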

Key Takeaways

  1. AgentVista is a benchmark with 209 tasks requiring multimodal agents to combine visual reasoning with long-horizon tool use across realistic scenarios.

  2. Existing multimodal agents struggle significantly, with even the best model achieving only 27.3% accuracy on AgentVista's challenging tasks.

  3. The benchmark construction process rigorously filtered 300,000 candidate images down to 209 verified tasks through agent-centric and human-expert validation (a minimal sketch of such a pipeline follows this list).
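As referenced in takeaway 3, here is a minimal sketch of that staged filtering, under the assumption that each validation stage reduces to a predicate over a candidate; the function names are hypothetical and not the authors' pipeline.

```python
# A minimal sketch, assuming two predicate functions, of the staged
# filtering described in takeaway 3. The names are illustrative only.

def build_benchmark(candidates, agent_centric_check, expert_verify):
    """Winnow raw candidates through automated, then human, validation."""
    # Stage 1: cheap agent-centric checks run over all ~300,000 candidates
    survivors = [c for c in candidates if agent_centric_check(c)]
    # Stage 2: human experts verify only the automated survivors
    return [c for c in survivors if expert_verify(c)]
```

Ordering the cheap automated stage before the expensive human stage is what makes a 300,000-to-209 reduction tractable.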

Limitations

  • Current state-of-the-art multimodal agents fail on most AgentVista tasks, leaving substantial gaps in long-horizon tool use capabilities.

  • The benchmark focuses on 7 specific categories and may not cover all real-world multimodal agent use cases comprehensively.

Keywords

multimodal agents, visual reasoning, tool use, benchmark, long-horizon tasks, multimodal tool interactions, hybrid tool use, visual scenarios
