Multimodal AI

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Tianyi Xu, Rong Shan, Junjie Wu, Jiadeng Huang, Teng Wang, Jiachen Zhu, Wenteng Chen, Minxin Tu, Quantao Dou, Zhaoxiang Wang, Changwang Zhang, Weinan Zhang, Jun Wang, Jianghao Lin

Published: March 2, 2026
Authors: 14
Word Count: 13,824

PhotoBench shifts photo retrieval from visual matching to personalized intent-driven multi-source reasoning.

Abstract

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots and fail to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic personal albums, designed to shift the paradigm from visual matching to personalized, intent-driven multi-source reasoning. Based on a rigorous multi-source profiling framework that integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems exhibit poor tool orchestration. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings and demands robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. Our PhotoBench is available.
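To make the multi-source profiling idea concrete, here is a minimal sketch of what a per-image profile and an intent-driven lookup might look like. All names (`PhotoProfile`, `retrieve`, the field names) are illustrative assumptions, not the paper's actual schema or implementation; the point is only that resolving such queries requires intersecting visual, temporal, spatial, and social constraints rather than ranking by visual similarity alone.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PhotoProfile:
    """One photo's multi-source profile: visual semantics plus
    spatial-temporal, social, and event metadata.
    (Field names are hypothetical, not PhotoBench's schema.)"""
    photo_id: str
    visual_tags: set         # visual semantics
    timestamp: datetime      # temporal metadata
    location: str            # spatial metadata
    people: set              # social identity
    event: str = ""          # temporal event label

def retrieve(album, tags=None, people=None, location=None, year=None):
    """Resolve an intent-driven query by intersecting constraints
    from several sources, instead of visual matching alone."""
    hits = []
    for p in album:
        if tags and not tags <= p.visual_tags:      # visual constraint
            continue
        if people and not people <= p.people:       # social constraint
            continue
        if location and p.location != location:     # spatial constraint
            continue
        if year and p.timestamp.year != year:       # temporal constraint
            continue
        hits.append(p.photo_id)
    return hits

# Toy album: only photo "a" satisfies all constraints of the query below.
album = [
    PhotoProfile("a", {"beach", "sunset"}, datetime(2023, 7, 4),
                 "Hawaii", {"Alice"}, "vacation"),
    PhotoProfile("b", {"beach"}, datetime(2022, 7, 4),
                 "Hawaii", {"Alice", "Bob"}),
]

# "Beach photos with Alice from 2023" → ['a']
result = retrieve(album, tags={"beach"}, people={"Alice"}, year=2023)
```

A query like "beach photos with Alice from 2023" is trivially expressible here as a conjunction of constraints, which is exactly what a pure visual-embedding matcher has no native mechanism to satisfy.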

Key Takeaways

  1. Personal photo retrieval requires multi-source reasoning beyond visual matching, integrating temporal, social, and spatial metadata.

  2. Current benchmarks lack ecological fidelity and use shallow queries, failing to capture real-world personalized photo album complexity.

  3. Unified embedding models collapse on non-visual constraints, while agentic systems struggle with multi-source fusion as query complexity increases.

Limitations

  • Unified embedding models function primarily as visual similarity calculators rather than holistic multi-source reasoners.

  • Agentic systems exhibit non-linear performance degradation and poor tool orchestration as query complexity increases.

Keywords

personalized photo retrieval, multi-source reasoning, intent-driven queries, visual semantics, spatial-temporal metadata, social identity, temporal events, unified embedding models, agentic reasoning systems, multi-source fusion
