Multimodal AI

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

Tianyi Xu, Rong Shan, Junjie Wu, Jiadeng Huang, Teng Wang, Jiachen Zhu, Wenteng Chen, Minxin Tu, Quantao Dou, Zhaoxiang Wang, Changwang Zhang, Weinan Zhang, Jun Wang, Jianghao Lin

Published: March 2, 2026
Authors: 14
Word Count: 13,824

PhotoBench shifts photo retrieval from visual matching to personalized intent-driven multi-source reasoning.

Abstract

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots and fail to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic personal albums, designed to shift the paradigm from visual matching to personalized, intent-driven multi-source reasoning. Based on a rigorous multi-source profiling framework that integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems exhibit poor tool orchestration. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings and demands robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. Our PhotoBench is available.
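To make the multi-source profiling idea concrete, here is a minimal sketch of what a per-image profile and an intent-driven lookup might look like. All names (`PhotoProfile`, `retrieve`, the field names) are illustrative assumptions, not the paper's actual schema or implementation; the point is only that resolving such queries requires intersecting visual, temporal, spatial, and social constraints rather than ranking by visual similarity alone.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PhotoProfile:
    """One photo's multi-source profile: visual semantics plus
    spatial-temporal, social, and event metadata.
    (Field names are hypothetical, not PhotoBench's schema.)"""
    photo_id: str
    visual_tags: set         # visual semantics
    timestamp: datetime      # temporal metadata
    location: str            # spatial metadata
    people: set              # social identity
    event: str = ""          # temporal event label

def retrieve(album, tags=None, people=None, location=None, year=None):
    """Resolve an intent-driven query by intersecting constraints
    from several sources, instead of visual matching alone."""
    hits = []
    for p in album:
        if tags and not tags <= p.visual_tags:      # visual constraint
            continue
        if people and not people <= p.people:       # social constraint
            continue
        if location and p.location != location:     # spatial constraint
            continue
        if year and p.timestamp.year != year:       # temporal constraint
            continue
        hits.append(p.photo_id)
    return hits

# Toy album: only photo "a" satisfies all constraints of the query below.
album = [
    PhotoProfile("a", {"beach", "sunset"}, datetime(2023, 7, 4),
                 "Hawaii", {"Alice"}, "vacation"),
    PhotoProfile("b", {"beach"}, datetime(2022, 7, 4),
                 "Hawaii", {"Alice", "Bob"}),
]

# "Beach photos with Alice from 2023" → ['a']
result = retrieve(album, tags={"beach"}, people={"Alice"}, year=2023)
```

A query like "beach photos with Alice from 2023" is trivially expressible here as a conjunction of constraints, which is exactly what a pure visual-embedding matcher has no native mechanism to satisfy.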

Key Takeaways

  1. Personal photo retrieval requires multi-source reasoning beyond visual matching, integrating temporal, social, and spatial metadata.

  2. Current benchmarks lack ecological fidelity and use shallow queries, failing to capture real-world personalized photo album complexity.

  3. Unified embedding models collapse on non-visual constraints, while agentic systems struggle with multi-source fusion as query complexity increases.

Limitations

  • Unified embedding models function primarily as visual similarity calculators rather than holistic multi-source reasoners.

  • Agentic systems exhibit non-linear performance degradation and poor tool orchestration as query complexity increases.

Keywords

personalized photo retrieval, multi-source reasoning, intent-driven queries, visual semantics, spatial-temporal metadata, social identity, temporal events, unified embedding models, agentic reasoning systems, multi-source fusion
