Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu
Published: February 9, 2026
Authors: 13
Word Count: 6,988

A planner-executor agent for high-fidelity, multi-turn image editing in professional workflows.

Abstract

We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
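The abstract does not spell out how Image Layer Decomposition keeps non-target regions intact, so the following is only a minimal sketch of the general idea, assuming a binary mask that marks the target layer. The function names (`composite_layer`, `outside_mask_changed`) and the tiny grayscale grids are hypothetical illustrations, not the authors' implementation.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values (illustrative only)
Mask = List[List[bool]]  # True where the target layer lies (assumed to be given)


def composite_layer(original: Image, edited: Image, mask: Mask) -> Image:
    """Take edited pixels inside the mask and copy original pixels everywhere else."""
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("original, edited, and mask must have the same height")
    out: Image = []
    for o_row, e_row, m_row in zip(original, edited, mask):
        if not (len(o_row) == len(e_row) == len(m_row)):
            raise ValueError("original, edited, and mask must have the same width")
        out.append([e if m else o for o, e, m in zip(o_row, e_row, m_row)])
    return out


def outside_mask_changed(before: Image, after: Image, mask: Mask) -> int:
    """Count pixels outside the mask whose value differs between the two images."""
    return sum(
        1
        for b_row, a_row, m_row in zip(before, after, mask)
        for b, a, m in zip(b_row, a_row, m_row)
        if not m and b != a
    )


# Hypothetical case: the editor repainted the whole frame (an over-edit),
# but only the middle column belongs to the target layer.
original = [[10, 10, 10], [10, 10, 10]]
edited = [[99, 99, 99], [99, 99, 99]]
mask = [[False, True, False], [False, True, False]]

result = composite_layer(original, edited, mask)
print(result)                                        # [[10, 99, 10], [10, 99, 10]]
print(outside_mask_changed(original, edited, mask))  # 4 (raw edit touches the background)
print(outside_mask_changed(original, result, mask))  # 0 (composite leaves it untouched)
```

Because the composite copies background pixels directly from the original, the output keeps the source resolution regardless of the resolution at which the edit itself was produced, assuming the edited layer is resized to match beforehand.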

Key Takeaways

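  • Agent Banana is a hierarchical planner-executor agent framework for high-fidelity, object-aware image editing.

  • It relies on two mechanisms: Context Folding, which compresses long interaction histories into structured memory, and Image Layer Decomposition, which confines edits to localized layers.

  • It targets three problems in existing instruction-based editors: over-editing, loss of object faithfulness across multi-turn edits, and evaluation at around 1K resolution instead of 4K.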
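Context Folding is described only as compressing long interaction histories into structured memory. One way to picture that, as a rough sketch under stated assumptions, is to keep the most recent turns verbatim and fold older turns into a per-object record of past edits. The `Turn` and `FoldedContext` types, the `fold_history` function, and the `keep_recent` parameter are all hypothetical names chosen for illustration; the paper's actual memory structure may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    target: str       # object the instruction refers to
    instruction: str  # what the user asked for


@dataclass
class FoldedContext:
    # Older turns, folded into a map from object name to its past edits.
    memory: Dict[str, List[str]] = field(default_factory=dict)
    # The most recent turns, kept verbatim.
    recent: List[Turn] = field(default_factory=list)


def fold_history(history: List[Turn], keep_recent: int = 2) -> FoldedContext:
    """Keep the last `keep_recent` turns as-is and fold earlier turns by target object."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    split = max(len(history) - keep_recent, 0)
    ctx = FoldedContext(recent=list(history[split:]))
    for turn in history[:split]:
        ctx.memory.setdefault(turn.target, []).append(turn.instruction)
    return ctx


# Hypothetical four-turn editing dialogue.
history = [
    Turn("car", "make it red"),
    Turn("sky", "make it sunset"),
    Turn("car", "add racing stripes"),
    Turn("car", "remove the antenna"),
]

ctx = fold_history(history, keep_recent=1)
print(ctx.memory)  # {'car': ['make it red', 'add racing stripes'], 'sky': ['make it sunset']}
print(ctx.recent)  # [Turn(target='car', instruction='remove the antenna')]
```

Grouping folded edits by object is a design choice made here so that a planner could look up what has already happened to a given object without rereading the full dialogue.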

Limitations

  • Complex setup may require significant computational resources.

  • Dependent on the effectiveness of Vision-Language Models.

Keywords

agentic planner-executor framework, context folding, image layer decomposition, multi-turn editing, high-fidelity editing, object-aware editing, deliberative editing, HDD-Bench, dialogue-based benchmark, native-resolution outputs, long-horizon control, stepwise targets
