MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu
Published: February 3, 2026
Authors: 15
Word Count: 29,012
Code: Includes code

Benchmarking memory capabilities of mobile GUI agents.

Abstract

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8% memory-related tasks and no cross-session learning evaluation. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications where 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources including code, benchmark, and evaluation results will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
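To make the pass@k idea in the abstract concrete, here is a minimal sketch under one common reading of the metric: a task counts as solved if at least one of the agent's first k attempts succeeds. The abstract does not spell out the exact definition used in MemGUI-Bench, so this reading is an assumption. The function name `pass_at_k`, the input layout, and the example outcomes are all hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    Assumption: this is one common reading of pass@k, not necessarily
    the exact definition used by MemGUI-Bench.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("no tasks given")
    # A task is solved at k if any of its first k attempt outcomes is True.
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical per-task attempt outcomes (illustrative only, not paper results).
example = {
    "task_a": [True, True, True],
    "task_b": [False, True, True],
    "task_c": [False, False, False],
    "task_d": [False, False, True],
}

for k in (1, 2, 3):
    print(f"pass@{k} = {pass_at_k(example, k):.2f}")
```

Under this reading, a gap between pass@1 and pass@k for larger k indicates tasks that the agent only solves on a later attempt, which is the kind of signal a cross-session learning evaluation would look at.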

Key Takeaways

  1. Introduces MemGUI-Bench to evaluate mobile GUI agent memory.

  2. Features a systematic memory taxonomy and automated pipeline.

  3. Uses a progressive scrutiny approach for efficient evaluation.

Limitations

  • Previous benchmarks lacked sufficient memory-intensive tasks.

  • Current benchmarks don't assess cross-session learning.

Keywords

memory-centric benchmark, LLM-as-judge evaluation, memory taxonomy, cross-temporal retention, cross-spatial retention, automated pipeline, Progressive Scrutiny, hierarchical metrics, state-of-the-art agents, failure modes, design implications
