AI Agents

The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi
arXiv ID
2601.08173
Published
January 13, 2026
Authors
10

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.
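To make the first evaluation dimension more concrete, the sketch below shows what context-aware scheduling for streaming tasks with varying priorities could look like in practice. It is a minimal, hypothetical illustration and is not taken from the EvoEnv codebase; the `StreamingScheduler` class, task names, and priority values are all assumptions for the example.

```python
import heapq
import itertools

# Hypothetical illustration: a minimal priority scheduler for tasks that
# arrive over time, in the spirit of the "context-aware scheduling for
# streaming tasks" dimension described in the abstract.

class StreamingScheduler:
    """Keeps a priority queue of pending tasks; lower number = more urgent."""

    def __init__(self):
        self._queue = []                   # heap of (priority, arrival_order, task)
        self._counter = itertools.count()  # tie-breaker preserving arrival order

    def submit(self, task: str, priority: int) -> None:
        """Register a newly arrived task with its priority."""
        heapq.heappush(self._queue, (priority, next(self._counter), task))

    def next_task(self) -> str | None:
        """Pop the most urgent pending task, or None if nothing is pending."""
        if not self._queue:
            return None
        _, _, task = heapq.heappop(self._queue)
        return task


# Usage: tasks stream in with mixed priorities; the agent always handles
# the most urgent one first, regardless of arrival order.
scheduler = StreamingScheduler()
scheduler.submit("reply to customer email", priority=1)
scheduler.submit("archive old reports", priority=3)
scheduler.submit("escalate server outage", priority=0)

while (task := scheduler.next_task()) is not None:
    print("handling:", task)
```

A real agent in a dynamic environment would also need to re-prioritize as context changes (e.g., a deadline approaching), which is exactly the kind of behavior the benchmark is designed to probe; this sketch only covers the static priority-queue core.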
