Latest AI Agents Research Papers

Research on autonomous AI agents, tool use, planning, and systems that can take actions to accomplish goals.

154 Papers

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Xavier Hu, Jinxiang Xia, Shengze Xu +13 more

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for c...

long-horizon planning, LLM-based agents, continuous plan-and-execute decision making, interactive economies, persistent economic dynamics, +4 more
Feb 10, 2026 · 8

Chain of Mindset: Reasoning with Adaptive Cognitive Modes

Tianyi Jiang, Arctanx An, Hengyi Feng +12 more

Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning ...

Chain of Mindset, CoM, agentic framework, step-level adaptive mindset orchestration, Spatial mindset, +7 more
Feb 10, 2026 · 62

TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution

Deyang Jiang, Jing Huang, Xuanle Zhao +6 more

Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/des...

GUI automation, GUI planning, tree-structured trajectories, multi-agent collaborative framework, adaptive exploration algorithm, +3 more
Feb 10, 2026 · 4

G-LNS: Generative Large Neighborhood Search for LLM-Based Automatic Heuristic Design

Baoyun Zhao, He Wang, Liang Zeng

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing approaches typically formulate AHD around constructive priority rules or parameterized local search guidance, thereby restricting the search space to fixed heuristic forms. Such designs offer...

Large Language Models, Automated Heuristic Design, Combinatorial Optimization Problems, Large Neighborhood Search, generative evolutionary framework, +5 more
Feb 9, 2026 · 17

InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

Shiyang Feng, Runmin Ma, Xiangchao Yan +54 more

We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supporte...

scientific discovery, computational modeling, laboratory experimentation, unified system, deep research, +5 more
Feb 9, 2026 · 15

GISA: A Benchmark for General Information-Seeking Assistant

Yutao Zhu, Xingshuo Zhang, Maosen Zhang +9 more

The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construc...

large language models, search agents, information-seeking assistants, benchmarks, deep reasoning, +3 more
Feb 9, 2026 · 17

Dreaming in Code for Curriculum Learning in Open-Ended Worlds

Konstantinos Mitsides, Maxence Faldor, Antoine Cully

Open-ended learning frames intelligence as emerging from continual interaction with an ever-expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather ...

foundation models, open-ended learning, environment synthesis, curriculum control, long-horizon progression, +1 more
Feb 9, 2026 · 6

TodoEvolve: Learning to Architect Agent Planning Systems

Jiaxi Liu, Yanzuo Jiang, Guibin Zhang +5 more

Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address t...

meta-planning, planning architectures, PlanFactory, Impedance-Guided Preference Optimization, IGPO, +3 more
Feb 8, 2026 · 3

LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth

Weihao Zeng, Yuzhen Huang, Junxian He

Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that eval...

large language models, context rot, long-context benchmarks, language agents, LOCA-bench, +3 more
Feb 8, 2026 · 21

AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research

Yishan Li, Wentong Chen, Yukun Yan +12 more

Generating deep research reports requires large-scale information acquisition and the synthesis of insight-driven analysis, posing a significant challenge for current language models. Most existing approaches follow a plan-then-write paradigm, whose performance heavily depends on the quality of the ...

Writing As Reasoning Policy, WARP, Evidence-Based Drafting, Reasoning-Driven Deepening, Multi-Stage Agentic Training, +6 more
Feb 6, 2026 · 11

ANCHOR: Branch-Point Data Generation for GUI Agents

Jinbiao Wei, Yilun Zhao, Kangqi Ni +1 more

End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data, yet collecting human demonstrations is expensive and existing synthetic pipelines often suffer from limited task diversity or noisy, goal-drifting trajectories. We present a trajectory expansi...

trajectory expansion, GUI agents, desktop environments, seed demonstrations, branch points, +6 more
Feb 6, 2026 · 5

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster +34 more

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, m...

AI Research Science Benchmark, agentic capabilities, research lifecycle, sequential scaffolds, parallel scaffolds, +1 more
Feb 6, 2026 · 12

TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Kaijie Zhu, Yuzhou Nie, Yijiang Li +10 more

Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories are not diverse and scalable, while trajectories...

instruction tuning, expert trajectories, distributional mismatch, multi-agent refinement loop, Generator-Critic protocol, +3 more
Feb 6, 2026 · 20

ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training

Dunwei Tu, Hongyan Hao, Hansi Yang +10 more

Training generalist agents capable of adapting to diverse scenarios requires interactive environments for self-exploration. However, interactive environments remain critically scarce, and existing synthesis methods suffer from significant limitations regarding environmental diversity and scalability...

generalist agents, interactive environments, self-exploration, procedural testing, tool dependency graph expansion, +5 more
Feb 6, 2026 · 10

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Xianyang Liu, Shangding Gu, Dawn Song

Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation fra...

multi-agent negotiation, language-mediated interaction, strategic reasoning, economic interaction, buyer-seller markets, +3 more
Feb 5, 2026 · 3

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Haozhen Zhang, Haodong Yue, Tao Feng +8 more

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a nat...

runtime memory utilization, query-aware performance-cost control, memory modules, budget tiers, lightweight router, +5 more
Feb 5, 2026 · 27

Reinforcement World Model Learning for LLM-based Agents

Xiao Yu, Baolin Peng, Ruize Xu +6 more

Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinfo...

world-modeling, reinforcement learning, action-conditioned world models, sim-to-real gap rewards, next-state token prediction, +4 more
Feb 5, 2026 · 18

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Tiansheng Hu, Yilun Zhao, Canyu Zhang +2 more

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent wo...

LLM-based retrievers, deep research agents, scientific literature retrieval, SAGE benchmark, ReasonIR, +4 more
Feb 5, 2026 · 11

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Zhenxiong Yu, Zhi Yang, Zhiheng Jin +19 more

As large language models (LLMs) evolve into autonomous agents, their real-world applicability has expanded significantly, accompanied by new security challenges. Most existing agent defense mechanisms adopt a mandatory checking paradigm, in which security validation is forcibly triggered at predefin...

large language models, autonomous agents, security challenges, mandatory checking paradigm, event-driven defense, +7 more
Feb 5, 2026 · 64