Latest AI Agents Research Papers

Research on autonomous AI agents, tool use, planning, and systems that can take actions to accomplish goals.

154 Papers

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Yuanzhe Shen, Zisu Huang, Zhengyuan Wang +14 more

As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we i...

LLM-based agents, long-horizon benchmark, travel-planning scenarios, automated evaluation, multi-turn interactions, +5 more
Feb 2, 2026 | 5

Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

Shaohan Wang, Benfeng Xu, Licheng Zhang +5 more

Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dime...

Deep Research Agents, Wikipedia Good Articles, live benchmark, evaluation framework, fine-grained evaluation, +2 more
Feb 2, 2026 | 29

TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Hang Yan, Xinyu Che, Fangzhi Xu +7 more

Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing e...

Test-Time Improvement, autonomous LLM agents, iterative interaction, environmental interaction, task optimization efficiency, +3 more
Feb 2, 2026 | 28

RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

Jialiang Zhu, Gongrui Zhang, Xiaolong Ma +17 more

LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient...

ReAct framework, cross-trajectory exploration, structured state representation, iterative reflection, globally informed planning, +3 more
Feb 2, 2026 | 11

Closing the Loop: Universal Repository Representation with RPG-Encoder

Jane Luo, Chengyu Yin, Xin Zhang +10 more

Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: gene...

Repository Planning Graph, RPG-Encoder, code representation, semantic features, code dependencies, +5 more
Feb 2, 2026 | 68

SWE-Universe: Scale Real-World Verifiable Environments to Millions

Mouxiang Chen, Lei Zhang, Yunlong Feng +15 more

We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and proh...

building agent, self-verification, in-loop hacking detection, real-world software engineering, GitHub pull requests, +5 more
Feb 2, 2026 | 31

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Haozhen Zhang, Quanyu Long, Jianzhu Bao +4 more

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long...

Large Language Model, memory systems, learnable memory skills, evolvable memory, controller, +9 more
Feb 2, 2026 | 49

Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

Bohan Zeng, Kaixin Zhu, Daili Hua +24 more

World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, the current research landscape rema...

world models, physical dynamics, environment interaction, visual prediction, 3D estimation, +4 more
Feb 2, 2026 | 40

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam +3 more

Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We intro...

Monte Carlo Tree Search, MCTS, modular construction, comparative reflective memory, cross-branch transfer
Feb 2, 2026 | 33

daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently

Mohan Jiang, Dayuan Fu, Junhao Shi +8 more

While Large Language Models (LLMs) excel at short-term tasks, scaling them to long-horizon agentic workflows remains challenging. The core bottleneck lies in the scarcity of training data that captures authentic long-dependency structures and cross-stage evolutionary dynamics--existing synthesis met...

Large Language Models, long-horizon agentic workflows, training data, long-dependency structures, cross-stage evolutionary dynamics, +14 more
Feb 2, 2026 | 39

Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

Siwei Wu, Yizhi Li, Yuyang Song +8 more

Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: Executability, since eac...

terminal trajectories, Docker environments, executability, verifiability, TerminalBench, +4 more
Feb 1, 2026 | 13

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Chuanzhe Guo, Jingjing Wu, Sijun He +10 more

The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-langua...

Large Language Model agents, software engineering, verifiable datasets, multi-agent Planning-Execution-Verification architecture, Environment Reuse Mechanism, +3 more
Jan 30, 2026 | 13

Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience

Zhongxiang Sun, Qipeng Wang, Weijie Yu +3 more

Deep search agents powered by large language models have demonstrated strong capabilities in multi-step retrieval, reasoning, and long-horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evol...

deep search agents, large language models, multi-step retrieval, reasoning, long-horizon task execution, +11 more
Jan 30, 2026 | 3

ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Xiaoyu Tian, Haotian Wang, Shuaiting Chen +12 more

Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either super...

tool-call graphs, trajectory-level rewards, supervised fine-tuning, reinforcement learning, agent training, +5 more
Jan 29, 2026 | 39

BMAM: Brain-inspired Multi-Agent Memory Framework

Yang Li, Jiaxiang Liu, Yusong Wang +2 more

Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain-inspired Multi-Agent Memory), a genera...

language-model-based agents, extended interaction horizons, temporally grounded information, behavioral consistency, soul erosion, +9 more
Jan 28, 2026 | 3

SERA: Soft-Verified Efficient Repository Agents

Ethan Shen, Danny Tormoen, Saurabh Shah +2 more

Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training have kept this advantage theoretical. We show it is now p...

supervised fine-tuning, coding agents, private codebases, Soft-Verified Generation, synthetic trajectories, +3 more
Jan 28, 2026 | 7

OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Le Zhang, Yixiong Xiao, Xinjiang Lu +12 more

Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on...

GUI agents, foundation models, autonomous task execution, data-construction pipeline, decoupled training paradigm, +11 more
Jan 28, 2026 | 3

AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Shicheng Fang, Yuxin Wang, XiaoRan Liu +5 more

The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non...

Large Language Models, autonomous agents, dynamic contexts, AgentLongBench, Lateral Thinking Puzzles, +7 more
Jan 28, 2026 | 18