Latest AI Agents Research Papers

Research on autonomous AI agents, tool use, planning, and systems that can take actions to accomplish goals.

154 Papers

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Yuanzhe Shen, Zisu Huang, Zhengyuan Wang +14 more

As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we i...

LLM-based agents, long-horizon benchmark, travel-planning scenarios, automated evaluation, multi-turn interactions, +5 more

Feb 2, 2026 · 5

Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

Shaohan Wang, Benfeng Xu, Licheng Zhang +5 more

Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dime...

Deep Research Agents, Wikipedia Good Articles, live benchmark, evaluation framework, fine-grained evaluation, +2 more

Feb 2, 2026 · 29

TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Hang Yan, Xinyu Che, Fangzhi Xu +7 more

Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing e...

Test-Time Improvement, autonomous LLM agents, iterative interaction, environmental interaction, task optimization efficiency, +3 more

Feb 2, 2026 · 28

RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

Jialiang Zhu, Gongrui Zhang, Xiaolong Ma +17 more

LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient...

ReAct framework, cross-trajectory exploration, structured state representation, iterative reflection, globally informed planning, +3 more

Feb 2, 2026 · 11

Closing the Loop: Universal Repository Representation with RPG-Encoder

Jane Luo, Chengyu Yin, Xin Zhang +10 more

Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: gene...

Repository Planning Graph, RPG-Encoder, code representation, semantic features, code dependencies, +5 more

Feb 2, 2026 · 68

SWE-Universe: Scale Real-World Verifiable Environments to Millions

Mouxiang Chen, Lei Zhang, Yunlong Feng +15 more

We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and proh...

building agent, self-verification, in-loop hacking detection, real-world software engineering, GitHub pull requests, +5 more

Feb 2, 2026 · 31

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Haozhen Zhang, Quanyu Long, Jianzhu Bao +4 more

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long...

Large Language Model, memory systems, learnable memory skills, evolvable memory, controller, +9 more

Feb 2, 2026 · 49

Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

Bohan Zeng, Kaixin Zhu, Daili Hua +24 more

World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, the current research landscape rema...

world models, physical dynamics, environment interaction, visual prediction, 3D estimation, +4 more

Feb 2, 2026 · 40

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam +3 more

Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We intro...

Monte Carlo Tree Search, MCTS, modular construction, comparative reflective memory, cross-branch transfer

Feb 2, 2026 · 33

daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently

Mohan Jiang, Dayuan Fu, Junhao Shi +8 more

While Large Language Models (LLMs) excel at short-term tasks, scaling them to long-horizon agentic workflows remains challenging. The core bottleneck lies in the scarcity of training data that captures authentic long-dependency structures and cross-stage evolutionary dynamics--existing synthesis met...

Large Language Models, long-horizon agentic workflows, training data, long-dependency structures, cross-stage evolutionary dynamics, +14 more

Feb 2, 2026 · 39

WideSeek: Advancing Wide Research via Multi-Agent Scaling

Ziyang Huang, Haolin Ren, Xiaowei Yuan +6 more

Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for...

Wide Research, Deep Research, search intelligence, WideSeekBench, GBIS, +6 more

Feb 2, 2026 · 12

Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

Siwei Wu, Yizhi Li, Yuyang Song +8 more

Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: Executability, since eac...

terminal trajectories, Docker environments, executability, verifiability, TerminalBench, +4 more

Feb 1, 2026 · 13

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Chuanzhe Guo, Jingjing Wu, Sijun He +10 more

The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-langua...

Large Language Model agents, software engineering, verifiable datasets, multi-agent Planning-Execution-Verification architecture, Environment Reuse Mechanism, +3 more

Jan 30, 2026 · 13

Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience

Zhongxiang Sun, Qipeng Wang, Weijie Yu +3 more

Deep search agents powered by large language models have demonstrated strong capabilities in multi-step retrieval, reasoning, and long-horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evol...

deep search agents, large language models, multi-step retrieval, reasoning, long-horizon task execution, +11 more

Jan 30, 2026 · 3

Exploring Reasoning Reward Model for Agents

Kaixuan Fan, Kaituo Feng, Manyuan Zhang +7 more

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to subop...

Agentic Reinforcement Learning, reward model, reasoning trace, critique, overall score, +5 more

Jan 29, 2026 · 19

ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Xiaoyu Tian, Haotian Wang, Shuaiting Chen +12 more

Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either super...

tool-call graphs, trajectory-level rewards, supervised fine-tuning, reinforcement learning, agent training, +5 more

Jan 29, 2026 · 39

BMAM: Brain-inspired Multi-Agent Memory Framework

Yang Li, Jiaxiang Liu, Yusong Wang +2 more

Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain-inspired Multi-Agent Memory), a genera...

language-model-based agents, extended interaction horizons, temporally grounded information, behavioral consistency, soul erosion, +9 more

Jan 28, 2026 · 3

SERA: Soft-Verified Efficient Repository Agents

Ethan Shen, Danny Tormoen, Saurabh Shah +2 more

Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical. We show it is now p...

supervised fine-tuning, coding agents, private codebases, Soft-Verified Generation, synthetic trajectories, +3 more

Jan 28, 2026 · 7

OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Le Zhang, Yixiong Xiao, Xinjiang Lu +12 more

Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on...

GUI agents, foundation models, autonomous task execution, data-construction pipeline, decoupled training paradigm, +11 more