Latest AI Agents Research Papers

Research on autonomous AI agents, tool use, planning, and systems that can take actions to accomplish goals.

154 Papers

$OneMillion-Bench: How Far are Language Agents from Human Experts?

Qianyu Yang, Yang Liu, Jiaqi Li +20 more

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce $OneMillion-Bench OneM...

Mar 9, 2026 · 20

DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Maojun Sun, Yue Wu, Yifei Xie +5 more

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore dat...

retrieval-augmented approaches, function-level semantics, data distribution, R Package Knowledge Base, embedding model, +5 more

Mar 5, 2026 · 45

Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces

Karan Gupta, Pranav Vajreshwari, Yash Pandya +3 more

Agentic systems operating over large tool ecosystems must plan and execute long-horizon workflows under weak or non-verifiable supervision. While frontier models mitigate these challenges through scale and large context budgets, small language models (SLMs) remain brittle: eager tool loading saturat...

reinforcement fine-tuning, context control, execution structure, tool orchestration, programmatic tool orchestration, +5 more

Mar 5, 2026 · 12

Interactive Benchmarks

Baoqing Yue, Zihan Zhu, Yifan Zhang +3 more

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm tha...

interactive benchmarks, model intelligence, active information acquisition, reasoning ability, budget constraints, +3 more

Mar 5, 2026 · 18

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Zhenting Wang, Huancheng Chen, Jiayun Wang +1 more

Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool outputs and intermediate reasoning in-context quickly becomes infeasible: the working context becomes prohibitively long, eventually exceeds the cont...

large language model agents, context windows, experience memory, indexed memory, reinforcement learning, +6 more

Mar 4, 2026 · 14

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao +12 more

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce Bey...

code agents, benchmarks, cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, +8 more

Mar 3, 2026 · 50

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Dadi Guo, Yuejin Xie, Qingyu Liu +7 more

As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic codin...

large language models, mathematical capabilities, code agents, agentic coding, reasoning, +3 more

Mar 3, 2026 · 13

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Jinpeng Chen, Cheng Gong, Hanbo Li +9 more

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework de...

post-training data synthesis, supervised fine-tuning, reinforcement learning, task constraints, trajectory generation, +4 more

Mar 2, 2026 · 20

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov +1 more

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number o...

reinforcement learning, software engineering agents, SWE-bench, LLM judges, reproducible execution, +5 more

Feb 27, 2026 · 47

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Zhiheng Song, Jingshuai Zhang, Chuan Qin +6 more

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse rout...

route-planning agents, large language models, MobilityBench, API-replay sandbox, deterministic environment, +7 more

Feb 26, 2026 · 103

General Agent Evaluation

Elron Bandel, Asaf Yehudai, Lilach Eden +12 more

The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capa...

general-purpose agents, agentic benchmarks, unified protocol, Exgentic framework, Open General Agent Leaderboard

Feb 26, 2026 · 11

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Qianben Chen, Tianrui Qin, King Zhu +21 more

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less...

deep research agents, reasoning depth, inference cost, search-intensive scenarios, generalization, +9 more

Feb 26, 2026 · 20

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Zeyuan Liu, Jeonghye Kim, Xufang Luo +2 more

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EM...

reinforcement learning, large language model agents, exploration, memory augmentation, on-policy updates, +3 more

Feb 26, 2026 · 33

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Yutong Wang, Siyuan Xiong, Xuebo Liu +4 more

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We ...

multi-agent systems, test-time rectify-or-reject pruning, retrieval-augmented rectifier, failure-driven indicator pool, distilled failure patterns, +3 more

Feb 26, 2026 · 27

SkillNet: Create, Evaluate, and Connect AI Skills

Yuan Liang, Ruobin Zhong, Haoming Xu +46 more

Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently "reinvent the wheel", rediscovering solutions in ...

AI agents, skill consolidation, tool invocation, long-term advancement, unified mechanism, +9 more

Feb 26, 2026 · 68

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Rui Yang, Qianhui Wu, Zhaoyang Wang +8 more

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI...

GUI agents, action-aligned reasoning data, post-training pipelines, SFT, CoT reasoning, +9 more

Feb 25, 2026 · 15

SkillOrchestra: Learning to Route Agents via Skill Transfer

Jiayu Wang, Yifei Ming, Zixuan Ke +3 more

Compound AI systems promise capabilities beyond those of individual models, yet their success depends critically on effective orchestration. Existing routing approaches face two limitations: (1) input-level routers make coarse query-level decisions that ignore evolving task requirements; (2) RL-trai...

compound AI systems, orchestration, routing policy, reinforcement learning, skill modeling, +5 more

Feb 23, 2026 · 46

Computer-Using World Model

Yiming Guan, Rui Yu, John Zhang +15 more

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real exe...

world model, user interface, UI state, action search, reinforcement learning, +5 more

Feb 19, 2026 · 12

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an...

large language models, sequential decision-making, cost-uncertainty tradeoffs, environment exploration, Calibrate-Then-Act framework, +5 more

Feb 18, 2026 · 14

Discovering Multiagent Learning Algorithms with Large Language Models

Zun Li, John Schultz, Daniel Hennes +1 more

Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid ...

Multi-Agent Reinforcement Learning, imperfect-information games, Counterfactual Regret Minimization, Policy Space Response Oracles, evolutionary coding, +10 more

Feb 18, 2026 · 11
