Latest AI Agents Research Papers

Research on autonomous AI agents, tool use, planning, and systems that can take actions to accomplish goals.

154 Papers

Showing 20 of 20 papers

Discovering Multiagent Learning Algorithms with Large Language Models

Zun Li, John Schultz, Daniel Hennes +1 more

Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid ...

Multi-Agent Reinforcement Learningimperfect-information gamesCounterfactual Regret MinimizationPolicy Space Response Oraclesevolutionary coding+10 more

Feb 18, 202611

"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing

Johannes Kirmayr, Raphael Wennmacher, Khanh Huynh +3 more

Agentic AI assistants that autonomously perform multi-step tasks raise open questions for user experience: how should such systems communicate progress and reasoning during extended operations, especially in attention-critical contexts such as driving? We investigate feedback timing and verbosity fr...

agentic AI assistantsmulti-step tasksuser experiencefeedback timingverbosity+7 more

Feb 17, 202613

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5 Team, Aohan Zeng, Xin Lv +183 more

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintain...

DSAasynchronous reinforcement learningagentic engineeringvibe codingARC capabilities+5 more

Feb 17, 202657

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline impleme...

ResearchGymAI agentsend-to-end researchICMLICLR+14 more

Feb 16, 202616

AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines

Yifan Wu, Yiran Peng, Yiyu Chen +12 more

The performance of autonomous Web GUI agents heavily relies on the quality and quantity of their training data. However, a fundamental bottleneck persists: collecting interaction trajectories from real-world websites is expensive and difficult to verify. The underlying state transitions are hidden, ...

Finite State Machinescoding agentsweb environmentsautomated search-and-verify pipelinesynthetic data+3 more

Feb 15, 202643

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Zheng Chu, Xiao Wang, Jack Hong +11 more

Large language models are transitioning from generalpurpose knowledge engines to realworld problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of highquality search trajectories and reward signals, arising from the diffi...

task synthesisdual-constrained optimizationgraph topologyevidence dispersiontool-augmented queries+9 more

Feb 15, 202610

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Haiyang Xu, Xi Zhang, Haowei Liu +16 more

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves...

GUI agent modelUI understandingtrajectory generationsimulated environmentscloud-based sandbox environments+5 more

Feb 15, 202638

Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

Ming Li, Xirui Li, Tianyi Zhou

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agen...

large language model agentsnetworked environmentsAI agent societiessemantic stabilizationlexical turnover+5 more

Feb 15, 202624

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Chen Yang, Guangyue Peng, Jiaying Zhu +12 more

We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a s...

unified generalist language modelreward modelingreinforcement learningtool-call turnsdeep search+9 more

Feb 13, 202618

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu +37 more

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and determinist...

agent skillsLLM agentsSkillsBenchprocedural knowledgecurated Skills+4 more

Feb 13, 202645

Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use

Hanbing Liu, Chunhao Tian, Nan An +4 more

We study budget-constrained tool-augmented agents, where a large language model must solve multi-step tasks by invoking external tools under a strict monetary budget. We formalize this setting as sequential decision making in context space with priced and stochastic tool executions, making direct pl...

tool-augmented agentslarge language modelsequential decision makingcontext spacestochastic tool executions+3 more

Feb 12, 20263

Learning to Configure Agentic AI Systems

Aditya Taparia, Som Sagar, Ransalu Senanayake

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed large templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbers...

LLM-based agent systemsreinforcement learninghierarchical policyquery-wise decision problemagent configuration+6 more

Feb 12, 202614

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Romain Froger, Pierre Andrews, Matteo Bettini +21 more

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constrai...

large language model agentsasynchronous environmentstemporal constraintsmulti-agent collaborationwrite-action verifier+4 more

Feb 12, 20268

Intelligent AI Delegation

Nenad Tomašev, Matija Franklin, Simon Osindero

AI agents are able to tackle increasingly complex tasks. To achieve more ambitious goals, AI agents need to be able to meaningfully decompose problems into manageable sub-components, and safely delegate their completion across to other AI agents and humans alike. Yet, existing task decomposition and...

task decompositiondelegationadaptive frameworkauthority transfertrust mechanisms+1 more

Feb 12, 202610

LawThinker: A Deep Research Legal Agent in Dynamic Environments

Xinyu Yang, Chenlong Deng, Tongyu Wen +2 more

Legal reasoning requires not only correct outcomes but also procedurally compliant reasoning processes. However, existing methods lack mechanisms to verify intermediate reasoning steps, allowing errors such as inapplicable statute citations to propagate undetected through the reasoning chain. To add...

legal reasoningautonomous agentExplore-Verify-Memorize strategyDeepVerifier moduleknowledge accuracy+4 more

Feb 12, 202631

CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

Yusong Lin, Haiyang Wang, Shuzhe Wu +4 more

Agentic coding requires agents to effectively interact with runtime environments, e.g., command line interfaces (CLI), so as to complete tasks like resolving dependency issues, fixing system problems, etc. But it remains underexplored how such environment-intensive tasks can be obtained at scale to ...

CLI-GymLiberCoderenvironment-intensive tasksDockerfileagentic task+2 more

Feb 11, 20268

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li, Aobo Kong +212 more

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with ...

Mixture-of-Expertssparse MoEfoundation modelactive parametersinterleaved attention+13 more

Feb 11, 2026140

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Qixing Zhou, Jiacheng Zhang, Haiyang Wang +9 more

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchm...

large language modelsagentic codingsoftware developmentfeature-oriented developmentexecution-based evaluation+6 more

Feb 11, 202614

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi, Yixiong Fang, Arnav Yayavaram +8 more

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed a...

coding agentsmultimodal counterpartsevaluation testbedsoftware developmentmultimodal understanding+6 more