Latest AI Agents Research Papers

Research on autonomous AI agents, tool use, planning, and systems that can take actions to accomplish goals.

34 Papers

Agentic Uncertainty Quantification

Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang +2 more

Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the "Spiral of Hallucination," where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically a...

Dual-Process Agentic UQ, Uncertainty-Aware Memory, Uncertainty-Aware Reflection, Spiral of Hallucination, closed-loop benchmarks, +1 more
Jan 22, 2026 | 7

LLM-in-Sandbox Elicits General Agentic Intelligence

Daixuan Cheng, Shaohan Huang, Yuxian Gu +6 more

We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-cod...

LLM-in-Sandbox, code sandbox, virtual computer, reinforcement learning, non-agentic data, +4 more
Jan 22, 2026 | 63

EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

Taofeng Xue, Chong Peng, Mianqiu Huang +13 more

The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intri...

computer-use agents, native computer-use agents, data generation, policy optimization, evolutionary cycle, +9 more
Jan 22, 2026 | 75

AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization

Yusheng Liao, Chuan Xuan, Yutong Cai +4 more

Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records (EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental se...

Electronic Health Records, large language models, decision-making, retrospective summarization, evolving experience strategy, +4 more
Jan 20, 2026 | 5

Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics

Junqi Liu, Zihao Zhou, Zekai Zhu +10 more

Agentic systems have recently become the dominant paradigm for formal theorem proving, achieving strong performance by coordinating multiple models and tools. However, existing approaches often rely on task-specific pipelines and trained formal provers, limiting their flexibility and reproducibility...

agentic systems, formal theorem proving, general coding agent, Lean, Numina-Lean-MCP, +4 more
Jan 20, 2026 | 11

Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance

Qianli Ma, Chang Guo, Zhiheng Tian +4 more

Writing effective rebuttals is a high-stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct-to-text generation problem, suffering from hallucination, overlooked critiq...

multi-agents framework, evidence-centric planning, rebuttal generation, peer review, strategic coherence, +2 more
Jan 20, 2026 | 44

Aligning Agentic World Models via Knowledgeable Experience Learning

Baochang Ren, Yunzhi Yao, Rui Sun +3 more

Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer f...

Large Language Models, world models, physical hallucinations, alignment strategies, parametric encapsulation, +7 more
Jan 19, 2026 | 13

ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

Dawei Li, Yuguang Yao, Zhen Tan +2 more

Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models (PRMs) to provide step-level rewards, enabling more fine-gra...

process reward models, tool-using agents, reward-guided search, agent trajectories, step-level rewards, +4 more
Jan 18, 2026 | 14

Agentic-R: Learning to Retrieve for Agentic Search

Wenhan Liu, Xinyu Ma, Yutao Zhu +4 more

Agentic search has recently emerged as a powerful paradigm, where an agent interleaves multi-step reasoning with on-demand retrieval to solve complex questions. Despite its success, how to design a retriever for agentic search remains largely underexplored. Existing search agents typically rely on s...

agentic search, multi-step reasoning, on-demand retrieval, similarity-based retrievers, retrieval-augmented generation, +6 more
Jan 17, 2026 | 14

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini +82 more

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully ...

AI agents, long-horizon tasks, benchmarks, terminal environments, real-world tasks, +1 more
Jan 17, 2026 | 23

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Jie Yang, Honglin Guo, Li Ji +11 more

The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the...

Large Language Models, agentic backend coding, executable workflow, development lifecycle, containerized services, +1 more
Jan 16, 2026 | 61

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Keyu Li, Junhao Shi, Yang Xiao +10 more

Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on hum...

large language models, autonomous agents, benchmarks, agentic capabilities, user simulation agent, +5 more
Jan 16, 2026 | 30

Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

Caihua Li, Lianghong Guo, Yanlin Wang +9 more

Issue resolution, a complex Software Engineering (SWE) task integral to real-world development, has emerged as a compelling challenge for artificial intelligence. The establishment of benchmarks like SWE-bench revealed this task as profoundly difficult for large language models, thereby significantl...

large language models, software engineering, autonomous coding agents, training-free frameworks, supervised fine-tuning, +3 more
Jan 15, 2026 | 55