AI Safety & Alignment

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aaradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
Published: March 3, 2026
Authors: 6
Word count: 11,624
Code: included

MOSAIC makes agentic AI safety explicit and learnable through structured reasoning and trajectory-level preferences.

Abstract

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
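The abstract describes inference as a plan, check, then act-or-refuse loop with refusal as a first-class action. A minimal sketch of that control flow is below; the function names (`plan_step`, `safety_check`, `execute_tool`) and the `Step` structure are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    tool: str
    args: dict

def run_episode(task: str,
                plan_step: Callable[[str, list], Optional[Step]],
                safety_check: Callable[[Step], bool],
                execute_tool: Callable[[Step], str],
                max_steps: int = 8) -> list:
    """Run a multi-step tool-use episode with an explicit safety gate."""
    history = []
    for _ in range(max_steps):
        step = plan_step(task, history)      # plan: propose the next action
        if step is None:                     # planner signals task completion
            break
        if not safety_check(step):           # check: explicit safety reasoning
            history.append(("refuse", step)) # refuse as a first-class action
            break
        history.append(("act", execute_tool(step)))  # act: call the tool
    return history
```

The point of the structure is that the safety decision happens at every step, before any tool call, so a single unsafe intermediate action (e.g. exfiltrating credentials) can be refused without abandoning the whole episode format.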

Key Takeaways

  1. MOSAIC structures agentic reasoning as plan-check-act/refuse loops, making safety decisions explicit and learnable rather than implicit.

  2. Preference-based reinforcement learning with pairwise trajectory comparisons captures safety distinctions that scalar rewards miss.

  3. MOSAIC reduces harmful behavior by up to 50% across multiple model families while preserving benign task performance and token efficiency.
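The second takeaway hinges on learning from pairwise trajectory comparisons rather than scalar rewards. One standard way to do this is a Bradley-Terry-style preference loss over whole-trajectory scores; the sketch below assumes that formulation, which the summary does not confirm is MOSAIC's exact objective.

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """-log sigmoid(s_w - s_l) over trajectory-level scores.

    score_preferred / score_rejected are assumed scalar scores a model
    assigns to the safer and less-safe trajectory in a pair. The loss is
    small when the safer trajectory is scored higher by a wide margin.
    """
    margin = score_preferred - score_rejected
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return math.log1p(math.exp(-margin))
```

A scalar reward must assign each trajectory a number in isolation; a pairwise loss only needs the *ordering* within each pair, which is why it can separate two trajectories that a coarse scalar signal would score identically.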

Limitations

  • Approach requires preference-based trajectory labeling, which may not scale to all safety scenarios or edge cases.

  • Evaluation limited to three open-weight model families; generalization to other architectures or deployment contexts remains unclear.

Keywords

post-training framework, preference-based reinforcement learning, pairwise trajectory comparisons, safety reasoning, refusal mechanisms, agentic language models, tool use, trajectory-level labels, scalar rewards, benign task performance
