Latest AI Safety & Alignment Research Papers

Research on ensuring AI systems are safe, aligned with human values, and behave as intended.

48 Papers

Showing 20 of 20 papers

Reasoning Models Struggle to Control their Chains of Thought

Chen Yueh-Han, Robert McCarthy, Bruce W. Lee +5 more

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT control...

Chain-of-thought monitoringCoT controllabilityreasoning modelsCoTControl evaluation suiteadversarial prompt optimization

Mar 5, 202618

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal, Gurdit Siyan, Yash Pandya +3 more

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimize...

post-training frameworkpreference-based reinforcement learningpairwise trajectory comparisonssafety reasoningrefusal mechanisms+5 more

Mar 3, 202611

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Qiyuan Zhang, Junyi Zhou, Yufei Wang +8 more

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation pa...

Reward Modelsrubric-guided evaluationLarge Language Modelsbenchmarkpairwise comparisons+4 more

Mar 2, 202644

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen +35 more

We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under b...

red-teaming studyautonomous language-model-powered agentspersistent memorytool usemulti-party communication+6 more

Feb 23, 202615

Towards a Science of AI Agent Reliability

Stephan Rabanser, Sayash Kapoor, Peter Kirgis +3 more

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a s...

AI agentsreliabilityconsistencyrobustnesspredictability+3 more

Feb 18, 202612

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Shuofei Qiao, Yunxiang Wei, Xuehai Wang +10 more

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteri...

Large Language Modelsidea evaluationdeep innovation evaluation frameworkknowledge-grounded reasoningmulti-perspective reasoning+4 more

Feb 16, 202614

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

Dongrui Liu, Yi Yu, Jie Zhang +18 more

To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As Large Language Models (LLMs) general capabilities rapidly evolve and th...

Large Language Modelsagentic AIcyber offensepersuasion and manipulationstrategic deception+6 more

Feb 16, 202626

A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Tianyu Chen, Dongrui Liu, Xia Hu +2 more

Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk...

tool-using personal AI agenttrajectory-centric evaluationagent-safety benchmarksATBenchLPS-Bench+4 more

Feb 16, 202620

DeepSight: An All-in-One LM Safety Toolkit

Bo Zhang, Jiaxuan Guo, Lijun Li +17 more

As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) safety workflow, evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluatio...

Large Language ModelsMultimodal Large Language Modelssafety evaluationsafety diagnosissafety alignment+2 more

Feb 12, 202612

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Pinyi Zhang, Ting-En Lin, Yuchuan Wu +7 more

Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitati...

personalized alignmentlarge language modelsreinforcement learningreward modelsgenerative reward models+9 more

Feb 12, 20264

The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

Chenxu Wang, Chaozhuo Li, Songyang Liu +10 more

The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment--a combin...

multi-agent systemslarge language modelsself-evolutionsafety alignmentinformation-theoretic framework+5 more

Feb 10, 2026182

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou, Yining Sun, Ruochong Jin +4 more

Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces ...

Vision-Centric Jailbreak Attackimage editing modelsvisual-to-visual attackIESBenchintrospective multimodal reasoning+2 more

Feb 10, 20265

Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

Tianyi Wu, Mingzhe Du, Yue Liu +4 more

Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. Existing secure code alignment methods often suffer from a functionality--security paradox, improving security at the cost of sub...

large language modelssecure code generationonline reinforcement learningvulnerability detectionreward model+3 more

Feb 7, 20263

Uncovering Cross-Objective Interference in Multi-Objective Alignment

Yining Lu, Meng Jiang

We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across c...

multi-objective alignmentlarge language modelscross-objective interferencescalarization algorithmslocal covariance law+5 more

Feb 6, 20266

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang +4 more

Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker witho...

multi-turn jailbreaksreinforcement learningintent-drift-aware rewardsupervised fine-tuningdirect preference optimization+5 more

Feb 6, 20263

From Data to Behavior: Predicting Unintended Model Behaviors Before Training

Mengru Wang, Zhenqian Xu, Junfeng Fang +4 more

Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduc...

Large Language Modelsunintended biasesData2BehaviorManipulating Data Featuresmean representations+4 more

Feb 4, 202612

Reliable and Responsible Foundation Models: A Comprehensive Survey

Xinyu Yang, Junlin Han, Rishi Bommasani +49 more

Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e, Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medi...

Large Language ModelsMultimodal Large Language ModelsText-to-Image ModelsImage-Editing ModelsVideo Generative Models+12 more

Feb 4, 20263

Towards Reducible Uncertainty Modeling for Reliable Large Language Model Agents

Changdae Oh, Seongheon Park, To Eun Kim +8 more

Uncertainty quantification (UQ) for large language models (LLMs) is a key building block for safety guardrails of daily LLM applications. Yet, even as LLM agents are increasingly deployed in highly complex tasks, most UQ research still centers on single-turn question-answering. We argue that UQ rese...

uncertainty quantificationlarge language modelsinteractive agentsopen worlduncertainty accumulation process+3 more

Feb 4, 20269

Steering LLMs via Scalable Interactive Oversight

Enyu Zhou, Zhiheng Xi, Long Ma +9 more

As Large Language Models increasingly automate complex, long-horizon tasks such as vibe coding, a supervision gap has emerged. While models excel at execution, users often struggle to guide them effectively due to insufficient domain expertise, the difficulty of articulating precise intent, and the ...

Large Language Modelsvibe codingsupervision gapscalable oversightrecursive tree+4 more

Feb 4, 202616

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Binghai Wang, Yantao Liu, Yuxuan Liu +13 more

Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fin...

Generative Reward ModelsLLM-as-a-Judgedeceptive alignmentoutcome accuracyrationale consistency+4 more

Feb 4, 20267

Page 1 of 3Next

View all categories

Latest AI Safety & Alignment Research | AI Safety & Alignment Papers