Latest AI Safety & Alignment Research Papers

Research on ensuring AI systems are safe, aligned with human values, and behave as intended.

48 Papers

Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Rakshith Vasudev, Melisa Russak, Dan Bikel +1 more

Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentag...

LLM critic models · AUROC · proactive interventions · deployment time · disruption-recovery tradeoff · +3 more
Feb 3, 2026 · 25
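
A toy simulation makes the headline point concrete: a critic that ranks failures well offline can still lower end-to-end success once its interventions start disrupting trajectories that were on track. The base failure rate, trigger threshold, and recovery/disruption probabilities below are illustrative assumptions, not numbers from the paper.

```python
# Illustrative simulation (assumed rates, not the paper's data): strong
# offline AUROC does not guarantee that deployment-time interventions help.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
will_fail = rng.random(n) < 0.2                     # assumed 20% base failure rate

# Failing trajectories score higher on average -> strong offline AUROC.
scores = rng.normal(np.where(will_fail, 2.0, 0.0), 1.0)
print("offline AUROC:", round(roc_auc_score(will_fail, scores), 3))

intervene = scores > 0.5                            # deployment-time trigger
recovered = intervene & will_fail & (rng.random(n) < 0.3)   # assumed recovery rate
disrupted = intervene & ~will_fail & (rng.random(n) < 0.6)  # assumed disruption rate

print("success without critic:", (~will_fail).mean())
print("success with critic:   ", ((~will_fail & ~disrupted) | recovered).mean())
```

Under these assumed rates the intervention is net harmful even though the critic's AUROC exceeds 0.9: the disruption-recovery tradeoff named in the tags.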

SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Qingni Wang, Yue Fan, Xin Eric Wang

Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about...

GUI grounding · uncertainty quantification · calibration · false discovery rate · distribution-aware · +2 more
Feb 2, 2026 · 3
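
As a rough sketch of what "know when to trust" calibration looks like in practice, the snippet below picks a confidence threshold on held-out data so that the error rate among accepted predictions stays under a target. This is generic split-calibration thresholding, not necessarily SafeGround's procedure; the target level and data are made up.

```python
# Generic confidence-thresholding sketch: choose the loosest threshold whose
# accepted predictions keep empirical error under a target on held-out data.
import numpy as np

def select_threshold(conf, correct, target_error=0.05):
    order = np.argsort(-conf)                       # most confident first
    err = np.cumsum(~correct[order]) / np.arange(1, len(conf) + 1)
    ok = np.where(err <= target_error)[0]
    return conf[order][ok.max()] if len(ok) else np.inf  # inf = always abstain

# Synthetic calibration split where confidence is informative:
rng = np.random.default_rng(1)
conf = rng.random(5_000)
correct = rng.random(5_000) < conf
tau = select_threshold(conf, correct)
print(f"act only when grounding confidence >= {tau:.3f}; abstain otherwise")
```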

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

Ran Xu, Tianci Liu, Zihan Dong +6 more

Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rub...

reward models · reinforcement learning · preference feedback · rubric generator · judge · +6 more
Feb 2, 2026 · 4

SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

Maksim Afanasyev, Illarion Iov

Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Recent approaches have streamlined the alignment process by deriving implicit reward functions, yet they ofte...

direct preference optimization · Reinforcement Learning from Human Feedback · RLHF · Large Language Models · LLMs · +11 more
Feb 2, 2026 · 25
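
For context on the family SLIME belongs to, here is the standard DPO objective, whose implicit reward is r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)). SLIME's stabilized margin enforcement itself is not reproduced here.

```python
# Standard DPO loss (the baseline implicit-reward objective this line of
# work builds on; SLIME's specific margin term is not shown).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward: r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)).
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Sequence-level log-probabilities (summed over tokens) for two pairs:
print(dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.2, -9.0]),
               torch.tensor([-13.0, -9.8]), torch.tensor([-13.9, -9.2])))
```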

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Seanie Lee, Sangwoo Park, Yumin Choi +6 more

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this saf...

reinforcement learning · chain-of-thought reasoning · external teacher distillation · distributional discrepancy · lightweight refusal steering · +4 more
Jan 30, 2026 · 28

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang +3 more

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent...

large language models · adversarial prompting · Best-of-N sampling · jailbreak vulnerability · Beta distribution · +4 more
Jan 30, 2026 · 5
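
The core quantity is easy to state: if a single adversarial sample elicits harm with probability p, Best-of-N sampling succeeds with probability 1 − (1 − p)^N. A Beta posterior over p turns a handful of observed harmful completions into a risk estimate; the construction below is a standard one, shown as an assumption rather than the paper's exact estimator.

```python
# Best-of-N risk from sparse probe data: with p ~ Beta(a, b),
# E[(1 - p)^N] = B(a, b + N) / B(a, b), so the expected attack success
# probability under N parallel samples has a closed form.
import numpy as np
from scipy.special import betaln

k, n = 2, 500                      # assumed: 2 harmful responses in 500 probes
a, b = 1 + k, 1 + n - k            # Beta(1, 1) prior on p

for N in (1, 100, 10_000):
    miss = np.exp(betaln(a, b + N) - betaln(a, b))
    print(f"N = {N:>6}: P(at least one harmful response) ≈ {1 - miss:.3f}")
```

With these made-up counts, a per-sample harm rate near 0.6% already yields roughly 42% attack success at N = 100, which is why single-shot evaluation understates real-world risk.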

Latent Adversarial Regularization for Offline Preference Optimization

Enyi Jiang, Yibo Jacky Zhang, Yinglun Xu +3 more

Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity...

preference optimization · language models · latent-space regularization · token-level regularization · policy model · +7 more
Jan 29, 2026 · 11
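
For reference, the token-level regularizer the abstract contrasts with is typically a per-token KL penalty between the policy and a frozen reference model, as in standard RLHF pipelines. The sketch below shows that baseline penalty; the paper's latent-space alternative is not reproduced.

```python
# Token-level KL(policy || reference) penalty, the standard constraint that
# latent-space regularization is proposed to replace.
import torch
import torch.nn.functional as F

def token_kl_penalty(policy_logits, ref_logits, mask):
    """policy_logits, ref_logits: [batch, seq, vocab]; mask: [batch, seq]."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - logq)).sum(-1)       # per-token KL
    return (kl * mask).sum() / mask.sum()           # mean over real tokens
```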

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

Quy-Anh Dang, Chris Ngo

Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but existing methods suffer from critical limitations: activation add...

activation steering · angular steering · norm preservation · layer selection · feature representations · +5 more
Jan 27, 2026 · 5
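
The norm-preservation idea in the tags can be sketched in a few lines: apply the usual activation addition, then rescale each hidden state back to its original norm so the intervention changes direction but not magnitude. This is a generic sketch assuming a precomputed steering direction, not the paper's full layer-selection procedure.

```python
# Norm-preserving steering: shift activations along a unit direction v,
# then restore each state's original L2 norm.
import torch

def steer_norm_preserving(h, v, alpha=4.0):
    """h: [batch, d_model] activations at one layer; v: [d_model], unit norm."""
    orig = h.norm(dim=-1, keepdim=True)
    h_new = h + alpha * v                           # plain activation addition
    return h_new * (orig / h_new.norm(dim=-1, keepdim=True))
```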

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan, Wenhan Yu, Jianfeng Si +9 more

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three ro...

reinforcement learning · large language models · safety alignment · adversarial prompt generation · safety defense · +4 more
Jan 26, 2026 · 9

One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

Hongru Cai, Yongqi Li, Tiezheng Yu +4 more

Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, d...

meta-learning · reward modeling · personalized reward models · Model-Agnostic Meta-Learning · MAML · +3 more
Jan 26, 2026 · 6
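
Since the tags point at MAML, a generic second-order MAML step for a reward model looks like the sketch below: adapt on a user's few support pairs, then backpropagate the loss on that user's query pairs through the adaptation. This is generic MAML with a Bradley–Terry pair loss, an assumed setup rather than the paper's exact recipe.

```python
# Generic MAML meta-step for a reward model over per-user preference pairs.
import torch
import torch.nn.functional as F
from torch.func import functional_call

def pair_loss(model, params, batch):
    # Bradley-Terry loss on (chosen, rejected) inputs scored by the model.
    r_c = functional_call(model, params, (batch["chosen"],))
    r_r = functional_call(model, params, (batch["rejected"],))
    return -F.logsigmoid(r_c - r_r).mean()

def maml_meta_loss(model, users, inner_lr=1e-2):
    params = dict(model.named_parameters())
    total = 0.0
    for support, query in users:                    # one (support, query) per user
        grads = torch.autograd.grad(pair_loss(model, params, support),
                                    list(params.values()), create_graph=True)
        adapted = {k: p - inner_lr * g              # one inner adaptation step
                   for (k, p), g in zip(params.items(), grads)}
        total = total + pair_loss(model, adapted, query)
    return total / len(users)
```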

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian +40 more

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky b...

agentic guardrail · three-dimensional taxonomy · agentic safety benchmark · Diagnostic Guardrail framework · agent safety and security · +5 more
Jan 26, 2026 · 85

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun +2 more

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective...

language models · fine-tuning · privacy collapse · contextual privacy · safety evaluations · +2 more
Jan 21, 2026 · 8

On the Evidentiary Limits of Membership Inference for Copyright Auditing

Murat Bilgehan Ertan, Emirhan Böge, Min Chen +2 more

As large language models (LLMs) are trained on increasingly opaque corpora, membership inference attacks (MIAs) have been proposed to audit whether copyrighted texts were used during training, despite growing concerns about their reliability under realistic conditions. We ask whether MIAs can serve ...

membership inference attacks · large language models · paraphrasing framework · Sparse Autoencoders · SAE-guided extraction · +4 more
Jan 19, 2026 · 3
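
For orientation, the simplest membership-inference baseline such audits start from scores a text by its training loss (loss-thresholding in the style of Yeom et al.); the paper's paraphrasing framework and SAE-guided analysis are not reproduced here. The `model(input_ids, labels=...)` call assumes a HuggingFace-style causal LM interface.

```python
# Loss-thresholding MIA baseline: members tend to have lower LM loss.
import torch

@torch.no_grad()
def mia_score(model, input_ids):
    """Higher score => more plausibly a training-set member."""
    out = model(input_ids, labels=input_ids)  # assumed HF causal-LM interface
    return -out.loss.item()                   # negative mean token NLL

# In an audit, scores are thresholded against a reference distribution of
# known non-members; the paper examines when such evidence is reliable.
```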

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Shiyu Liu, Yongjing Yin, Jianhao Yan +7 more

RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recogniz...

reinforcement learning · agentic search · large language models · boundary-aware policy optimization · group-based boundary-aware reward · +4 more
Jan 16, 2026 · 13
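
One way to see the reliability gap: a purely accuracy-based reward gives an agent no reason to say "I don't know", so a boundary-aware scheme has to pay something for calibrated abstention. The shaping below is a hypothetical illustration of that idea, not BAPO's actual reward definition.

```python
# Hypothetical boundary-aware reward (illustration only, not BAPO's design):
# abstaining earns partial credit, so guessing beyond the agent's evidence
# is no longer the dominant strategy.
def boundary_aware_reward(correct, abstained, r_abstain=0.3, wrong_penalty=1.0):
    if abstained:
        return r_abstain        # calibrated "I don't know"
    return 1.0 if correct else -wrong_penalty
```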