Latest AI Safety & Alignment Research Papers

Research on ensuring AI systems are safe, aligned with human values, and behave as intended.

48 Papers

Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Rakshith Vasudev, Melisa Russak, Dan Bikel +1 more

Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentag...

LLM critic models · AUROC · proactive interventions · deployment time · disruption-recovery tradeoff · +3 more
Feb 3, 2026 · 25
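
A toy simulation makes the headline point concrete: a critic that ranks failures well offline can still lower end-to-end success once its interventions start disrupting trajectories that were on track. The base failure rate, trigger threshold, and recovery/disruption probabilities below are illustrative assumptions, not numbers from the paper.

```python
# Illustrative simulation (assumed rates, not the paper's data): strong
# offline AUROC does not guarantee that deployment-time interventions help.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
will_fail = rng.random(n) < 0.2                     # assumed 20% base failure rate

# Failing trajectories score higher on average -> strong offline AUROC.
scores = rng.normal(np.where(will_fail, 2.0, 0.0), 1.0)
print("offline AUROC:", round(roc_auc_score(will_fail, scores), 3))

intervene = scores > 0.5                            # deployment-time trigger
recovered = intervene & will_fail & (rng.random(n) < 0.3)   # assumed recovery rate
disrupted = intervene & ~will_fail & (rng.random(n) < 0.6)  # assumed disruption rate

print("success without critic:", (~will_fail).mean())
print("success with critic:   ", ((~will_fail & ~disrupted) | recovered).mean())
```

Under these assumed rates the intervention is net harmful even though the critic's AUROC exceeds 0.9: the disruption-recovery tradeoff named in the tags.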

SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Qingni Wang, Yue Fan, Xin Eric Wang

Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about...

GUI grounding · uncertainty quantification · calibration · false discovery rate · distribution-aware · +2 more
Feb 2, 2026 · 3
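
As a rough sketch of what "know when to trust" calibration looks like in practice, the snippet below picks a confidence threshold on held-out data so that the error rate among accepted predictions stays under a target. This is generic split-calibration thresholding, not necessarily SafeGround's procedure; the target level and data are made up.

```python
# Generic confidence-thresholding sketch: choose the loosest threshold whose
# accepted predictions keep empirical error under a target on held-out data.
import numpy as np

def select_threshold(conf, correct, target_error=0.05):
    order = np.argsort(-conf)                       # most confident first
    err = np.cumsum(~correct[order]) / np.arange(1, len(conf) + 1)
    ok = np.where(err <= target_error)[0]
    return conf[order][ok.max()] if len(ok) else np.inf  # inf = always abstain

# Synthetic calibration split where confidence is informative:
rng = np.random.default_rng(1)
conf = rng.random(5_000)
correct = rng.random(5_000) < conf
tau = select_threshold(conf, correct)
print(f"act only when grounding confidence >= {tau:.3f}; abstain otherwise")
```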

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

Ran Xu, Tianci Liu, Zihan Dong +6 more

Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rub...

reward models · reinforcement learning · preference feedback · rubric generator · judge · +6 more
Feb 2, 2026 · 4

SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

Maksim Afanasyev, Illarion Iov

Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Recent approaches have streamlined the alignment process by deriving implicit reward functions, yet they ofte...

direct preference optimization · Reinforcement Learning from Human Feedback · RLHF · Large Language Models · LLMs · +11 more
Feb 2, 2026 · 25
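
For context on the family SLIME belongs to, here is the standard DPO objective, whose implicit reward is r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)). SLIME's stabilized margin enforcement itself is not reproduced here.

```python
# Standard DPO loss (the baseline implicit-reward objective this line of
# work builds on; SLIME's specific margin term is not shown).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward: r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)).
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Sequence-level log-probabilities (summed over tokens) for two pairs:
print(dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.2, -9.0]),
               torch.tensor([-13.0, -9.8]), torch.tensor([-13.9, -9.2])))
```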

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Seanie Lee, Sangwoo Park, Yumin Choi +6 more

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this saf...

reinforcement learning · chain-of-thought reasoning · external teacher distillation · distributional discrepancy · lightweight refusal steering · +4 more
Jan 30, 2026 · 28

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang +3 more

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent...

large language models · adversarial prompting · Best-of-N sampling · jailbreak vulnerability · Beta distribution · +4 more
Jan 30, 2026 · 5
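
The core quantity is easy to state: if a single adversarial sample elicits harm with probability p, Best-of-N sampling succeeds with probability 1 − (1 − p)^N. A Beta posterior over p turns a handful of observed harmful completions into a risk estimate; the construction below is a standard one, shown as an assumption rather than the paper's exact estimator.

```python
# Best-of-N risk from sparse probe data: with p ~ Beta(a, b),
# E[(1 - p)^N] = B(a, b + N) / B(a, b), so the expected attack success
# probability under N parallel samples has a closed form.
import numpy as np
from scipy.special import betaln

k, n = 2, 500                      # assumed: 2 harmful responses in 500 probes
a, b = 1 + k, 1 + n - k            # Beta(1, 1) prior on p

for N in (1, 100, 10_000):
    miss = np.exp(betaln(a, b + N) - betaln(a, b))
    print(f"N = {N:>6}: P(at least one harmful response) ≈ {1 - miss:.3f}")
```

With these made-up counts, a per-sample harm rate near 0.6% already yields roughly 42% attack success at N = 100, which is why single-shot evaluation understates real-world risk.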

Latent Adversarial Regularization for Offline Preference Optimization

Enyi Jiang, Yibo Jacky Zhang, Yinglun Xu +3 more

Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity...

preference optimization · language models · latent-space regularization · token-level regularization · policy model · +7 more
Jan 29, 2026 · 11
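
For reference, the token-level regularizer the abstract contrasts with is typically a per-token KL penalty between the policy and a frozen reference model, as in standard RLHF pipelines. The sketch below shows that baseline penalty; the paper's latent-space alternative is not reproduced.

```python
# Token-level KL(policy || reference) penalty, the standard constraint that
# latent-space regularization is proposed to replace.
import torch
import torch.nn.functional as F

def token_kl_penalty(policy_logits, ref_logits, mask):
    """policy_logits, ref_logits: [batch, seq, vocab]; mask: [batch, seq]."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - logq)).sum(-1)       # per-token KL
    return (kl * mask).sum() / mask.sum()           # mean over real tokens
```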

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

Quy-Anh Dang, Chris Ngo

Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but existing methods suffer from critical limitations: activation add...

activation steering · angular steering · norm preservation · layer selection · feature representations · +5 more
Jan 27, 2026 · 5
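
The norm-preservation idea in the tags can be sketched in a few lines: apply the usual activation addition, then rescale each hidden state back to its original norm so the intervention changes direction but not magnitude. This is a generic sketch assuming a precomputed steering direction, not the paper's full layer-selection procedure.

```python
# Norm-preserving steering: shift activations along a unit direction v,
# then restore each state's original L2 norm.
import torch

def steer_norm_preserving(h, v, alpha=4.0):
    """h: [batch, d_model] activations at one layer; v: [d_model], unit norm."""
    orig = h.norm(dim=-1, keepdim=True)
    h_new = h + alpha * v                           # plain activation addition
    return h_new * (orig / h_new.norm(dim=-1, keepdim=True))
```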

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan, Wenhan Yu, Jianfeng Si +9 more

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three ro...

reinforcement learning · large language models · safety alignment · adversarial prompt generation · safety defense · +4 more
Jan 26, 2026 · 9

One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

Hongru Cai, Yongqi Li, Tiezheng Yu +4 more

Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, d...

meta-learning · reward modeling · personalized reward models · Model-Agnostic Meta-Learning · MAML · +3 more
Jan 26, 2026 · 6
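
Since the tags point at MAML, a generic second-order MAML step for a reward model looks like the sketch below: adapt on a user's few support pairs, then backpropagate the loss on that user's query pairs through the adaptation. This is generic MAML with a Bradley–Terry pair loss, an assumed setup rather than the paper's exact recipe.

```python
# Generic MAML meta-step for a reward model over per-user preference pairs.
import torch
import torch.nn.functional as F
from torch.func import functional_call

def pair_loss(model, params, batch):
    # Bradley-Terry loss on (chosen, rejected) inputs scored by the model.
    r_c = functional_call(model, params, (batch["chosen"],))
    r_r = functional_call(model, params, (batch["rejected"],))
    return -F.logsigmoid(r_c - r_r).mean()

def maml_meta_loss(model, users, inner_lr=1e-2):
    params = dict(model.named_parameters())
    total = 0.0
    for support, query in users:                    # one (support, query) per user
        grads = torch.autograd.grad(pair_loss(model, params, support),
                                    list(params.values()), create_graph=True)
        adapted = {k: p - inner_lr * g              # one inner adaptation step
                   for (k, p), g in zip(params.items(), grads)}
        total = total + pair_loss(model, adapted, query)
    return total / len(users)
```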

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian +40 more

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky b...

agentic guardrail · three-dimensional taxonomy · agentic safety benchmark · Diagnostic Guardrail framework · agent safety and security · +5 more
Jan 26, 2026 · 85

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun +2 more

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective...

language models · fine-tuning · privacy collapse · contextual privacy · safety evaluations · +2 more
Jan 21, 2026 · 8

On the Evidentiary Limits of Membership Inference for Copyright Auditing

Murat Bilgehan Ertan, Emirhan Böge, Min Chen +2 more

As large language models (LLMs) are trained on increasingly opaque corpora, membership inference attacks (MIAs) have been proposed to audit whether copyrighted texts were used during training, despite growing concerns about their reliability under realistic conditions. We ask whether MIAs can serve ...

membership inference attacks · large language models · paraphrasing framework · Sparse Autoencoders · SAE-guided extraction · +4 more
Jan 19, 2026 · 3
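
For orientation, the simplest membership-inference baseline such audits start from scores a text by its training loss (loss-thresholding in the style of Yeom et al.); the paper's paraphrasing framework and SAE-guided analysis are not reproduced here. The `model(input_ids, labels=...)` call assumes a HuggingFace-style causal LM interface.

```python
# Loss-thresholding MIA baseline: members tend to have lower LM loss.
import torch

@torch.no_grad()
def mia_score(model, input_ids):
    """Higher score => more plausibly a training-set member."""
    out = model(input_ids, labels=input_ids)  # assumed HF causal-LM interface
    return -out.loss.item()                   # negative mean token NLL

# In an audit, scores are thresholded against a reference distribution of
# known non-members; the paper examines when such evidence is reliable.
```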

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Shiyu Liu, Yongjing Yin, Jianhao Yan +7 more

RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recogniz...

reinforcement learning · agentic search · large language models · boundary-aware policy optimization · group-based boundary-aware reward · +4 more
Jan 16, 2026 · 13
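
One way to see the reliability gap: a purely accuracy-based reward gives an agent no reason to say "I don't know", so a boundary-aware scheme has to pay something for calibrated abstention. The shaping below is a hypothetical illustration of that idea, not BAPO's actual reward definition.

```python
# Hypothetical boundary-aware reward (illustration only, not BAPO's design):
# abstaining earns partial credit, so guessing beyond the agent's evidence
# is no longer the dominant strategy.
def boundary_aware_reward(correct, abstained, r_abstain=0.3, wrong_penalty=1.0):
    if abstained:
        return r_abstain        # calibrated "I don't know"
    return 1.0 if correct else -wrong_penalty
```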