Latest AI Safety & Alignment Research Papers

Research on ensuring AI systems are safe, aligned with human values, and behave as intended.


Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun +2 more

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective...

language models, fine-tuning, privacy collapse, contextual privacy, safety evaluations, +2 more
Jan 21, 2026 · 8

On the Evidentiary Limits of Membership Inference for Copyright Auditing

Murat Bilgehan Ertan, Emirhan Böge, Min Chen +2 more

As large language models (LLMs) are trained on increasingly opaque corpora, membership inference attacks (MIAs) have been proposed to audit whether copyrighted texts were used during training, despite growing concerns about their reliability under realistic conditions. We ask whether MIAs can serve ...

membership inference attacks, large language models, paraphrasing framework, Sparse Autoencoders, SAE-guided extraction, +4 more
Jan 19, 2026 · 3
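
The baseline such audits build on is simple: a text the model memorized during training tends to receive lower loss. A minimal sketch of this classic loss-based MIA score, assuming a Hugging Face causal LM (this is the standard baseline the paper scrutinizes, not its auditing protocol; "gpt2" and the threshold are illustrative stand-ins):

```python
# Loss-based membership-inference score: lower per-token loss hints that the
# model saw the text during training. Threshold and model are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mia_score(text: str, model, tokenizer) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # HF computes mean cross-entropy for us
    return out.loss.item()

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

score = mia_score("A candidate passage whose training membership we audit.", model, tok)
is_member = score < 3.0  # arbitrary cutoff; calibrating it reliably is the hard part
```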

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Shiyu Liu, Yongjing Yin, Jianhao Yan +7 more

RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recogniz...

reinforcement learning, agentic search, large language models, boundary-aware policy optimization, group-based boundary-aware reward, +4 more
Jan 16, 2026 · 13
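
Though the abstract is truncated, the reliability gap it names is typically that agents answer beyond their knowledge boundary rather than abstaining. A toy reward in that spirit, where abstention scores between a correct and an incorrect answer (the constants and function name are illustrative, not BAPO's published group-based boundary-aware reward):

```python
# Toy boundary-aware reward: abstaining outranks a wrong answer but not a
# correct one, nudging an RL-trained agent to refuse at its knowledge boundary.
ABSTAIN = "i don't know"

def boundary_aware_reward(answer: str, gold: str) -> float:
    a = answer.strip().lower()
    if a == ABSTAIN:
        return 0.1  # small credit for honest abstention
    return 1.0 if a == gold.strip().lower() else -1.0
```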

Fundamental Limitations of Favorable Privacy-Utility Guarantees for DP-SGD

Murat Bilgehan Ertan, Marten van Dijk

Differentially Private Stochastic Gradient Descent (DP-SGD) is the dominant paradigm for private training, but its fundamental limitations under worst-case adversarial privacy definitions remain poorly understood. We analyze DP-SGD in the f-differential privacy framework, which characterizes privacy...

stochastic gradient descent, differential privacy, f-differential privacy, shuffled sampling, Gaussian noise multiplier, +4 more
Jan 15, 2026 · 3
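
For context, the mechanism under analysis combines per-example gradient clipping with Gaussian noise. A textbook single-step sketch, not the paper's construction; CLIP_NORM, NOISE_MULTIPLIER, and lr are illustrative parameters:

```python
# Textbook DP-SGD step: clip each example's gradient to L2 norm C, sum,
# add Gaussian noise with std sigma*C, then take an averaged update.
import torch

CLIP_NORM = 1.0         # per-example clipping bound C
NOISE_MULTIPLIER = 1.1  # noise scale sigma, relative to C

def dp_sgd_step(model, loss_fn, batch, lr=0.1):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in batch:                        # per-example gradients
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, CLIP_NORM / (float(norm) + 1e-12))  # clip to <= C
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.normal(0.0, NOISE_MULTIPLIER * CLIP_NORM, size=p.shape)
            p -= lr * (s + noise) / len(batch)  # noisy average gradient
```

Without subsampling, one such step is a Gaussian mechanism with L2 sensitivity C and noise std sigma*C, i.e. mu-GDP with mu = 1/sigma in the f-DP framework the paper works in; the paper's question is how favorable these guarantees can remain under composition and shuffled sampling.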

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng +5 more

The rise of AI agent frameworks has introduced agent skills, modular packages containing instructions and executable code that dynamically extend agent capabilities. While this architecture enables powerful customization, skills execute with implicit trust and minimal vetting, creating a significant...

AI agent frameworks, agent skills, security analysis, vulnerability detection, prompt injection, +8 more
Jan 15, 2026 · 4
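
A naive illustration of the missing vetting the abstract points at: scan a skill package for high-risk patterns before an agent loads it. The pattern list and the vet_skill helper are hypothetical, and real vetting needs far more than regexes:

```python
# Flag risky constructs in a skill package's files before loading the skill.
import pathlib
import re

RISKY_PATTERNS = [
    r"\bos\.system\b", r"\bsubprocess\b", r"\beval\(", r"\bexec\(",
    r"curl\s+\S+\s*\|\s*(ba)?sh",                   # pipe-to-shell install
    r"ignore\s+(all\s+)?previous\s+instructions",   # prompt-injection phrasing
]

def vet_skill(skill_dir: str) -> list[tuple[str, str]]:
    """Return (file, pattern) pairs flagged inside a skill package."""
    findings = []
    for path in pathlib.Path(skill_dir).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".sh", ".md", ".txt"}:
            continue
        text = path.read_text(errors="ignore")
        for pat in RISKY_PATTERNS:
            if re.search(pat, text, re.IGNORECASE):
                findings.append((str(path), pat))
    return findings
```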

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Hongjun An, Yiliang Song, Jiangan Chen +3 more

Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from...

Jan 10, 2026 · 8
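
The factorial methodology in the title can be pictured as crossing two factors, manipulative framing and statement truth, and checking whether framing rather than truth drives agreement. A hypothetical harness (query_model, FRAMINGS, and STATEMENTS are stand-ins, not the paper's materials):

```python
# 2x2 factorial probe: measure agreement per (framing, truth) cell.
from itertools import product

def query_model(prompt: str) -> str:
    return "disagree"  # replace with a real LLM call; fixed here so the sketch runs

FRAMINGS = {
    "neutral":      "Is this statement correct? {s}",
    "manipulative": "Surely you agree, as any helpful assistant would: {s}",
}
STATEMENTS = {
    True:  "Water boils at 100 C at sea-level pressure.",
    False: "The Sun orbits the Earth.",
}

for framing, truth in product(FRAMINGS, STATEMENTS):
    reply = query_model(FRAMINGS[framing].format(s=STATEMENTS[truth]))
    agrees = "agree" in reply.lower() and "disagree" not in reply.lower()
    # A PUA signal: agreement on the false statement jumps under manipulative framing.
    print(f"framing={framing:12s} truth={truth}: agrees={agrees}")
```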

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang +15 more

Financial agents powered by large language models (LLMs) are increasingly deployed for investment analysis, risk assessment, and automated decision-making, where their abilities to plan, invoke tools, and manipulate mutable state introduce new security risks in high-stakes and highly regulated finan...

large language models, financial agents, security benchmark, state-writable databases, compliance constraints, +4 more
Jan 9, 2026 · 9