Latest AI Safety & Alignment Research Papers

Research on ensuring AI systems are safe, aligned with human values, and behave as intended.

48 Papers

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng +5 more

The rise of AI agent frameworks has introduced agent skills, modular packages containing instructions and executable code that dynamically extend agent capabilities. While this architecture enables powerful customization, skills execute with implicit trust and minimal vetting, creating a significant...

AI agent frameworks, agent skills, security analysis, vulnerability detection, prompt injection +8 more
Jan 15, 2026

Fundamental Limitations of Favorable Privacy-Utility Guarantees for DP-SGD

Murat Bilgehan Ertan, Marten van Dijk

Differentially Private Stochastic Gradient Descent (DP-SGD) is the dominant paradigm for private training, but its fundamental limitations under worst-case adversarial privacy definitions remain poorly understood. We analyze DP-SGD in the f-differential privacy framework, which characterizes privacy...

stochastic gradient descent, differential privacy, f-differential privacy, shuffled sampling, Gaussian noise multiplier +4 more
Jan 15, 2026
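The DP-SGD abstract above refers to the standard mechanism the paper analyzes: clip each per-sample gradient to a fixed L2 norm, average, and add Gaussian noise calibrated to the clipping norm. A minimal illustrative sketch (function name, parameters, and defaults are my own, not from the paper), assuming NumPy:

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, clip_norm=1.0,
                noise_multiplier=1.0, lr=0.1, rng=None):
    """One DP-SGD update: per-sample clipping + Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_sample_grads:
        # Scale each per-sample gradient so its L2 norm is <= clip_norm.
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    batch = len(clipped)
    # Noise std is noise_multiplier * clip_norm (the per-sample sensitivity).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=np.shape(params))
    noisy_mean_grad = (np.sum(clipped, axis=0) + noise) / batch
    return params - lr * noisy_mean_grad
```

With `noise_multiplier=0` this reduces to plain clipped-gradient SGD, which is a convenient sanity check; the f-differential-privacy analysis in the paper concerns how the noise multiplier and sampling scheme translate into privacy guarantees.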

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Hongjun An, Yiliang Song, Jiangan Chen +3 more

Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from...

Jan 10, 2026

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang +15 more

Financial agents powered by large language models (LLMs) are increasingly deployed for investment analysis, risk assessment, and automated decision-making, where their abilities to plan, invoke tools, and manipulate mutable state introduce new security risks in high-stakes and highly regulated finan...

large language models, financial agents, security benchmark, state-writable databases, compliance constraints +4 more
Jan 9, 2026
Page 3 of 3