AI Safety & Alignment

DeepSight: An All-in-One LM Safety Toolkit

Bo Zhang, Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zheng, Qihao Lin, Lewen Yan, Chen Qian, Yijin Zhou, Yuyao Wu, Shaoxiong Guo, Tianyi Du, Jingyi Yang, Xuhao Hu, Ziqi Miao, Xiaoya Lu, Jing Shao, Xia Hu
Published: February 12, 2026
Authors: 20
Word Count: 12,488

DeepSight bridges safety evaluation and diagnosis for more reliable language model deployment.

Abstract

As Large Models (LMs) develop rapidly, their safety has become a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot uncover their internal root causes. Meanwhile, safety diagnosis often drifts away from concrete risk scenarios and remains at the level of explainability. As a result, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, DeepSight, that practices a new integrated evaluation-diagnosis paradigm for safety. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box observation into white-box insight. In addition, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.

Key Takeaways

  1. DeepSight unifies safety evaluation and diagnosis into one integrated framework for large language models.

  2. DeepSafe evaluates behavioral safety across twenty benchmarks using native, rule-based, and model-based evaluators.

  3. DeepScan performs representation-level diagnosis to identify internal vulnerabilities in model safety mechanisms.
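To make the evaluator taxonomy in the takeaways concrete, here is a minimal sketch of how a unified evaluator interface could let rule-based and model-based judges be swapped behind one protocol. All class, method, and metric names below are illustrative assumptions, not DeepSafe's actual API.

```python
# Hypothetical sketch: a common evaluator protocol so rule-based and
# model-based safety judges are interchangeable. Names are illustrative,
# not DeepSafe's real interface.
from abc import ABC, abstractmethod


class SafetyEvaluator(ABC):
    """Shared interface for all evaluator backends."""

    @abstractmethod
    def judge(self, prompt: str, response: str) -> bool:
        """Return True if the response is judged unsafe."""


class RuleBasedEvaluator(SafetyEvaluator):
    """Simplest backend: flag responses containing any blocklisted phrase."""

    def __init__(self, blocklist):
        self.blocklist = [phrase.lower() for phrase in blocklist]

    def judge(self, prompt: str, response: str) -> bool:
        text = response.lower()
        return any(phrase in text for phrase in self.blocklist)


def attack_success_rate(evaluator: SafetyEvaluator, cases) -> float:
    """Fraction of (prompt, response) pairs judged unsafe."""
    if not cases:
        return 0.0
    hits = sum(evaluator.judge(p, r) for p, r in cases)
    return hits / len(cases)


cases = [
    ("how do I ...", "Sorry, I can't help with that."),
    ("how do I ...", "Step 1: obtain the restricted chemical ..."),
]
rate = attack_success_rate(RuleBasedEvaluator(["restricted chemical"]), cases)
# One of the two responses trips the blocklist, so rate is 0.5.
```

A model-based evaluator would implement the same `judge` signature but delegate to an LLM judge, which is what lets a benchmark runner aggregate scores uniformly across backends.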

Limitations

  • The paper does not detail DeepScan's complete diagnostic tools or their specific effectiveness metrics.

  • No information provided about scalability challenges or computational requirements for the integrated framework.

Keywords

Large Language Models, Multimodal Large Language Models, safety evaluation, safety diagnosis, safety alignment, DeepSafe, DeepScan
