AI Safety & Alignment

DeepSight: An All-in-One LM Safety Toolkit

Bo Zhang, Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zheng, Qihao Lin, Lewen Yan, Chen Qian, Yijin Zhou, Yuyao Wu, Shaoxiong Guo, Tianyi Du, Jingyi Yang, Xuhao Hu, Ziqi Miao, Xiaoya Lu, Jing Shao, Xia Hu
Published: February 12, 2026
Authors: 20
Word Count: 12,488

DeepSight bridges safety evaluation and diagnosis for more reliable language model deployment.

Abstract

As Large Models (LMs) develop rapidly, their safety has become a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot uncover their internal root causes. Meanwhile, safety diagnosis often drifts away from concrete risk scenarios and remains at the level of explainability. As a result, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, DeepSight, that practices a new integrated evaluation-diagnosis paradigm for safety. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box observation into white-box insight. In addition, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.

Key Takeaways

  1. DeepSight unifies safety evaluation and diagnosis into one integrated framework for large language models.

  2. DeepSafe evaluates behavioral safety across twenty benchmarks using native, rule-based, and model-based evaluators.

  3. DeepScan performs representation-level diagnosis to identify internal vulnerabilities in model safety mechanisms.
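To make the evaluator taxonomy in the takeaways concrete, here is a minimal sketch of how a unified evaluator interface could let rule-based and model-based judges be swapped behind one protocol. All class, method, and metric names below are illustrative assumptions, not DeepSafe's actual API.

```python
# Hypothetical sketch: a common evaluator protocol so rule-based and
# model-based safety judges are interchangeable. Names are illustrative,
# not DeepSafe's real interface.
from abc import ABC, abstractmethod


class SafetyEvaluator(ABC):
    """Shared interface for all evaluator backends."""

    @abstractmethod
    def judge(self, prompt: str, response: str) -> bool:
        """Return True if the response is judged unsafe."""


class RuleBasedEvaluator(SafetyEvaluator):
    """Simplest backend: flag responses containing any blocklisted phrase."""

    def __init__(self, blocklist):
        self.blocklist = [phrase.lower() for phrase in blocklist]

    def judge(self, prompt: str, response: str) -> bool:
        text = response.lower()
        return any(phrase in text for phrase in self.blocklist)


def attack_success_rate(evaluator: SafetyEvaluator, cases) -> float:
    """Fraction of (prompt, response) pairs judged unsafe."""
    if not cases:
        return 0.0
    hits = sum(evaluator.judge(p, r) for p, r in cases)
    return hits / len(cases)


cases = [
    ("how do I ...", "Sorry, I can't help with that."),
    ("how do I ...", "Step 1: obtain the restricted chemical ..."),
]
rate = attack_success_rate(RuleBasedEvaluator(["restricted chemical"]), cases)
# One of the two responses trips the blocklist, so rate is 0.5.
```

A model-based evaluator would implement the same `judge` signature but delegate to an LLM judge, which is what lets a benchmark runner aggregate scores uniformly across backends.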

Limitations

  • The paper does not detail DeepScan's complete diagnostic tools or their specific effectiveness metrics.

  • No information provided about scalability challenges or computational requirements for the integrated framework.

Keywords

Large Language Models, Multimodal Large Language Models, safety evaluation, safety diagnosis, safety alignment, DeepSafe, DeepScan
