Natural Language Processing

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina

Published: February 15, 2026 · Authors: 6 · Word Count: 10,223

Sparse Autoencoders' apparent success may reflect random structure rather than genuine feature learning.

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
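To make the baseline construction concrete, below is a minimal PyTorch sketch of a standard ReLU SAE together with a random-direction variant in the spirit of the paper's baselines: the decoder's feature directions are frozen at random initialization, so only the activation patterns are learned. The class, argument names, and L1 objective here are illustrative assumptions, not the authors' implementation; per the abstract, the paper's other baselines instead constrain the activation patterns to random values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: x_hat = f @ W_dec + b_dec, where
    f = relu((x - b_dec) @ W_enc + b_enc) are the sparse feature activations."""

    def __init__(self, d_model: int, n_features: int, random_decoder: bool = False):
        super().__init__()
        self.W_enc = nn.Parameter(0.01 * torch.randn(d_model, n_features))
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        with torch.no_grad():
            # Unit-norm decoder rows: each row is one feature direction.
            self.W_dec /= self.W_dec.norm(dim=1, keepdim=True)
        if random_decoder:
            # Random-direction baseline (assumed construction): feature
            # directions stay at random init; only the encoder is trained.
            self.W_dec.requires_grad_(False)

    def forward(self, x: torch.Tensor):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse codes
        x_hat = f @ self.W_dec + self.b_dec                          # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Standard SAE objective: squared reconstruction error + L1 sparsity penalty.
    return (x - x_hat).pow(2).sum(-1).mean() + l1_coeff * f.abs().sum(-1).mean()
```

Training both variants on the same activations and comparing them on interpretability, sparse probing, and causal editing is the sanity check the paper performs; the finding is that the frozen-random variant keeps pace with the fully trained one.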

Key Takeaways

  1. Sparse Autoencoders may achieve high metrics while learning nothing more than random baselines.

  2. Current SAE evaluation metrics don't adequately test whether models learn meaningful features.

  3. Systematic sanity checks using synthetic data with known ground-truth features reveal fundamental problems in mechanistic interpretability research (see the sketch after this list).
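On synthetic data where the generating feature directions are known, the sanity check in takeaway 3 reduces to comparing two numbers: explained variance of the reconstruction and the fraction of ground-truth directions the learned decoder recovers. The sketch below computes both under an assumed matching rule (max cosine similarity above a 0.9 threshold); the paper's exact recovery criterion may differ. The abstract's headline dissociation is a high explained variance (about 0.71) coexisting with a low recovery rate (about 0.09).

```python
import torch

def feature_recovery_rate(W_true: torch.Tensor, W_dec: torch.Tensor,
                          thresh: float = 0.9) -> float:
    """Fraction of ground-truth directions matched by some learned decoder
    direction with cosine similarity above `thresh` (rows = directions).
    The threshold and matching rule are illustrative assumptions."""
    W_true = W_true / W_true.norm(dim=1, keepdim=True)
    W_dec = W_dec / W_dec.norm(dim=1, keepdim=True)
    cos = W_true @ W_dec.T  # (n_true, n_learned) cosine similarities
    return (cos.max(dim=1).values > thresh).float().mean().item()

def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """1 - (residual sum of squares / total variance of the activations)."""
    rss = (x - x_hat).pow(2).sum()
    tss = (x - x.mean(dim=0)).pow(2).sum()
    return (1.0 - rss / tss).item()
```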

Limitations

  • The analysis emphasizes controlled synthetic experiments over real-world neural network scenarios.

  • Current evaluation metrics lack a ground-truth comparison for assessing genuine feature discovery.

Keywords

Sparse Autoencoders, neural networks, activations, explained variance, interpretability, sparse probing, causal editing
