Natural Language Processing

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina

Published: February 15, 2026 · Authors: 6 · Word Count: 10,223

Sparse Autoencoders' apparent success may reflect random structure rather than genuine feature learning.

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
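To make the baseline construction concrete, below is a minimal PyTorch sketch of a standard ReLU SAE together with a random-direction variant in the spirit of the paper's baselines: the decoder's feature directions are frozen at random initialization, so only the activation patterns are learned. The class, argument names, and L1 objective here are illustrative assumptions, not the authors' implementation; per the abstract, the paper's other baselines instead constrain the activation patterns to random values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: x_hat = f @ W_dec + b_dec, where
    f = relu((x - b_dec) @ W_enc + b_enc) are the sparse feature activations."""

    def __init__(self, d_model: int, n_features: int, random_decoder: bool = False):
        super().__init__()
        self.W_enc = nn.Parameter(0.01 * torch.randn(d_model, n_features))
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        with torch.no_grad():
            # Unit-norm decoder rows: each row is one feature direction.
            self.W_dec /= self.W_dec.norm(dim=1, keepdim=True)
        if random_decoder:
            # Random-direction baseline (assumed construction): feature
            # directions stay at random init; only the encoder is trained.
            self.W_dec.requires_grad_(False)

    def forward(self, x: torch.Tensor):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse codes
        x_hat = f @ self.W_dec + self.b_dec                          # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Standard SAE objective: squared reconstruction error + L1 sparsity penalty.
    return (x - x_hat).pow(2).sum(-1).mean() + l1_coeff * f.abs().sum(-1).mean()
```

Training both variants on the same activations and comparing them on interpretability, sparse probing, and causal editing is the sanity check the paper performs; the finding is that the frozen-random variant keeps pace with the fully trained one.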

Key Takeaways

  1. Sparse Autoencoders may achieve high metrics while learning nothing more than random baselines.

  2. Current SAE evaluation metrics don't adequately test whether models learn meaningful features.

  3. Systematic sanity checks using synthetic data with known ground-truth features reveal fundamental problems in mechanistic interpretability research (see the sketch after this list).
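On synthetic data where the generating feature directions are known, the sanity check in takeaway 3 reduces to comparing two numbers: explained variance of the reconstruction and the fraction of ground-truth directions the learned decoder recovers. The sketch below computes both under an assumed matching rule (max cosine similarity above a 0.9 threshold); the paper's exact recovery criterion may differ. The abstract's headline dissociation is a high explained variance (about 0.71) coexisting with a low recovery rate (about 0.09).

```python
import torch

def feature_recovery_rate(W_true: torch.Tensor, W_dec: torch.Tensor,
                          thresh: float = 0.9) -> float:
    """Fraction of ground-truth directions matched by some learned decoder
    direction with cosine similarity above `thresh` (rows = directions).
    The threshold and matching rule are illustrative assumptions."""
    W_true = W_true / W_true.norm(dim=1, keepdim=True)
    W_dec = W_dec / W_dec.norm(dim=1, keepdim=True)
    cos = W_true @ W_dec.T  # (n_true, n_learned) cosine similarities
    return (cos.max(dim=1).values > thresh).float().mean().item()

def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """1 - (residual sum of squares / total variance of the activations)."""
    rss = (x - x_hat).pow(2).sum()
    tss = (x - x.mean(dim=0)).pow(2).sum()
    return (1.0 - rss / tss).item()
```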

Limitations

  • The analysis emphasizes controlled synthetic experiments over real-world neural network scenarios.

  • Current evaluation metrics lack a ground-truth comparison for assessing genuine feature discovery.

Keywords

Sparse Autoencoders, neural networks, activations, explained variance, interpretability, sparse probing, causal editing
