Large Language Models

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen, Yuzhang Luo, Liangming Pan
Published: January 29, 2026
Authors: 3
Word Count: 12,705
Code: Includes code

Trace LLM units back to their training data origins.

Abstract

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis of a functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
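The core attribution step can be illustrated on a toy model. The sketch below is not the paper's implementation; it shows the classical influence-function estimate on a small linear regression, where the per-sample influence is the (negative) inner product of a probe-loss gradient with the Hessian-inverse-preconditioned training gradient. All names (`grads_train`, `influences`, the toy data) are hypothetical; the paper applies the same quantity to losses probing interpretable units in Pythia models.

```python
import numpy as np

# Toy setup: linear regression as a stand-in for the model being attributed.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Fit the empirical-risk minimizer by least squares.
w = np.linalg.lstsq(X, y, rcond=None)[0]

# Hessian of the mean-squared-error risk: H = (2/n) X^T X.
H = 2.0 * X.T @ X / n
H_inv = np.linalg.inv(H)

# Probe loss: squared error at one held-out point (in MDA this would be
# a loss that measures the behavior of an interpretable unit).
x_t = rng.normal(size=d)
y_t = float(x_t @ w_true)
grad_probe = 2.0 * (x_t @ w - y_t) * x_t  # gradient of probe loss w.r.t. w

# Per-sample training gradients, stacked as rows.
grads_train = 2.0 * (X @ w - y)[:, None] * X

# Influence of up-weighting sample i on the probe loss:
#   I(z_i) = -grad_probe^T  H^{-1}  grad_train_i
influences = -grads_train @ H_inv @ grad_probe

# Highest-|influence| samples are candidate "training origins".
top5 = np.argsort(-np.abs(influences))[:5]
```

Samples with large positive or negative influence are the ones whose removal or duplication should most change the probed behavior, which is exactly the targeted-intervention test the abstract describes.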

Key Takeaways

  1. Traces interpretable LLM units to specific training samples.

  2. Uses Influence Functions to estimate data contribution.

  3. Potential to revolutionize LLM understanding and control.

Limitations

  • Currently focuses on specific types of interpretable units.

  • Requires significant computational resources for large datasets.

Keywords

Mechanistic Interpretability, Influence Functions, interpretable units, training samples, Pythia family, induction head, in-context learning, mechanistic catalyst, data augmentation pipeline, circuit convergence
