Large Language Models

Rethinking Selective Knowledge Distillation

Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva
Published: February 1, 2026
Authors: 4
Word count: 10,017

Optimize language models with selective knowledge distillation.

Abstract

Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
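The abstract describes selecting a subset of token positions by student entropy and distilling only there. The paper's exact scoring and loss details are not given in this summary; the following is a minimal sketch of the general idea, assuming top-k selection by per-position student entropy and a forward-KL distillation loss restricted to the selected positions (function names and the `keep_fraction` parameter are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def se_kd_position_mask(student_logits, keep_fraction=0.5):
    """Select the token positions where the student's predictive entropy
    is highest. student_logits has shape (seq_len, vocab_size).
    Returns a boolean mask over positions."""
    p = softmax(student_logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)  # per-position entropy
    k = max(1, int(round(keep_fraction * entropy.shape[0])))
    top = np.argsort(entropy)[-k:]  # indices of the k highest-entropy positions
    mask = np.zeros(entropy.shape[0], dtype=bool)
    mask[top] = True
    return mask

def selective_kl_loss(teacher_logits, student_logits, mask):
    """Mean forward KL(teacher || student), computed only at selected positions."""
    pt = softmax(teacher_logits)
    ps = softmax(student_logits)
    kl = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(axis=-1)
    return kl[mask].mean()
```

Because the loss is evaluated only at selected positions, teacher logits need to be stored only for those positions, which is what makes the offline teacher caching mentioned in the abstract cheaper along this axis.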

Key Takeaways

  1. Entropy-guided selective distillation improves model accuracy and efficiency.

  2. SE-KD 3X extends selection across the position, class, and sample axes.

  3. The efficiency gains make it more practical to deploy large language models in resource-constrained environments.

Limitations

  • Requires careful tuning of entropy thresholds for optimal performance.

  • May not generalize well to all types of language models.

Keywords

knowledge distillation, large language models, selective distillation, autoregressive models, student-entropy-guided position selection, dense distillation, offline teacher caching
