
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie

Published: February 17, 2026
Authors: 5
Word Count: 8,664

Randomly skipping half your optimizer updates actually improves language model training dramatically.

Abstract

Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
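The core idea — applying the preconditioned update only on a random subset of coordinates while keeping the optimizer statistics dense — can be sketched as follows. This is an illustrative reconstruction, not the paper's exact algorithm; the hyperparameter names (`p` for the keep probability, `beta`, `eps`) are assumptions.

```python
import numpy as np

def masked_rmsprop_step(param, grad, v, lr=1e-3, beta=0.999,
                        eps=1e-8, p=0.5, rng=None):
    """One masked RMSProp step (illustrative sketch, not the paper's algorithm).

    The second-moment estimate v is updated densely on every step; the
    parameter update itself is applied only where a Bernoulli(p) mask
    keeps it, so roughly a fraction (1 - p) of updates are skipped.
    """
    rng = rng or np.random.default_rng()
    v = beta * v + (1 - beta) * grad ** 2        # dense statistics update
    update = lr * grad / (np.sqrt(v) + eps)      # preconditioned update
    mask = rng.random(param.shape) < p           # keep each coordinate w.p. p
    param = param - mask * update                # apply only unmasked coords
    return param, v
```

Note that the mask costs only a random draw and an element-wise multiply, which is consistent with the abstract's claim of negligible computational overhead.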

Key Takeaways

  1. Randomly masking parameter updates in adaptive optimizers improves large language model training without additional computational cost.

  2. Stochastic masking induces curvature-dependent regularization that naturally steers optimization toward flatter, more generalizable minima.

  3. The Magma optimizer with momentum-gradient alignment achieves a 19% perplexity reduction versus Adam on billion-parameter models.

Limitations

  • Requires careful tuning of masking probability p to balance regularization strength and update frequency.

  • Computational benefits diminish if momentum updates cannot be maintained densely alongside sparse parameter updates.
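The second limitation — momentum must stay dense even though parameter updates are sparse — suggests the following shape for a Magma-style step. The exact modulation rule is defined in the paper; here, as a stand-in assumption, "momentum-gradient alignment" is read as coordinate-wise sign agreement, and the keep probability is nudged up where the two agree. All names and the `±0.25` adjustment are hypothetical.

```python
import numpy as np

def magma_like_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, p=0.5, rng=None):
    """Hypothetical sketch of a momentum-aligned masking step.

    Momentum m and second moment v are maintained densely every step
    (per the limitation above); only the parameter update is masked,
    with a higher keep probability where momentum and gradient align.
    """
    rng = rng or np.random.default_rng()
    m = beta1 * m + (1 - beta1) * grad            # dense momentum update
    v = beta2 * v + (1 - beta2) * grad ** 2       # dense second moment
    aligned = np.sign(m) == np.sign(grad)         # coordinate-wise alignment
    keep_p = np.where(aligned,                    # assumed modulation rule
                      min(1.0, p + 0.25),
                      max(0.0, p - 0.25))
    mask = rng.random(param.shape) < keep_p
    param = param - mask * lr * m / (np.sqrt(v) + eps)
    return param, m, v
```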

Keywords

large language models, dense adaptive optimizers, preconditioners, random masking, RMSProp, curvature-dependent geometric regularization, momentum-gradient alignment, adaptive optimizers, perplexity
