Residual Context Diffusion Language Models

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu
Published: January 30, 2026
Authors: 13
Word count: 8,387
Code: included

Enhances language models by recycling residual context.

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that commits only the most confident tokens at each step and discards the rest, effectively wasting computation. We demonstrate that recycling the computation spent on discarded tokens is beneficial, since these tokens retain contextual information useful for subsequent decoding iterations. Motivated by this, we propose Residual Context Diffusion (RCD), a module that converts the discarded token representations into contextual residuals and injects them back into the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long-CoT reasoning (SDAR) and short-CoT instruction-following (LLaDA) models, and show that a standard dLLM can be efficiently converted to the RCD paradigm with only ~1 billion training tokens. RCD consistently improves frontier dLLMs by 5-10 accuracy points with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and requires up to 4-5x fewer denoising steps at equivalent accuracy.
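The core recycling idea from the abstract can be illustrated with a toy sketch. This is not the paper's implementation; the function name, mixing weight `alpha`, and confidence threshold are hypothetical stand-ins. It only shows the control flow: confident positions are committed, while remasked positions keep their computed representation as a residual for the next denoising step instead of being discarded.

```python
def denoise_step_with_recycling(hidden, residual, confidence,
                                threshold=0.8, alpha=0.5):
    """One toy denoising step: commit confident positions, recycle the rest.

    hidden:     per-position hidden values (stand-in for token representations)
    residual:   contextual residual carried over from the previous step
    confidence: per-position confidence scores in [0, 1]
    alpha:      hypothetical mixing weight for the injected residual
    """
    committed, new_residual = [], []
    for h, r, c in zip(hidden, residual, confidence):
        # Inject the previous step's residual before deciding.
        h_in = h + alpha * r
        if c >= threshold:
            committed.append(h_in)      # decoded: token is kept
            new_residual.append(0.0)    # nothing left to recycle here
        else:
            committed.append(None)      # remasked: token stays undecoded...
            new_residual.append(h_in)   # ...but its computation is recycled
    return committed, new_residual
```

Running repeated steps with this loop, the residual list carries information from low-confidence positions forward, which is the waste the abstract says plain remasking throws away.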

Key Takeaways

  1. RCD recycles computation from discarded tokens.
  2. Uses entropy-based embedding aggregation for reliable signals.
  3. Employs a two-stage training pipeline for efficiency.
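Takeaway 2 mentions entropy-based embedding aggregation. A minimal sketch of one plausible reading, assuming the goal is to down-weight unreliable positions: mix candidate embeddings by their predicted probabilities, then damp the mixture when the distribution's entropy is high. The weighting scheme here is illustrative, not the paper's actual formula.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def aggregate_embedding(probs, embeddings):
    """Probability-weighted mix of candidate embeddings, damped by entropy.

    probs:      predicted probabilities over candidate tokens
    embeddings: one embedding vector (list of floats) per candidate
    """
    dim = len(embeddings[0])
    mixed = [sum(p * e[i] for p, e in zip(probs, embeddings))
             for i in range(dim)]
    max_h = math.log(len(probs))                 # entropy of a uniform distribution
    reliability = 1.0 - entropy(probs) / max_h   # 1 = confident, 0 = uniform
    return [reliability * m for m in mixed]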

Limitations

  • Requires a decoupled training approach.

  • A dynamic adjustment mechanism is crucial for balance.

Keywords

diffusion large language models, autoregressive language models, remasking mechanism, token representations, contextual residuals, denoising steps, decoupled two-stage training, backpropagation, long CoT reasoning, short CoT instruction following, AIME tasks
