
SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
Published: January 23, 2026
Authors: 11
Word count: 9,057
Code: included

SALAD enables efficient, high-sparsity video generation.

Abstract

Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, long input sequences incur high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed to mitigate this cost: training-free sparse attention is constrained by limited sparsity and thus offers only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, which introduces a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and a 1.72× inference speedup while maintaining generation quality comparable to the full-attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
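The mechanism described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the sparsity mask, the `elu(x) + 1` feature map for the linear branch, and the sigmoid scalar gate (here parameterized by a hypothetical weight vector `w_gate`) are all assumptions for illustration; the actual gating and sparse-attention designs are those of SALAD.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, mask):
    # Softmax attention restricted to a binary sparsity mask (True = keep).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def linear_attention(q, k, v):
    # Kernelized linear attention with phi(x) = elu(x) + 1,
    # costing O(n * d^2) instead of the O(n^2 * d) of full attention.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    qp, kp = phi(q), phi(k)
    kv = kp.T @ v                    # (d, d_v) global key-value summary
    z = qp @ kp.sum(axis=0)          # per-query normalizer (always > 0)
    return (qp @ kv) / z[:, None]

def salad_attention(q, k, v, mask, w_gate):
    # Hypothetical fusion: an input-dependent scalar gate in (0, 1) decides
    # how much of the linear branch's global context is added on top of the
    # high-sparsity local branch.
    g = 1.0 / (1.0 + np.exp(-(q @ w_gate).mean()))   # sigmoid gate
    return sparse_attention(q, k, v, mask) + g * linear_attention(q, k, v)
```

For example, with a local band mask of width 1 the sparse branch sees only neighboring tokens, while the gated linear branch restores global interactions at linear cost; this is the intuition behind pairing the two branches rather than pushing the sparse branch alone to higher density.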

Key Takeaways

  1. SALAD achieves 90% sparsity with comparable generation quality.
  2. A parallel linear attention branch captures global interactions.
  3. Offers a 1.72× inference speedup with enhanced video quality.

Limitations

  • Relies on effective capture of global interactions by linear attention.

  • Requires careful tuning of the input-dependent scalar gate.

Keywords

diffusion transformers, video generation, sparse attention, linear attention, input-dependent gating mechanism, inference speedup, parameter-efficient fine-tuning
