Efficient AI

SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
Published: February 13, 2026
Authors: 9
Word Count: 7,327
Code: Includes code

SLA2 cuts video diffusion attention costs 97% while actually improving output quality.

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
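The core idea in the abstract, combining a sparse (exact softmax) branch with a linear-attention branch via a learnable ratio, can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions, not the paper's fused kernel: the positive feature map `phi`, the entry-level boolean mask, and the scalar mixing ratio `alpha` are simplifying assumptions (SLA2 operates at block granularity with learned routing and quantized kernels).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_linear_attention(Q, K, V, keep_mask, alpha):
    """Toy sketch of a sparse + linear combination with a learnable ratio.

    keep_mask: boolean (n, n) array; True entries are routed to the
    sparse (exact softmax) branch, the rest to the linear branch.
    alpha: learnable scalar mixing the two branch outputs.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Sparse branch: exact softmax attention restricted to kept entries.
    sparse_scores = np.where(keep_mask, scores, -np.inf)
    sparse_out = softmax(sparse_scores) @ V
    # Linear branch: a positive feature map makes the cost O(n * d^2)
    # instead of O(n^2 * d), at the price of approximation error.
    phi = lambda x: np.maximum(x, 0) + 1e-6
    num = phi(Q) @ (phi(K).T @ V)
    den = phi(Q) @ phi(K).sum(axis=0, keepdims=True).T
    linear_out = num / den
    # Learnable ratio combines the branches (direct decomposition style).
    return alpha * sparse_out + (1 - alpha) * linear_out
```

With `alpha = 1` and a fully-True mask this reduces to ordinary softmax attention; sparsifying the mask shifts work to the cheap linear branch, which is the source of the reported speedup.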

Key Takeaways

  1. SLA2 reduces video diffusion model attention computation by 97% while improving quality over the original.

  2. The method combines sparse and linear attention mechanisms to handle both high-attention and low-rank components efficiently.

  3. SLA2 fixes mathematical mismatches in the previous SLA approach through learned routing instead of heuristic token assignment.
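The learned routing mentioned in the takeaways can be sketched as a small scoring function over pooled query/key blocks; blocks that score high go to the sparse branch, the rest to linear. This is a hypothetical NumPy sketch: the mean pooling, the weight vector `W`, and the threshold are illustrative assumptions, not the paper's router architecture.

```python
import numpy as np

def route_blocks(Q, K, W, block=2, threshold=0.0):
    """Hypothetical learnable router at block granularity.

    Pools each query/key block, scores every block pair with a learned
    weight vector W, and returns a boolean mask: True -> sparse branch,
    False -> linear branch.
    """
    n, d = Q.shape
    nb = n // block
    q_pool = Q.reshape(nb, block, d).mean(axis=1)  # (nb, d) pooled queries
    k_pool = K.reshape(nb, block, d).mean(axis=1)  # (nb, d) pooled keys
    # Weighted inner product as a simple differentiable routing score.
    scores = (q_pool * W) @ k_pool.T               # (nb, nb)
    return scores > threshold
```

Replacing SLA's magnitude heuristic with a trained scorer like this is what lets the split adapt per input rather than relying on a fixed rule.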

Limitations

  • SLA relied on a heuristic for token routing that was not necessarily optimal across inputs and models.

  • Linear attention introduces approximation error compared to true softmax attention despite computational efficiency gains.

Keywords

sparse-linear attention, diffusion models, attention sparsity, learnable router, quantization-aware fine-tuning, attention error, direct decomposition
