Efficient AI

SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
Published: February 13, 2026
Authors: 9
Word Count: 7,327
Code: Includes code

SLA2 cuts video diffusion attention costs 97% while actually improving output quality.

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
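The core idea in the abstract, combining a sparse (exact softmax) branch with a linear-attention branch via a learnable ratio, can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions, not the paper's fused kernel: the positive feature map `phi`, the entry-level boolean mask, and the scalar mixing ratio `alpha` are simplifying assumptions (SLA2 operates at block granularity with learned routing and quantized kernels).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_linear_attention(Q, K, V, keep_mask, alpha):
    """Toy sketch of a sparse + linear combination with a learnable ratio.

    keep_mask: boolean (n, n) array; True entries are routed to the
    sparse (exact softmax) branch, the rest to the linear branch.
    alpha: learnable scalar mixing the two branch outputs.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Sparse branch: exact softmax attention restricted to kept entries.
    sparse_scores = np.where(keep_mask, scores, -np.inf)
    sparse_out = softmax(sparse_scores) @ V
    # Linear branch: a positive feature map makes the cost O(n * d^2)
    # instead of O(n^2 * d), at the price of approximation error.
    phi = lambda x: np.maximum(x, 0) + 1e-6
    num = phi(Q) @ (phi(K).T @ V)
    den = phi(Q) @ phi(K).sum(axis=0, keepdims=True).T
    linear_out = num / den
    # Learnable ratio combines the branches (direct decomposition style).
    return alpha * sparse_out + (1 - alpha) * linear_out
```

With `alpha = 1` and a fully-True mask this reduces to ordinary softmax attention; sparsifying the mask shifts work to the cheap linear branch, which is the source of the reported speedup.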

Key Takeaways

  1. SLA2 reduces video diffusion model attention computation by 97% while improving quality over the original.

  2. The method combines sparse and linear attention mechanisms to handle both high-attention and low-rank components efficiently.

  3. SLA2 fixes mathematical mismatches in the previous SLA approach through learned routing instead of heuristic token assignment.
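The learned routing mentioned in the takeaways can be sketched as a small scoring function over pooled query/key blocks; blocks that score high go to the sparse branch, the rest to linear. This is a hypothetical NumPy sketch: the mean pooling, the weight vector `W`, and the threshold are illustrative assumptions, not the paper's router architecture.

```python
import numpy as np

def route_blocks(Q, K, W, block=2, threshold=0.0):
    """Hypothetical learnable router at block granularity.

    Pools each query/key block, scores every block pair with a learned
    weight vector W, and returns a boolean mask: True -> sparse branch,
    False -> linear branch.
    """
    n, d = Q.shape
    nb = n // block
    q_pool = Q.reshape(nb, block, d).mean(axis=1)  # (nb, d) pooled queries
    k_pool = K.reshape(nb, block, d).mean(axis=1)  # (nb, d) pooled keys
    # Weighted inner product as a simple differentiable routing score.
    scores = (q_pool * W) @ k_pool.T               # (nb, nb)
    return scores > threshold
```

Replacing SLA's magnitude heuristic with a trained scorer like this is what lets the split adapt per input rather than relying on a fixed rule.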

Limitations

  • SLA relied on a heuristic for token routing that was not necessarily optimal across inputs and models.

  • Linear attention introduces approximation error compared to true softmax attention despite computational efficiency gains.

Keywords

sparse-linear attention, diffusion models, attention sparsity, learnable router, quantization-aware fine-tuning, attention error, direct decomposition
