
SageBwd: A Trainable Low-bit Attention

Jintao Zhang, Marco Chen, Haoxu Wang, Kai Jiang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
Published: March 2, 2026
Authors: 8
Word Count: 5,317
Code: Includes code

SageBwd enables trainable INT8 attention matching full-precision performance through careful backward pass design.

Abstract

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match full-precision attention during pre-training. Through experiments and theoretical analysis, we reach several key conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
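The INT8 quantization at the heart of this approach can be illustrated with a minimal per-block symmetric quantizer. The block size, rounding, and layout below are illustrative assumptions for exposition, not SageBwd's actual kernel design:

```python
import numpy as np

def int8_quantize(x, block=128):
    """Per-block symmetric INT8 quantization (illustrative sketch)."""
    *lead, n = x.shape
    xb = x.reshape(*lead, n // block, block)
    # one FP scale per block: map the block's max magnitude to 127
    scale = np.abs(xb).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # avoid divide-by-zero for all-zero blocks
    q = np.clip(np.rint(xb / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_dequantize(q, scale, shape):
    """Recover an FP32 approximation from INT8 values and per-block scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 256)).astype(np.float32)
q, s = int8_quantize(x)
x_hat = int8_dequantize(q, s, x.shape)
# round-trip error is bounded by half a scale step within each block
max_err = float(np.abs(x - x_hat).max())
```

Because each block stores its own scale, a single outlier inflates quantization error only within its own block rather than across the whole tensor, which is why block-wise scaling is common in low-bit attention kernels.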

Key Takeaways

  1. SageBwd achieves full-precision attention performance during pre-training by keeping the gradient dP in FP16 precision.

  2. The dS tensor in the backward pass is the most vulnerable to quantization error because its magnitude is tiny and shrinks as sequence length grows.

  3. QK-norm stabilization and reduced tokens per step enable SageBwd to match full-precision attention in large-scale pre-training.
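The second takeaway can be checked numerically: softmax rows sum to 1, so the entries of the probability matrix P, and hence of the softmax backward gradient dS = P ⊙ (dP − rowsum(dP ⊙ P)), shrink as sequence length grows, squeezing dS toward (and below) INT8 resolution. A small NumPy sketch with random logits, purely illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # stabilized softmax
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
max_abs_dS = {}
for n in (64, 512, 4096):  # sequence lengths (illustrative choices)
    S = rng.normal(size=(n, n)).astype(np.float32)   # attention logits
    P = softmax(S)                                   # rows sum to 1
    dP = rng.normal(size=(n, n)).astype(np.float32)  # upstream gradient
    # softmax backward: dS = P * (dP - rowsum(dP * P))
    dS = P * (dP - (dP * P).sum(axis=-1, keepdims=True))
    max_abs_dS[n] = float(np.abs(dS).max())
```

The maximum |dS| entry drops as n grows, so a fixed INT8 grid loses more and more of the gradient signal, which is consistent with keeping dP (and the dS path) in FP16.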

Limitations

  • SageBwd exhibited a persistent performance gap relative to full-precision attention during pre-training in prior work.

  • Reducing tokens-per-step to match full-precision performance may impact training efficiency and throughput.
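The QK-norm referenced in the findings bounds attention-logit magnitude by normalizing Q and K before their dot product. The sketch below assumes an RMSNorm-style normalization without learnable scales, which may differ from the exact variant used in the paper:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # normalize each query/key vector over the head dimension
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)
# deliberately outlier-scaled Q/K, as can occur without normalization
q = rng.normal(scale=10.0, size=(8, d)).astype(np.float32)
k = rng.normal(scale=10.0, size=(8, d)).astype(np.float32)

logits_raw = (q @ k.T) / np.sqrt(d)
logits_norm = (rms_norm(q) @ rms_norm(k).T) / np.sqrt(d)
```

After RMS-normalizing both sides, each logit is bounded by sqrt(d) via Cauchy–Schwarz regardless of activation outliers, which is one plausible reason QK-norm helps stability at large tokens per step.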

Keywords

SageAttention, INT8 attention, attention matrix multiplications, fine-tuning performance, full-precision attention, QK-norm, quantization errors, backward-pass score gradient, tokens per step, K-smoothing, Q-smoothing
