
FASA: Frequency-aware Sparse Attention

Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley
Published: February 3, 2026
Authors: 9
Word Count: 10,825
Code: Includes code

Optimize LLMs with frequency-aware sparse attention.

Abstract

The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token-pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short: static methods risk irreversible information loss, while dynamic strategies employ heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head, providing a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, then performs focused attention computation solely on this pruned subset. Because it accesses only a small fraction of the KV cache, FASA drastically lowers memory bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance while keeping only 256 tokens, and achieves a 2.56× speedup using just 18.9% of the cache on AIME24.
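The two-stage procedure the abstract describes (score tokens cheaply using a dominant frequency chunk, then attend fully over only the selected subset) can be sketched roughly as below. This is a minimal, hedged illustration, not the authors' implementation: the function name `fasa_sketch`, the single-head/single-query setting, and the choice of `fc_slice` as a contiguous block of RoPE dimensions are all assumptions made for clarity.

```python
import numpy as np

def fasa_sketch(q, K, V, fc_slice, budget):
    """Illustrative sketch of frequency-chunk-guided token pruning.

    q:        (d,)   current query vector (post-RoPE)
    K:        (n, d) cached key vectors (post-RoPE)
    V:        (n, d) cached value vectors
    fc_slice: slice selecting an assumed "dominant" frequency chunk
    budget:   number of tokens to retain
    """
    # 1) Cheap proxy scores: dot product restricted to the dominant
    #    frequency chunk, i.e. only a few of the d dimensions.
    proxy = K[:, fc_slice] @ q[fc_slice]

    # 2) Keep only the top-`budget` tokens under the proxy score.
    keep = np.argsort(proxy)[-budget:]

    # 3) Focused full-dimension attention over the pruned subset,
    #    touching just `budget` rows of the KV cache.
    d = q.shape[0]
    logits = K[keep] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[keep]
```

With `budget` equal to the full sequence length the sketch reduces exactly to ordinary softmax attention; shrinking the budget trades accuracy for KV-cache reads, which is the bandwidth saving the abstract claims.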

Key Takeaways

  • 1

    FASA dynamically predicts token importance without training.

  • 2

    Reduces memory bandwidth consumption through focused attention on a pruned token subset.

  • 3

    Introduces FASA-M and FASA-C for different scenarios.

Limitations

  • Requires Rotary Positional Encodings for optimal performance.

  • May not generalize well to non-textual data.

Keywords

Large Language Models, Key-Value cache, token pruning, attention sparsity, query-dependent token importance, RoPE, functional sparsity, frequency-chunk level, dominant frequency chunks, token eviction, attention computation, long-context tasks, sequence modeling, CoT reasoning, token-eviction baselines, KV cache memory footprint, computational cost, LongBench-V1, AIME24
