Latest Efficient AI Research Papers

Research on model compression, quantization, efficient inference, and reducing computational costs of AI systems.

40 Papers
Showing 20 of 40 papers

Training-free Latent Inter-Frame Pruning with Attention Recovery

Dennis Menn, Yuedong Yang, Bokun Wang +6 more

Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with At...

video generation models · temporal redundancy · latent patches · Latent Inter-frame Pruning · Attention Recovery · +2 more
Mar 6, 2026 · 11

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Dongwon Kim, Gawon Seo, Jinsung Lee +2 more

World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning rem...

world models · latent representations · tokenizers · action-conditioned world model · planning · +1 more
Mar 5, 2026 · 22

SageBwd: A Trainable Low-bit Attention

Jintao Zhang, Marco Chen, Haoxu Wang +5 more

Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of seven attention matrix multiplications ...

SageAttention · INT8 attention · attention matrix multiplications · fine-tuning performance · full-precision attention · +6 more
Mar 2, 2026 · 15
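The SageBwd summary describes quantizing most attention matrix multiplications to INT8. As a generic illustration of what low-bit matmul involves — not the paper's actual kernel, scale granularity, or backward-pass handling — a per-row symmetric INT8 matrix multiply can be sketched in NumPy:

```python
import numpy as np

def int8_quantize(X):
    """Per-row symmetric quantization to INT8 with a float scale per row."""
    s = np.abs(X).max(axis=-1, keepdims=True) / 127.0
    s = np.where(s == 0, 1.0, s)                      # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(X / s), -127, 127).astype(np.int8)
    return q, s

def int8_matmul(A, B):
    """Quantize both operands, accumulate in INT32, rescale the result."""
    qa, sa = int8_quantize(A)      # per-row scales of A, shape (m, 1)
    qb, sb = int8_quantize(B.T)    # rows of B.T = columns of B, shape (n, 1)
    acc = qa.astype(np.int32) @ qb.T.astype(np.int32)  # (m, n) integer accumulation
    return acc * sa * sb.T         # undo both scales
```

The integer accumulation is where the speedup comes from on hardware with INT8 tensor cores; the float scales keep the result close to the full-precision product.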

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Yongtong Wu, Shaoyuan Chen, Yinmin Zhong +10 more

The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwid...

KV-Cache · prefill engines · decoding engines · RDMA · global scheduler · +2 more
Feb 25, 2026 · 30

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Euisoo Jung, Byunghyun Kim, Hyunjin Kim +2 more

Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. However, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achie...

diffusion models · data parallel strategy · condition-based partitioning · pipeline scheduling · adaptive parallelism switching · +5 more
Feb 25, 2026 · 12

Multi-Vector Index Compression in Any Modality

Hanxiang Qin, Alexander Martin, Rohan Jha +3 more

We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for ...

late interaction · multi-vector retrieval · index compression · attention-guided clustering · sequence resizing · +2 more
Feb 24, 2026 · 20

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phas...

diffusion transformers · tokenization · denoising phase · patch sizes · content complexity · +5 more
Feb 19, 2026 · 11

COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression

Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip +3 more

Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but e...

Transformer models · truncated singular value decomposition · sparse dictionary learning · matrix Procrustes orthogonalization · weight factorization · +5 more
Feb 16, 2026 · 6

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Jintao Zhang, Kai Jiang, Chendong Xiang +5 more

Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rule...

sparse attention · diffusion models · Top-k · Top-p · trainable sparse attention · +4 more
Feb 13, 2026 · 42
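For readers unfamiliar with the two masking rules SpargeAttention2 compares, here is a minimal NumPy sketch of Top-k and Top-p selection over per-row attention scores. Taking the union of the two masks is one plausible reading of "hybrid" and is an assumption here, not necessarily the paper's exact combination rule:

```python
import numpy as np

def topk_mask(scores, k):
    """Keep the k largest scores in each row."""
    idx = np.argsort(scores, axis=-1)[:, -k:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def topp_mask(scores, p):
    """Keep the smallest set of scores whose softmax mass reaches p."""
    order = np.argsort(scores, axis=-1)[:, ::-1]              # descending indices
    sorted_scores = np.take_along_axis(scores, order, axis=-1)
    probs = np.exp(sorted_scores - sorted_scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    csum = np.cumsum(probs, axis=-1)
    keep_sorted = (csum - probs) < p                          # always keeps the top entry
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

def hybrid_mask(scores, k, p):
    """Union of the two rules (an assumed combination, for illustration)."""
    return topk_mask(scores, k) | topp_mask(scores, p)
```

Top-k fixes the budget per row regardless of how peaked the distribution is; Top-p adapts the budget to the attention mass, which is why the two rules behave differently on flat versus peaked rows.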

SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Jintao Zhang, Haoxu Wang, Kai Jiang +6 more

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can...

sparse-linear attention · diffusion models · attention sparsity · learnable router · quantization-aware fine-tuning · +2 more
Feb 13, 2026 · 49

ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression

Ammar Ali, Baher Mohammad, Denis Makhov +3 more

We present ROCKET, a training-free model compression method that achieves state-of-the-art performance compared with factorization, structured sparsification, and dynamic compression baselines. Operating under a global compression budget, ROCKET comprises two key innovations: First, it formulate...

model compression · multi-choice knapsack problem · sparse matrix factorization · dictionary learning · weight sparsification · +5 more
Feb 11, 2026 · 12

NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Hyochan Chong, Dongkyu Kim, Changdong Kim +1 more

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we...

post-training quantization · low-rank binary factorization · alternating direction method of multipliers · binary quantization · sub-1-bit compression · +3 more
Feb 6, 2026 · 5

RelayGen: Intra-Generation Model Switching for Efficient Reasoning

Jiwon Song, Yoongon Kim, Jae-Joon Kim

Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing effi...

large reasoning models · multi-step reasoning trajectories · inference-time scaling · token probability margins · segment-level control · +3 more
Feb 6, 2026 · 10
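The RelayGen summary mentions token probability margins as a signal for segment-level control. A toy sketch of that idea follows — the margin definition (gap between the two most likely next tokens) is standard, but the threshold rule and function names are hypothetical, not RelayGen's actual switching policy:

```python
import numpy as np

def prob_margin(logits):
    """Difficulty proxy: gap between the two most likely next-token probabilities."""
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    top2 = np.sort(p, axis=-1)[..., -2:]   # two largest probabilities, ascending
    return top2[..., 1] - top2[..., 0]

def choose_model(logits, threshold=0.5):
    """Hypothetical switch rule: small margin -> hard segment -> large model."""
    return "large" if prob_margin(logits) < threshold else "small"
```

A confident prediction yields a margin near 1 and can be delegated to a cheap model; a near-tie yields a margin near 0, signaling a segment where the stronger model should take over.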

Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

Yanzheng Xiang, Lan Wei, Yizhen Yao +8 more

Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger fl...

diffusion language model · parallel decoding · token masking · verification schemes · flip-flop oscillations · +5 more
Feb 5, 2026 · 3

OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

Jingze Shi, Zhangyang Peng, Yizhang Zhu +3 more

Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-design...

Mixture-of-Experts · expert specialization · atomic experts · routing complexity · memory access · +4 more
Feb 5, 2026 · 4

RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs

Youngcheon You, Banseok Lee, Minseop Choi +5 more

Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary (±1) layers, but is plagued by pathological feature ...

residual binarization · quantization-aware training · QAT · inter-path adaptation · residual hierarchy · +7 more
Feb 5, 2026 · 5
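Residual binarization itself is a standard construction: the weight matrix is approximated by a stack of scaled ±1 matrices, each fitted to the residual left by the previous one. A minimal NumPy sketch of that baseline (greedy per-path scales; RaBiT's training scheme and inter-path adaptation are not reproduced here):

```python
import numpy as np

def residual_binarize(W, num_paths=2):
    """Approximate W as sum_i a_i * B_i with B_i in {+1, -1} and scalar scales a_i."""
    paths = []
    R = W.copy()
    for _ in range(num_paths):
        B = np.where(R >= 0, 1.0, -1.0)   # binary {+1, -1} path
        a = np.abs(R).mean()              # scale minimizing ||R - a*B||_F for this B
        paths.append((a, B))
        R = R - a * B                     # residual carried to the next binary path
    approx = sum(a * B for a, B in paths)
    return approx, paths
```

Each extra path strictly reduces the Frobenius-norm error (until the residual is zero), which is why stacking a few binary layers can approach higher-bit accuracy while keeping inference matmul-free.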

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Yue Ding, Yiyan Ji, Jungang Li +12 more

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs rema...

Omni-modal Large Language Models · token compression · spatio-temporal video pruning · vision-guided audio selection · differentiable straight-through estimator · +2 more
Feb 4, 2026 · 38

LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Gang Lin, Dongfang Li, Zhuoen Chen +4 more

The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, suc...

long-context large language models · key-value cache · attention heads · top-k selection strategy · HardKuma-based mechanism · +5 more
Feb 4, 2026 · 7

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Junyu Chen, Jungang Li, Jing Xiong +11 more

Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits....

post-training quantization · quantization grid · bit-plane decomposition · scalar coefficients · second-order information · +2 more
Feb 4, 2026 · 6
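Bit-plane decomposition expresses a k-bit integer code as a weighted sum of binary planes; replacing the fixed powers of two with free scalar coefficients is one way to read the "variable grid" in the title, though that reading is an assumption about the paper's formulation. A toy NumPy sketch:

```python
import numpy as np

def bitplanes(q, bits):
    """Decompose integer codes q in [0, 2**bits) into binary planes b_i in {0, 1}."""
    return [((q >> i) & 1).astype(np.float32) for i in range(bits)]

def reconstruct(planes, coeffs):
    """Variable grid: each plane gets its own scalar coefficient instead of 2**i."""
    return sum(c * b for c, b in zip(coeffs, planes))
```

With coefficients `[1, 2, 4]` this recovers the original 3-bit codes exactly; letting the coefficients deviate from powers of two gives a non-uniform grid that can be fitted to the weight distribution.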
Page 1 of 2