Latest Efficient AI Research Papers

Research on model compression, quantization, efficient inference, and reducing computational costs of AI systems.

40 Papers
Showing 20 of 40 papers

Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers

Liangyu Wang, Siqi Zhang, Junjie Wang +7 more

The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are subo...

Large Language Models · matrix-based optimizers · Shampoo · Muon · SOAP · +14 more
Feb 4, 2026 · 8

Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing

Tong Zheng, Chengsong Huang, Runpeng Dai +9 more

Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probi...

parallel thinking · width-depth dynamics · intermediate answers · consensus-based early stopping · deviation-based branch pruning · +3 more
Feb 3, 2026 · 17
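The consensus-based early stopping named in this paper's keywords can be sketched in a few lines. Everything below (the function name, the 0.6 threshold, modeling branches as iterators of intermediate answers) is an illustrative assumption, not the paper's actual mechanism:

```python
from collections import Counter

def consensus_early_stop(branches, threshold=0.6, max_steps=8):
    """Toy sketch: probe each parallel branch for its current intermediate
    answer and halt all branches once a fraction >= `threshold` agree."""
    latest = [None] * len(branches)
    for step in range(max_steps):
        for i, branch in enumerate(branches):
            latest[i] = next(branch, latest[i])  # probe intermediate answer
        answer, votes = Counter(a for a in latest if a is not None).most_common(1)[0]
        if votes / len(branches) >= threshold:
            return answer, step  # consensus reached: stop early
    return Counter(latest).most_common(1)[0][0], max_steps
```

With three branches that disagree at step 0 but converge at step 1, the loop returns after the second probe instead of running all branches to completion.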

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Dongwon Jo, Beomseok Kang, Jiwon Song +1 more

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely o...

attention · token-level sparsification · QKV · Flash Attention · attention speedup · +3 more
Feb 3, 2026 · 11
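Token-level sparsification, as named in the keywords, amounts to scoring keys against a query and keeping only the top-scoring tokens. The sketch below is a minimal stand-in (dot-product scores, a hypothetical `keep` budget), not the paper's interleaved selection scheme:

```python
import math

def select_tokens(query, keys, keep=2):
    """Keep only the `keep` keys with the highest attention scores for this
    query: a toy stand-in for token-level sparsification of the KV set."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    topk = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(topk)  # indices of retained tokens, in original order
```

Attention is then computed only over the retained indices, which is where the speedup comes from at long context lengths.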

POP: Prefill-Only Pruning for Efficient Large Model Inference

Junhui He, Zhihui Fu, Jun Wang +1 more

Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In th...

structured pruning · large language models · vision-language models · prefill stage · decode stage · +5 more
Feb 3, 2026 · 4

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi, Shuo Yang, Yilong Zhao +13 more

Despite rapid progress in autoregressive video diffusion, an emerging system algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceed...

KV cache · autoregressive video diffusion · video spatiotemporal redundancy · Semantic Aware Smoothing · Progressive Residual Quantization · +3 more
Feb 3, 2026 · 29
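The core idea of 2-bit KV-cache quantization is mapping cached floats onto four discrete levels with a shared scale and zero point. The sketch below shows only that bare mechanism; the paper's actual scheme layers smoothing and progressive residual quantization on top, which this toy omits:

```python
def quantize_2bit(values):
    """Toy per-group 2-bit quantization: map floats onto 4 levels {0,1,2,3}
    with a shared scale and zero point (min of the group)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0          # 2 bits -> 4 levels
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo):
    """Reconstruct approximate floats from 2-bit codes."""
    return [lo + c * scale for c in codes]
```

Storing a 2-bit code plus a per-group scale in place of a 16-bit float is what lets the KV cache shrink by roughly 8x, at the cost of quantization error that the paper's extra machinery is designed to control.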

DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis +2 more

Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying Shampoo currently comes at the cost of significant computat...

Shampoo · second-order optimizers · preconditioner blocks · 3D tensors · GPU utilization · +5 more
Feb 2, 2026 · 11

An Empirical Study of World Model Quantization

Zhongqian Fu, Tianyi Zhao, Kai Han +3 more

World models learn an internal representation of environment dynamics, enabling agents to simulate and reason about future states within a compact latent space for tasks such as planning, prediction, and inference. However, running world models incurs heavy computational cost and memory footprint, ...

world models · post-training quantization · DINO-WM · weight-only quantization · joint weight-activation quantization · +6 more
Feb 2, 2026 · 3

SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning

Qifan Yu, Xinyu Ma, Zhijian Zhuo +7 more

Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, ex...

progressive learning · width expansion · training instabilities · activation statistics · loss spikes · +7 more
Feb 2, 2026 · 9

PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

Haopeng Li, Shitong Shao, Wenliang Zhong +4 more

Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value blocks, it suffers from degradation at high sparsity by disc...

diffusion transformers · attention · block sparse attention · sparse attention · piecewise sparse attention · +5 more
Feb 1, 2026 · 3

Grounding and Enhancing Informativeness and Utility in Dataset Distillation

Shaobo Wang, Yantai Yang, Guo Chen +5 more

Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowled...

dataset distillation · knowledge distillation · informativeness · utility · Shapley Value attribution · +2 more
Jan 29, 2026 · 16

ECO: Quantized Training without Full-Precision Master Weights

Mahdi Nikdan, Amir Zandieh, Dan Alistarh +1 more

Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master w...

quantization · Large Language Models · Sparse Mixture of Experts · master weights · gradient updates · +6 more
Jan 29, 2026 · 4

Discovering Hidden Gems in Model Repositories

Jonathan Kahana, Eliahu Horwitz, Yedid Hoshen

Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through ...

multi-armed bandit · sequential halving · model discovery · shared query sets · aggressive elimination
Jan 29, 2026 · 19
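Sequential halving, the bandit algorithm named in the keywords, is simple enough to sketch directly: split the evaluation budget across log2(n) rounds, score the surviving candidates on fresh queries each round, and eliminate the worst half. The `evaluate(model, n)` interface below is a hypothetical stand-in for querying a checkpoint, not the paper's API:

```python
import math

def sequential_halving(models, evaluate, budget=64):
    """Sequential halving over candidate models: each round, score the
    survivors with an equal share of the budget and keep the best half.
    `evaluate(model, n)` returns a mean score over n queries."""
    survivors = list(models)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    per_round = budget // rounds
    while len(survivors) > 1:
        n = max(1, per_round // len(survivors))   # queries per survivor
        scored = sorted(survivors, key=lambda m: evaluate(m, n), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]
    return survivors[0]
```

The aggressive elimination is what makes the search cheap: most of the budget ends up spent on the few candidates that survive the early rounds.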

KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides +1 more

The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to their training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, h...

Hyper-Connections · Manifold-Constrained Hyper-Connections · Birkhoff polytope · Sinkhorn-Knopp algorithm · doubly stochastic matrices · +4 more
Jan 29, 2026 · 5
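The Sinkhorn-Knopp algorithm named in the keywords is the standard way to project a positive matrix toward the Birkhoff polytope (the set of doubly stochastic matrices). A minimal pure-Python sketch of the classic iteration, not the paper's exact projection step:

```python
def sinkhorn_knopp(mat, iters=200):
    """Sinkhorn-Knopp: alternately normalize rows and columns of a positive
    matrix; the iterates converge to a doubly stochastic matrix (every row
    and every column sums to 1)."""
    m = [row[:] for row in mat]
    for _ in range(iters):
        for row in m:                       # row normalization
            s = sum(row)
            row[:] = [x / s for x in row]
        for j in range(len(m[0])):          # column normalization
            s = sum(row[j] for row in m)
            for row in m:
                row[j] /= s
    return m
```

Convergence is geometric for strictly positive inputs, so a modest iteration count already yields row and column sums within numerical tolerance of 1.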

Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Yingfa Chen, Zhen Leng Thai, Zihan Zhou +6 more

Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratc...

Hybrid Transformer architectures · softmax attention blocks · recurrent neural networks · parameter transfer · knowledge distillation · +7 more
Jan 29, 2026 · 9

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

Zhaopeng Qiu, Shuang Yu, Jingqi Zhang +4 more

Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memo...

reinforcement learning · large language models · rollout · attention · KV-cache · +11 more
Jan 26, 2026 · 5

Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Zecheng Tang, Quantong Qiu, Yi Yang +6 more

The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically empl...

standard attention mechanisms · sparse attention · full attention · hybrid attention strategies · attention router · +4 more
Jan 24, 2026 · 28

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Tongcheng Fang, Hanling Zhang, Ruiqi Xie +8 more

Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free sparse attenti...

diffusion transformers · video generation · sparse attention · linear attention · input-dependent gating mechanism · +2 more
Jan 23, 2026 · 11

Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts

Xuan-Phi Nguyen, Shrey Pandit, Austin Xu +2 more

Mixture-of-Experts (MoE) models are typically pre-trained with explicit load-balancing constraints to ensure statistically balanced expert routing. Despite this, we observe that even well-trained MoE models exhibit significantly imbalanced routing. This behavior is arguably natural-and even desirabl...

Mixture-of-Experts · expert routing · expert parallelism · load balancing · dynamic rerouting · +4 more
Jan 23, 2026 · 4
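The "least-loaded" idea in the title can be illustrated with generic least-loaded dispatch: route each unit of work to whichever worker currently holds the least load. This is a hedged sketch of the general pattern, not the paper's actual expert-parallelism algorithm:

```python
import heapq

def least_loaded_dispatch(jobs, n_workers):
    """Send each job (given by its cost) to the worker with the smallest
    current load, tracked in a min-heap of (load, worker) pairs."""
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = []
    for cost in jobs:
        load, w = heapq.heappop(heap)     # least-loaded worker
        assignment.append(w)
        heapq.heappush(heap, (load + cost, w))
    return assignment
```

Under an imbalanced routing distribution, this kind of dynamic assignment keeps per-worker load far more even than statically partitioning experts across workers.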
Page 2 of 2