
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee
Published: February 25, 2026
Authors: 5
Word Count: 7,589
Code: Includes code

Hybrid parallelism framework achieves 2.3× speedup on diffusion models using conditional guidance scheduling.

Abstract

Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. However, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve speedups proportional to the number of GPUs. We therefore propose a hybrid parallelism framework that combines a novel data-parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency while maintaining high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves 2.31× and 2.07× latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.
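To make the first key idea concrete: classifier-free guidance already runs two forward passes per denoising step, one conditioned on the prompt and one unconditional, and combines their noise predictions. Condition-based partitioning exploits this by assigning each path to its own GPU instead of batching both on one device. The sketch below is a minimal toy illustration, not the paper's implementation; `denoise_step` is a hypothetical stand-in for the real U-Net/DiT forward pass, and the guidance scale of 7.5 is only a common default.

```python
import numpy as np

def denoise_step(latent, embedding):
    """Hypothetical stand-in for one U-Net/DiT forward pass.

    A real model predicts noise from the latent and a text embedding;
    here we mix them linearly so the example is self-contained."""
    return 0.9 * latent + 0.1 * embedding

def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

latent = np.ones((4, 8, 8))
null_emb = np.zeros((4, 8, 8))      # unconditional (empty-prompt) embedding
text_emb = np.full((4, 8, 8), 0.5)  # conditional (prompt) embedding

# Under condition-based partitioning, these two calls would run
# concurrently on separate GPUs rather than as one batched call.
eps_uncond = denoise_step(latent, null_emb)  # e.g. GPU 0
eps_cond = denoise_step(latent, text_emb)    # e.g. GPU 1
eps = cfg_combine(eps_uncond, eps_cond)      # combined guided prediction
```

Because the two paths share the same latent but differ only in conditioning, splitting them across devices introduces no patch boundaries, which is why this partitioning avoids the artifacts that spatial-split schemes produce.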

Key Takeaways

  1. Hybrid parallelism combining condition-based data partitioning and adaptive pipeline switching achieves 2.31× speedup on two GPUs.

  2. Exploiting classifier-free guidance's dual paths eliminates patch boundary artifacts while preserving global image consistency.

  3. Denoising discrepancy metric automatically determines optimal switching points between serial and parallel execution phases.
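The third takeaway can be sketched as a simple per-step decision rule: measure how far apart the conditional and unconditional noise predictions are, and switch execution modes once they converge. The metric and threshold below are illustrative assumptions, not the paper's exact formulation; `select_parallelism` and its mode names are hypothetical.

```python
import numpy as np

def denoising_discrepancy(eps_cond, eps_uncond, tiny=1e-8):
    """Relative L2 distance between the conditional and unconditional
    noise predictions (an assumed form of the discrepancy metric)."""
    return float(np.linalg.norm(eps_cond - eps_uncond) /
                 (np.linalg.norm(eps_uncond) + tiny))

def select_parallelism(eps_cond, eps_uncond, threshold=0.05):
    """Pick an execution mode for the next denoising step."""
    if denoising_discrepancy(eps_cond, eps_uncond) > threshold:
        # Paths still disagree: run them on separate GPUs (data parallel).
        return "condition_partitioned"
    # Paths have converged: switch to pipeline parallelism.
    return "pipeline"

# Simulated trajectory: the two paths converge as denoising progresses,
# so the schedule flips from condition partitioning to pipelining.
base = np.ones(16)
modes = [select_parallelism(base * (1.0 + d), base)
         for d in (0.4, 0.2, 0.1, 0.02)]
```

Driving the switch from a measured quantity, rather than a fixed step index, lets the schedule adapt per prompt and per model, which is what makes the switching point "optimal" in the authors' framing.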

Limitations

  • Method requires classifier-free guidance architecture, limiting applicability to some diffusion model variants.

  • Speedup gains tested primarily on two GPUs; scalability to larger GPU clusters remains unclear.

Keywords

diffusion models, data parallel strategy, condition-based partitioning, pipeline scheduling, adaptive parallelism switching, denoising paths, U-Net-based diffusion models, DiT-based flow-matching architectures, inference latency, image quality
