SkyReels-V3 Technique Report

Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou
Published: January 24, 2026
Authors: 21
Word Count: 2,879

Revolutionizing video generation with SkyReels-V3.

Abstract

Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation by training first-and-last frame insertion patterns and reconstructing key-frame inference paradigms. Audio-video synchronization is further optimized while preserving visual quality. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3.
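The abstract does not spell out how minute-level generation is scheduled. One plausible reading of "first-and-last frame insertion" combined with key-frame inference is that a long clip is split into segments that share boundary key frames, and each segment is then filled in conditioned on its first frame, its last frame, and the matching slice of audio. The sketch below illustrates only that bookkeeping. The function names, the 25 fps frame rate, the 16 kHz sample rate, and the 101-frame segment length are all assumptions made for illustration; none of them come from the SkyReels-V3 report or its repository.

```python
from typing import List, Tuple


def plan_keyframe_segments(total_frames: int, segment_len: int) -> List[Tuple[int, int]]:
    """Split frame indices 0..total_frames-1 into (first, last) segments.

    Consecutive segments share their boundary frame, so the last key frame
    of one segment is the first key frame of the next. This is a hypothetical
    scheduling helper, not the authors' implementation.
    """
    if total_frames < 2 or segment_len < 2:
        raise ValueError("need at least two frames per clip and per segment")
    segments = []
    start, last = 0, total_frames - 1
    while start < last:
        end = min(start + segment_len - 1, last)
        segments.append((start, end))
        start = end  # reuse the boundary frame as the next segment's first frame
    return segments


def audio_slice(segment: Tuple[int, int], fps: int, sample_rate: int) -> Tuple[int, int]:
    """Map an inclusive frame segment to a half-open range of audio samples."""
    first, last = segment
    return first * sample_rate // fps, (last + 1) * sample_rate // fps


# One minute of video at an assumed 25 fps, with assumed 101-frame segments.
segments = plan_keyframe_segments(total_frames=1500, segment_len=101)
first_audio = audio_slice(segments[0], fps=25, sample_rate=16000)
```

Sharing the boundary frame between neighbouring segments is the design choice that would keep identity and pose continuous across segment joins; whether SkyReels-V3 does exactly this is not stated in the abstract.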

Key Takeaways

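  • Unified multimodal video generation framework covering reference images-to-video, video extension, and audio-guided generation.

  • Achieves state-of-the-art or near state-of-the-art results on key metrics.

  • Versatile applications in video production and avatars.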

Limitations

  • Requires high-quality inputs and computational resources.

  • Not tested on all real-world scenarios.

Keywords

diffusion Transformers, multimodal in-context learning framework, reference images-to-video synthesis, video-to-video extension, audio-guided video generation, spatio-temporal consistency modeling, large-scale video understanding, talking avatar model, key-frame inference paradigms, cross frame pairing, semantic rewriting, multi-resolution joint optimization, image video hybrid strategy
