SkyReels-V3 Technique Report

Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou

Published: January 24, 2026
Authors: 21
Word Count: 2,879


Revolutionizing video generation with SkyReels-V3.

Abstract

Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation by training first-and-last frame insertion patterns and reconstructing key-frame inference paradigms. Audio-video synchronization is also optimized while preserving visual quality. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3.
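The abstract names cross-frame pairing as one ingredient for mitigating copy-paste artifacts but does not spell out the procedure. One plausible reading, offered here only as a minimal sketch, is that the reference image is drawn from a frame outside the target clip, so the model cannot simply paste the reference into the output. The function name `sample_cross_frame_pair` and the parameters `clip_len` and `min_gap` are hypothetical and are not taken from the paper or its repository.

```python
import random


def sample_cross_frame_pair(num_frames, clip_len, min_gap, rng=None):
    """Pick a target clip window and a reference frame index outside it.

    Hypothetical illustration of cross-frame pairing: the reference frame
    lies at least `min_gap` frames away from the nearest frame of the
    target window. Returns (ref_index, (start, end)) with `end` exclusive,
    or None when the video is too short to satisfy the gap.
    """
    if clip_len < 1 or clip_len > num_frames:
        raise ValueError("clip_len must be between 1 and num_frames")
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    rng = rng or random.Random()

    # Choose where the target clip starts; it occupies [start, end).
    start = rng.randrange(0, num_frames - clip_len + 1)
    end = start + clip_len

    # Reference candidates: frames far enough before or after the window.
    candidates = [
        i for i in range(num_frames)
        if i <= start - min_gap or i >= end - 1 + min_gap
    ]
    if not candidates:
        return None
    return rng.choice(candidates), (start, end)


if __name__ == "__main__":
    pair = sample_cross_frame_pair(num_frames=120, clip_len=49, min_gap=8,
                                   rng=random.Random(0))
    print(pair)
```

In a real pipeline the sampled reference frame would presumably be cropped to the subject and possibly altered by the image editing step the abstract mentions; this sketch covers only the index sampling.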

Key Takeaways

1. Unified multimodal video generation framework.
2. Outperforms existing models on key metrics.
3. Versatile applications in video production and avatars.

Limitations

Requires high-quality inputs and computational resources.
Not tested on all real-world scenarios.

Keywords

diffusion Transformers, multimodal in-context learning framework, reference images-to-video synthesis, video-to-video extension, audio-guided video generation, spatio-temporal consistency modeling, large-scale video understanding, talking avatar model, key-frame inference paradigms, cross frame pairing, semantic rewriting, multi-resolution joint optimization, image video hybrid strategy
