FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang, Tong Jin, Minxuan Lin, Bo Liu, Celong Liu, Chongyang Ma, Xing Mei, Xiaohui Shen, Yaojie Shen, Fuwen Tan, Angtian Wang, Xiao Yang, Yiding Yang, Jiamin Yuan, Lingxi Zhang, Yuxin Zhang
Published
February 2, 2026
Abstract

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: (1) a new video autoencoder with a highly compressed latent space (a 64×64×4 spatiotemporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer-memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, comprising a 14B DIT base model and a 14B DIT upsampler, achieves performance competitive with other popular open-source models while being an order of magnitude faster. We discuss our model design as well as our training strategies in this report.
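As a rough illustration of the compression ratio stated above, the sketch below computes the latent tensor shape implied by 64× spatial and 4× temporal downsampling. The latent channel count and input clip dimensions are hypothetical; the report does not specify them here.

```python
def latent_shape(frames, height, width, spatial=64, temporal=4, latent_channels=16):
    """Latent shape under a spatial x spatial x temporal downsampling ratio.

    latent_channels is an assumed value for illustration only.
    """
    return (frames // temporal, latent_channels, height // spatial, width // spatial)

# Example: a hypothetical 121-frame 720p clip (frames, RGB channels, H, W):
video = (121, 3, 720, 1280)
print(latent_shape(121, 720, 1280))  # -> (30, 16, 11, 20)
```

Such an aggressive downsampling ratio shrinks the token count the diffusion transformer must process by roughly three orders of magnitude per frame, which is consistent with the speedups the abstract claims.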

Keywords

video autoencoder, diffusion transformer, DIT, layer memory design, inter-layer information flow, context reuse, multi-resolution generation, DIT upsampler, image-to-video diffusion
