FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang, Tong Jin, Minxuan Lin, Bo Liu, Celong Liu, Chongyang Ma, Xing Mei, Xiaohui Shen, Yaojie Shen, Fuwen Tan, Angtian Wang, Xiao Yang, Yiding Yang, Jiamin Yuan, Lingxi Zhang, Yuxin Zhang
Published
February 2, 2026
Abstract

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: (1) a new video autoencoder with a highly compressed latent space (a 64×64×4 spatiotemporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer-memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, comprising a 14B DIT base model and a 14B DIT upsampler, achieves performance competitive with other popular open-source models while being an order of magnitude faster. We discuss our model design as well as our training strategies in this report.
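As a rough illustration of the compression ratio stated above, the sketch below computes the latent tensor shape implied by 64× spatial and 4× temporal downsampling. The latent channel count and input clip dimensions are hypothetical; the report does not specify them here.

```python
def latent_shape(frames, height, width, spatial=64, temporal=4, latent_channels=16):
    """Latent shape under a spatial x spatial x temporal downsampling ratio.

    latent_channels is an assumed value for illustration only.
    """
    return (frames // temporal, latent_channels, height // spatial, width // spatial)

# Example: a hypothetical 121-frame 720p clip (frames, RGB channels, H, W):
video = (121, 3, 720, 1280)
print(latent_shape(121, 720, 1280))  # -> (30, 16, 11, 20)
```

Such an aggressive downsampling ratio shrinks the token count the diffusion transformer must process by roughly three orders of magnitude per frame, which is consistent with the speedups the abstract claims.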

Keywords

video autoencoder, diffusion transformer, DIT, layer memory design, inter-layer information flow, context reuse, multi-resolution generation, DIT upsampler, image-to-video diffusion
