MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu

Published: February 9, 2026
Authors: 40
Word Count: 15,625
Code: Includes code


MOVA enables synchronized video-audio generation at scale through joint modeling instead of cascaded pipelines.

Abstract

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.

Key Takeaways

1. MOVA generates synchronized video and audio jointly rather than sequentially, solving cascading quality issues.
2. The team built a rigorous three-stage data pipeline that filtered 100k+ hours to 11k hours of high-quality content.
3. A dual-tower architecture with bridging mechanisms enables two specialized experts to collaborate on simultaneous generation.
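
To put the headline figures in proportion, the short sketch below works out two ratios from numbers quoted on this page: the share of parameters active during inference (18B of 32B) and the share of data that survived filtering (11k of 100k+ hours). The variable and function names are illustrative and not taken from the MOVA codebase. Because the raw corpus is given only as "100k+ hours", the retention figure is an upper bound.

```python
# Figures quoted in the summary; all names here are illustrative only.
TOTAL_PARAMS_B = 32    # total parameters, in billions
ACTIVE_PARAMS_B = 18   # parameters active during inference, in billions
RAW_HOURS = 100_000    # lower bound on raw data ("100k+ hours")
KEPT_HOURS = 11_000    # hours remaining after the three-stage pipeline


def fraction(part: float, whole: float) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Share of the MoE model's parameters used during inference.
active_fraction = fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)

# RAW_HOURS is a lower bound, so the true retention rate is no higher than this.
retention_upper_bound = fraction(KEPT_HOURS, RAW_HOURS)

print(f"Parameters active during inference: {active_fraction:.2%}")
print(f"Data retained after filtering: at most {retention_upper_bound:.2%}")
```

In other words, a little over half of the parameters are exercised during inference, and roughly one hour in nine (or fewer) of the collected footage made it through the filtering pipeline.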

Limitations

Previous cascaded approaches produced misaligned audio-video with lips moving out of sync and sound effects landing incorrectly.
Most open-source systems focused on either video or audio in isolation, lacking joint generation capabilities at scale.

Keywords

Mixture-of-Experts, MoE, audio-visual content, lip-synced speech, sound effects, content-aligned music, IT2VA, efficient inference, LoRA fine-tuning, prompt enhancement
