Speech & Audio AI

VibeVoice-ASR Technical Report

Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao, Weijiang Xu, Yi Zhu, Zehua Wang, Ting Song, Yan Xia, Zewen Chi, Shaohan Huang, Liang Wang, Chuang Ding, Shuai Wang, Xie Chen, Furu Wei
Published: January 26, 2026
Authors: 24
Word Count: 2,916

Unifies long-form speech recognition, speaker diarization, and timestamping in a single model.

Abstract

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice. It is designed to address context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts), challenges that persist despite recent advances in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
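To make the idea of a unified output concrete, the following is a minimal sketch of how a downstream consumer might represent transcript segments that carry speaker, timestamps, and text together, and how user-supplied terms might be joined into a context string. Everything here is an assumption for illustration: the `Segment` fields, the `build_context_prompt` helper, and the "Context:" prefix are hypothetical and are not taken from the report or from any VibeVoice-ASR API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript unit combining diarization, timing, and text.

    Hypothetical structure; the report's actual output format is not
    specified in this summary.
    """
    speaker: str   # diarization label, e.g. "Speaker 1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # recognized words


def build_context_prompt(terms):
    """Join user-supplied domain terms into one context string.

    The "Context: " prefix is an illustrative assumption, not the
    model's real prompt format.
    """
    cleaned = [t.strip() for t in terms if t.strip()]
    return "Context: " + ", ".join(cleaned) if cleaned else ""


def speaking_time(segments):
    """Sum the duration of each speaker's segments, in seconds."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


# Made-up example data, only to exercise the helpers.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the meeting."),
    Segment("Speaker 2", 4.5, 6.0, "Thanks."),
    Segment("Speaker 1", 6.0, 8.0, "Let's begin."),
]
prompt = build_context_prompt(["VibeVoice", " diarization ", ""])
totals = speaking_time(segments)
```

Because each segment already carries its speaker and time span, per-speaker statistics such as `speaking_time` need no separate alignment step between ASR and diarization outputs, which is the practical appeal of treating the three tasks as one.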

Key Takeaways

  • Unifies ASR, diarization, and timestamping in one model.

  • Outperforms strong closed-source models on public benchmarks.

  • Offers prompt-based context injection for better accuracy.

Limitations

  • Struggles with overlapping speech in complex scenarios.

  • Performance may degrade in low-resource languages.

Keywords

speech understanding framework, VibeVoice, Automatic Speech Recognition, Speaker Diarization, Timestamping, end-to-end generation, long-form audio, single-pass processing, multilingual support, code-switching, prompt-based context injection
