VIBEVOICE-ASR Technical Report

Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao, Weijiang Xu, Yi Zhu, Zehua Wang, Ting Song, Yan Xia, Zewen Chi, Shaohan Huang, Liang Wang, Chuang Ding, Shuai Wang, Xie Chen, Furu Wei

Published: January 26, 2026
Authors: 24
Word Count: 2,916

A unified model for single-pass transcription of long-form, multi-speaker audio.

Abstract

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
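
To make the idea of a unified output more concrete, the following is a minimal Python sketch of what downstream code might look like when transcription, speaker labels, and timestamps arrive together in one structure. Everything here is an assumption for illustration: the `Segment` schema, the `build_context_prompt` format, and the sample data are hypothetical and are not taken from the report or from any VibeVoice-ASR API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript unit: who spoke, when, and what (hypothetical schema)."""
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def build_context_prompt(terms):
    """Join user-supplied terms into one context string (hypothetical format)."""
    cleaned = [t.strip() for t in terms if t.strip()]
    return "Context: " + ", ".join(cleaned) if cleaned else ""


def talk_time(segments):
    """Total speaking time per speaker, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def render(segments):
    """Format segments as '[MM:SS-MM:SS] speaker: text', ordered by start time."""
    def mmss(t):
        minutes, seconds = divmod(int(t), 60)
        return f"{minutes:02d}:{seconds:02d}"

    return "\n".join(
        f"[{mmss(s.start)}-{mmss(s.end)}] {s.speaker}: {s.text}"
        for s in sorted(segments, key=lambda s: s.start)
    )


# Hand-written example data, not real model output.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the weekly sync."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, let's start with the roadmap."),
    Segment("Speaker 1", 9.0, 12.0, "Sure."),
]

if __name__ == "__main__":
    print(build_context_prompt(["VibeVoice", "diarization"]))
    print(render(segments))
    print(talk_time(segments))
```

Because each segment already carries its speaker and time span, summaries such as per-speaker talk time need no separate alignment step between a transcript and a diarization result, which is the practical appeal of treating the three tasks as one generation problem.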

Key Takeaways

1. Unifies ASR, diarization, and timestamping in one model.
2. Outperforms strong closed-source models on public benchmarks.
3. Offers prompt-based context injection for better accuracy.

Limitations

Struggles with overlapping speech in complex scenarios.
Performance may degrade in low-resource languages.

Keywords

speech understanding framework, VibeVoice, Automatic Speech Recognition, Speaker Diarization, Timestamping, end-to-end generation, long-form audio, single-pass processing, multilingual support, code-switching, prompt-based context injection
