Latest Speech & Audio AI Research Papers

Research on speech recognition, text-to-speech, audio processing, and voice AI technologies.

18 Papers

NLE: Non-autoregressive LLM-based ASR by Transcript Editing

Avihu Dekel, Samuel Thomas, Takashi Fukada +1 more

While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction....

Tags: autoregressive · non-autoregressive · speech recognition · conditional transcript editing · acoustic embeddings (+7 more)
Mar 9, 2026 · 15
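The abstract's core tradeoff — autoregressive decoding conditions each token on the previous output and so must run sequentially, while non-autoregressive decoding predicts all positions from a fixed context in parallel — can be sketched with a toy scoring function. This is a generic illustration of the AR-vs-NAR contrast, not the NLE method itself; `toy_scores` is a hypothetical stand-in for a model.

```python
# Illustrative AR vs NAR decoding. toy_scores is a dummy stand-in for a
# model's per-position token scores over a 5-token vocabulary.

def toy_scores(position, context):
    """Hypothetical model: favor token (position + len(context)) % 5."""
    best = (position + len(context)) % 5
    return [1.0 if tok == best else 0.0 for tok in range(5)]

def decode_autoregressive(length):
    """Sequential: each step conditions on previously emitted tokens."""
    out = []
    for pos in range(length):
        scores = toy_scores(pos, out)       # depends on out -> no parallelism
        out.append(max(range(5), key=scores.__getitem__))
    return out

def decode_non_autoregressive(length):
    """Parallel: every position is predicted from a fixed context at once."""
    fixed_context = []                      # e.g. a draft transcript to edit
    return [max(range(5), key=toy_scores(pos, fixed_context).__getitem__)
            for pos in range(length)]       # positions are independent
```

Because the NAR loop never reads its own output, all positions can be computed in one batched forward pass, which is the source of the latency win the abstract describes.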

Fish Audio S2 Technical Report

Shijia Liao, Yuxuan Wang, Songting Liu +11 more

We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline c...

Tags: text-to-speech · multi-speaker · multi-turn generation · instruction-following control · natural-language descriptions (+9 more)
Mar 9, 2026 · 13

MAEB: Massive Audio Embedding Benchmark

Adnan El Assadi, Isaac Chung, Chenghao Xiao +15 more

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-...

Tags: audio embedding benchmark · audio-text reasoning · multilingual speech tasks · acoustic understanding · linguistic tasks (+3 more)
Feb 17, 2026 · 17
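Embedding benchmarks like this typically score models by how well nearest-neighbor search over their embeddings recovers gold matches. As a minimal sketch (not MAEB's actual protocol or metrics), a cosine-similarity retrieval accuracy can be computed like this:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieval_accuracy(query_embs, doc_embs, gold):
    """Fraction of queries whose nearest document (by cosine) is the gold match."""
    hits = 0
    for q, g in zip(query_embs, gold):
        nearest = max(range(len(doc_embs)), key=lambda i: cosine(q, doc_embs[i]))
        hits += nearest == g
    return hits / len(query_embs)
```

A benchmark then averages such task scores across many tasks and languages; the abstract's finding that "no single model dominates" means different models top different rows of that score matrix.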

MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

Yitian Gong, Kuangwei Chen, Zhaoye Fei +9 more

Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs intro...

Tags: audio tokenization · causal audio tokenizer · Transformer architecture · end-to-end learning · quantizer (+9 more)
Feb 11, 2026 · 47

Covo-Audio Technical Report

Wenfu Wang, Chenxing Li, Liqiang Zhang +23 more

In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitiv...

Tags: end-to-end LALM · continuous audio inputs · audio outputs · large-scale curated pretraining · targeted post-training (+16 more)
Feb 10, 2026 · 5

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich +5 more

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability, interpretability, and show their practical utility. ...

Tags: Sparse Autoencoders · encoder layers · Whisper · HuBERT · feature steering (+4 more)
Feb 4, 2026 · 28
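A sparse autoencoder of the kind the abstract trains on encoder activations maps a dense activation vector to an overcomplete, mostly-zero code and reconstructs the input from it; interpretability comes from reading individual code dimensions as features. A minimal forward pass, with illustrative (untrained) weights rather than anything from the paper:

```python
# Minimal sparse-autoencoder forward pass over a dense activation vector
# (e.g. from one Whisper encoder layer). Weights here are illustrative.

def relu(xs):
    return [max(0.0, x) for x in xs]

def sae_forward(x, W_enc, b_enc, W_dec):
    """Encode to an overcomplete sparse code, then reconstruct."""
    # code_j = relu(sum_i x_i * W_enc[i][j] + b_enc[j])
    code = relu([sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j]
                 for j in range(len(b_enc))])
    # x_hat_i = sum_j code_j * W_dec[j][i]
    x_hat = [sum(code[j] * W_dec[j][i] for j in range(len(code)))
             for i in range(len(x))]
    recon_loss = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    l1 = sum(code)                      # sparsity penalty on the code
    return code, x_hat, recon_loss, l1
```

Training minimizes `recon_loss + lambda * l1`; the L1 term drives most code dimensions to zero, so each active dimension tends to capture one interpretable feature, which is what enables the "feature steering" mentioned in the tags.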

VoxServe: Streaming-Centric Serving System for Speech Language Models

Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha +4 more

Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving syst...

Tags: Speech Language Models · streaming settings · low latency · high throughput · streamability (+3 more)
Jan 30, 2026 · 6

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

Kai Li, Jintao Cheng, Chang Zeng +5 more

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from ...

Tags: universal sound separation · in-the-wild datasets · co-occurrence of events · semantically consistent synthesis protocol · high-purity single-event segments (+5 more)
Jan 30, 2026 · 4

Qwen3-ASR Technical Report

Xian Shi, Xiong Wang, Zhifang Guo +10 more

In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. B...

Tags: speech recognition models · language identification · non-autoregressive models · forced alignment · timestamp prediction (+5 more)
Jan 29, 2026 · 21

VibeVoice-ASR Technical Report

Zhiliang Peng, Jianwei Yu, Yaoyao Chang +21 more

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in shor...

Tags: speech understanding framework · VibeVoice · Automatic Speech Recognition · Speaker Diarization · Timestamping (+6 more)
Jan 26, 2026 · 14

End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions

Anfeng Xu, Tiantian Feng, Somer Bishop +2 more

Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recogniti...

Tags: Whisper encoder-decoder architecture · end-to-end framework · speaker diarization · speech recognition · serialized output training (+5 more)
Jan 25, 2026 · 5
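The "serialized output training" tag refers to a common trick for joint ASR and diarization: flatten all utterances into one token stream with speaker-change tags, so a single decoder learns both tasks from one target sequence. A sketch of that target construction (tag names like `<adult>` are illustrative, not this paper's exact scheme):

```python
# SOT-style target construction: time-ordered (role, text) segments are
# serialized into one string with change-of-speaker tags, so a single
# encoder-decoder can be trained on transcription and role diarization jointly.

def serialize_targets(segments):
    """segments: list of (role, text) in time order -> one target string."""
    parts = []
    prev_role = None
    for role, text in segments:
        if role != prev_role:
            parts.append(f"<{role}>")   # emit a change-of-speaker token
            prev_role = role
        parts.append(text)
    return " ".join(parts)
```

At inference, the decoder emits the same interleaved format, and splitting on the role tags recovers both the transcript and the speaker-role segmentation in one pass.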

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu +4 more

Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific acce...

Tags: text-to-speech · speaker embeddings · phonological rules · accent control · phoneme shift rate (+3 more)
Jan 20, 2026 · 5
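The tags mention a "phoneme shift rate" metric. The paper's exact definition is not given here, but one plausible formulation — offered purely as an illustration — is the fraction of canonical phonemes realized as a different phone in the synthesized accent:

```python
# Hypothetical phoneme shift rate: fraction of canonical phonemes that
# are realized as a different phone. The naive position-wise alignment
# below is a simplification; a real metric would align sequences of
# different lengths (e.g. via edit distance) before comparing.

def phoneme_shift_rate(canonical, realized):
    """Fraction of pre-aligned canonical phonemes realized differently."""
    if len(canonical) != len(realized):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    shifts = sum(c != r for c, r in zip(canonical, realized))
    return shifts / len(canonical)
```

Such a rate, computed per phonological rule, would let one quantify how strongly a given speaker embedding triggers each accent-specific rule, which is the interaction the title describes.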

PRiSM: Benchmarking Phone Realization in Speech Models

Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim +13 more

Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first ope...

Tags: phone recognition · cross-lingual speech processing · phonetic analysis · intrinsic evaluation · extrinsic evaluation (+10 more)
Jan 20, 2026 · 5

Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition

Warit Sirichotedumrong, Adisai Na-Thalang, Potsawee Manakul +3 more

Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a ...

Tags: FastConformer-Transducer · streaming applications · text normalization · curriculum learning · Thai ASR (+5 more)
Jan 19, 2026 · 11

FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Tanyu Chen, Tairan Chen, Kai Shen +4 more

Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we pr...

Tags: speech tokenizers · neural audio codecs · LLMs · end-to-end spoken dialogue systems · voice cloning (+5 more)
Jan 16, 2026 · 19