A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

Kai Li, Jintao Cheng, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu

Published: January 30, 2026
Authors: 8

Abstract

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws lead models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Using this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that certain open-source models trained on Hive achieve separation accuracy and perceptual quality competitive with the state-of-the-art model SAM-Audio, which was trained on a dataset approximately 500 times larger than Hive. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing the purity of supervision signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://shandaai.github.io/Hive.
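As a rough illustration of how a query-conditioned training example can be synthesized from single-event clips, the sketch below mixes a target clip with an interfering clip at a chosen signal-to-noise ratio and records the target's label as the query. It is a minimal sketch under stated assumptions, not the Hive pipeline: the function names (`rms`, `mix_at_snr`, `make_example`), the clip dictionary layout, and the SNR-based mixing rule are all hypothetical, and the purity mining and semantic-consistency checks described in the abstract are not implemented here.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer level ratio is snr_db, then sum.

    Returns the mixture and the scaled interferer (both lists of floats).
    """
    if len(target) != len(interferer):
        raise ValueError("clips must have equal length")
    t_rms, i_rms = rms(target), rms(interferer)
    if t_rms == 0 or i_rms == 0:
        raise ValueError("clips must be non-silent")
    # Gain that places the interferer snr_db below (or above) the target level.
    gain = t_rms / (i_rms * 10 ** (snr_db / 20))
    scaled = [gain * s for s in interferer]
    mixture = [a + b for a, b in zip(target, scaled)]
    return mixture, scaled


def make_example(clips, snr_db, rng):
    """Build one (mixture, query, target) example from labeled single-event clips.

    Assumption: each clip is a dict with a "label" string and an "audio" list,
    and each clip contains exactly one sound event.
    """
    first = rng.choice(clips)
    # The interferer must carry a different label so the query is unambiguous.
    others = [c for c in clips if c["label"] != first["label"]]
    second = rng.choice(others)
    mixture, _ = mix_at_snr(first["audio"], second["audio"], snr_db)
    return {"mixture": mixture, "query": first["label"], "target": first["audio"]}
```

Because each source clip is assumed to contain a single event, the query label and the target waveform agree by construction, which is the property that weakly labeled in-the-wild recordings lack.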

Keywords

universal sound separation, in-the-wild datasets, co-occurrence of events, semantically consistent synthesis protocol, high-purity single-event segments, Hive dataset, SAM-Audio, zero-shot generalization, auditory foundation models, data efficiency
