Speech & Audio AI

PRiSM: Benchmarking Phone Realization in Speech Models

Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim, Kwanghee Choi, Eunjung Yeo, Ryan Soh-Eun Shim, Hanyu Zhou, Brendon Boldt, Karen Rosero Jacome, Kalvin Chang, Darsh Agrawal, Keer Xu, Chao-Han Huck Yang, Jian Zhu, Shinji Watanabe, David R. Mortensen
arXiv ID
2601.14046
Published
January 20, 2026
Authors
16

Abstract

Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: https://github.com/changelinglab/prism.
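The transcription-based evaluation the abstract refers to is typically scored with phone error rate (PER): the Levenshtein edit distance between the hypothesis and reference phone sequences, normalized by reference length. The sketch below is an illustrative implementation of that standard metric, not PRiSM's actual scorer; the benchmark's own recipes live in the linked repository.

```python
def phone_error_rate(ref: list[str], hyp: list[str]) -> float:
    """PER = edit_distance(ref, hyp) / len(ref).

    Standard dynamic-programming Levenshtein distance over phone
    symbols (substitutions, insertions, deletions each cost 1).
    """
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i reference phones
    for j in range(n + 1):
        dp[0][j] = j  # insert all j hypothesis phones
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # match / substitution
            )
    return dp[m][n] / max(m, 1)


# One substitution ("ae" -> "ah") out of three reference phones
print(phone_error_rate(["k", "ae", "t"], ["k", "ah", "t"]))  # 0.333...
```

PER can exceed 1.0 when the hypothesis contains many insertions, which is one reason benchmarks like PRiSM pair it with representation probes rather than relying on transcription accuracy alone.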

Keywords

phone recognition, cross-lingual speech processing, phonetic analysis, intrinsic evaluation, extrinsic evaluation, transcription-based evaluation, downstream utility, clinical settings, educational settings, multilingual settings, transcription probes, representation probes, language exposure, encoder-CTC models, large audio language models
