Multimodal AI

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos
Published: February 28, 2026
Authors: 12
Word Count: 17,843
Code: Includes code

Unified reward model benchmark for evaluating music generation across text, lyrics, and audio instructions.

Abstract

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment, using CMI-Pref alongside previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. The training data, benchmarks, and reward models are publicly available.
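As a rough illustration of the inference-time scaling mentioned in the abstract, the sketch below implements generic best-of-k filtering: sample k candidates, score each with a reward model, and keep the top scorers. The `generate_candidates` and `reward_model` callables are hypothetical placeholders, not the paper's actual CMI-RM interface.

```python
# Minimal sketch of inference-time scaling via top-k reward filtering.
# `generate_candidates` and `reward_model` are hypothetical placeholders.

def best_of_k(prompt, generate_candidates, reward_model, k=16, keep=1):
    """Sample k candidate clips, score each, and keep the top `keep`."""
    candidates = generate_candidates(prompt, n=k)        # k candidate clips
    scored = [(reward_model(prompt, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest reward first
    return [clip for _, clip in scored[:keep]]
```

Compute cost grows linearly in k, while the quality gain depends on how well the reward model's scores track human preference.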

Key Takeaways

  1. CMI-RewardBench unifies fragmented music evaluation tools into a single benchmark for compositional multimodal instructions.

  2. The CMI-RM model achieves parameter efficiency while handling text, lyrics, and audio inputs simultaneously for music generation.

  3. Position-consistency filtering in dataset creation improved reliability by validating LLM preferences across reversed input orderings, as sketched below.
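The check in takeaway 3 can be pictured with this minimal sketch of position-consistency filtering, assuming a hypothetical `judge` callable that returns "A" or "B" for a candidate pair; the paper's actual prompt format and judge model are not specified here.

```python
# Hedged sketch of position-consistency filtering for pseudo-labeling.
# `judge` is a hypothetical LLM-judge callable returning "A" or "B".

def consistent_preference(judge, instruction, sample_a, sample_b):
    """Query the judge twice with the candidates swapped; accept the
    preference only if it survives the order reversal."""
    first = judge(instruction, sample_a, sample_b)   # original order
    second = judge(instruction, sample_b, sample_a)  # positions reversed
    # A win for "A" in pass one corresponds to a win for "B" in pass two.
    if first == "A" and second == "B":
        return sample_a    # consistent: sample_a preferred
    if first == "B" and second == "A":
        return sample_b    # consistent: sample_b preferred
    return None            # position-biased verdict: discard the pair
```

Filtering this way trades dataset size for label reliability, which matters when pseudo-labels come from a judge with known position bias.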

Limitations

  • Existing reward models struggle with flexible multimodal inputs, with capability gaps persisting even in state-of-the-art multimodal LLMs.

  • Traditional metrics like FAD operate at the distribution level, failing to provide the sample-level signals needed for alignment training (see the sketch below).
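To make the distribution-level limitation concrete, here is a minimal sketch of the standard Fréchet Audio Distance computation, assuming pre-extracted embedding matrices (e.g. from an audio encoder) for the reference and generated sets. The output is one scalar for the whole set, so no individual clip receives its own score.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb, gen_emb):
    """FAD between Gaussian fits of two (n_clips, dim) embedding sets."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)       # matrix square root
    if np.iscomplexobj(covmean):         # drop numerical imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

A sample-level reward model, by contrast, maps each (instruction, clip) pair to its own score, which is what preference training and best-of-k filtering require.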

Keywords

music reward modeling, Compositional Multimodal Instruction, preference dataset, pseudo-labeled samples, human-annotated corpus, unified benchmark, reward models, parameter-efficient, top-k filtering
