Training Data Efficiency in Multimodal Process Reward Models

Jinyuan Li, Chengsong Huang, Langlin Huang, Shaoyang Xu, Haolin Liu, Wenxuan Zhang, Jiaxin Huang
Published: February 4, 2026
Authors: 7
Word Count: 16,724

Optimize multimodal model training with less data.

Abstract

Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in multimodal large language models (MLLMs). Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies data efficiency in MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and show that informative gradient updates depend on two factors: label mixture (the balance of positive and negative steps) and label reliability (the average MC score of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability using existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
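The selection idea in the abstract can be illustrated with a small sketch. This summary does not give the exact BIS formula, so the score below is a hypothetical stand-in, not the authors' definition: it multiplies a mixture term p(1 - p), where p is the fraction of positive steps in a rollout, by a reliability term, the mean MC score of the positive steps. The `Rollout` type, the `threshold` parameter (a step counts as positive when its MC score exceeds it), and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One annotated reasoning trace: one Monte Carlo score in [0, 1] per step."""
    rollout_id: str
    mc_scores: List[float]


def illustrative_score(rollout: Rollout, threshold: float = 0.0) -> float:
    """Hypothetical stand-in for BIS (NOT the paper's formula).

    Combines label mixture and label reliability, the two factors
    named in the abstract, as a simple product.
    """
    scores = rollout.mc_scores
    if not scores:
        return 0.0
    # Assumption: a step is labeled positive when its MC score exceeds the threshold.
    positives = [s for s in scores if s > threshold]
    p_pos = len(positives) / len(scores)
    # Mixture term: largest when positive and negative steps are balanced.
    mixture = p_pos * (1.0 - p_pos)
    # Reliability term: average MC score over the positive steps only.
    reliability = sum(positives) / len(positives) if positives else 0.0
    return mixture * reliability


def select_top_fraction(rollouts: List[Rollout], fraction: float) -> List[Rollout]:
    """Keep the highest-scoring fraction of rollouts (at least one if any exist)."""
    k = max(1, round(len(rollouts) * fraction))
    ranked = sorted(rollouts, key=illustrative_score, reverse=True)
    return ranked[:k]
```

Under this stand-in, a rollout whose steps are all positive (or all negative) scores zero because its mixture term vanishes, and among mixed rollouts those with higher-confidence positive steps rank first. Scoring reuses only the MC annotations already in the corpus, which is consistent with the abstract's claim that selection adds no annotation cost.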

Key Takeaways

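  • BIS-selected subsets reach full-data performance using only 10% of the training data.

  • Introduces the Balanced-Information Score (BIS) for data selection, computed from existing MC signals at the rollout level.

  • Reduces training cost while matching or surpassing full-data performance, with a relative 4.1% improvement over random subsampling.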

Limitations

  • Requires pre-annotated Monte Carlo rollouts.

  • Applicability may vary across different datasets.

Keywords

Multimodal Process Reward Models, Monte Carlo-annotated corpora, VisualProcessBench, Balanced-Information Score, label mixtures, label reliability, gradient updates, data efficiency
