
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu
Published: February 6, 2026
Authors: 8
Word Count: 4,062
Code: Includes code

Scalable, oracle-free method for validating math solutions.

Abstract

Recent progress in reasoning models suggests that generating plausible attempts at research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar for solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM judges, it exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances that the underlying solver itself often fails to solve.
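The core scoring loop described in the abstract can be sketched in a few lines: each candidate solution is inserted as an in-context exemplar when a solver attempts related, verifiable "neighbor" questions, and the candidate's utility is its downstream accuracy on those neighbors. The function names, prompt format, and toy solver below are illustrative assumptions, not the paper's implementation; in practice the solver would be an LLM call and the neighbors would be expert-curated.

```python
def consequence_utility(candidate_solution, neighbors, solver):
    """Score a candidate by its value as an in-context exemplar.

    neighbors: list of (question, known_answer) pairs related to the
    target problem but independently verifiable.
    Returns the fraction of neighbor questions the solver answers
    correctly when the candidate is shown as a worked example.
    """
    correct = 0
    for question, known_answer in neighbors:
        prompt = f"Worked example:\n{candidate_solution}\n\nNow solve:\n{question}"
        if solver(prompt) == known_answer:
            correct += 1
    return correct / len(neighbors)


# Toy deterministic solver: succeeds only when the exemplar mentions the
# method the neighbor question needs (a stand-in for a real LLM solver).
def toy_solver(prompt):
    if "telescoping" in prompt and "sum" in prompt:
        return "n/(n+1)"
    return "unknown"


neighbors = [("Evaluate the sum of 1/(k(k+1)) for k=1..n.", "n/(n+1)")]
good = "Use a telescoping decomposition: 1/(k(k+1)) = 1/k - 1/(k+1)."
bad = "Apply induction directly without simplifying the summand."

# Rank candidates by consequence-based utility, best first.
ranked = sorted([good, bad],
                key=lambda s: consequence_utility(s, neighbors, toy_solver),
                reverse=True)
```

In this sketch the method-bearing candidate lifts the solver's accuracy on the neighbor question and so ranks first, which is exactly the correct-wrong separation the evaluator relies on.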

Key Takeaways

  1. Introduces Consequence-Based Utility for oracle-free math solution evaluation.

  2. Outperforms other oracle-free validators in experiments.

  3. Scalable method for validating machine-generated mathematical solutions.

Limitations

  • Requires a set of related, verifiable questions.

  • Dependent on the performance of the solver model.

Keywords

reasoning models, research-level mathematics, verification, oracle-free evaluator, in-context exemplar, reward models, generative reward models, LLM judges, Acc@1, AUC, solver-evaluator gap
