MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Zecheng Tang, Baibei Ji, Ruoxi Sun, Haitian Wang, WangJie You, Zhang Yijun, Wenpeng Zhu, Ji Qi, Juntao Li, Min Zhang
Published: January 17, 2026
Authors: 10
Word Count: 18,259

Benchmarking RMs for LLM long-term memory management.

Abstract

Existing work increasingly adopts memory-centric mechanisms to process long contexts segment by segment, and effective memory management is one of the key capabilities that lets large language models propagate information across the entire sequence. Leveraging reward models (RMs) to evaluate memory quality automatically and reliably is therefore critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations of 13 cutting-edge RMs indicate a narrowing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
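
To make the segment-wise setup in the abstract concrete, here is a minimal sketch, not the benchmark's own code. It assumes a long input is cut into fixed-size segments and that a memory string is updated once per segment, so the sequence of intermediate memory states is what a reward model would presumably be asked to judge. The names `split_into_segments`, `run_with_memory`, and `keep_capitalized` are hypothetical, and `keep_capitalized` is a toy stand-in for an LLM-driven memory update.

```python
from typing import Callable, List


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Cut a token list into consecutive segments of at most segment_len tokens."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def run_with_memory(
    segments: List[List[str]],
    update_memory: Callable[[str, List[str]], str],
    initial_memory: str = "",
) -> List[str]:
    """Feed segments one at a time and return the memory state after each step."""
    memory = initial_memory
    trajectory = []
    for segment in segments:
        # The only information carried across segments is the memory string.
        memory = update_memory(memory, segment)
        trajectory.append(memory)
    return trajectory


def keep_capitalized(memory: str, segment: List[str]) -> str:
    """Toy update rule (assumption): append capitalized tokens to the memory."""
    kept = [tok for tok in segment if tok[:1].isupper()]
    return " ".join(filter(None, [memory, *kept]))


if __name__ == "__main__":
    tokens = "Alice met Bob in Paris last week".split()
    segments = split_into_segments(tokens, segment_len=3)
    for step, state in enumerate(run_with_memory(segments, keep_capitalized), start=1):
        print(step, state)
```

In this sketch, swapping `keep_capitalized` for a different update rule yields a different memory trajectory over the same input, which is the kind of process a reward model would have to compare.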

Key Takeaways

  1. Newer-generation RMs consistently outperform their predecessors at evaluating memory management, regardless of parameter count.

  2. Open-source models nearly match proprietary models' performance.

  3. Model size doesn't guarantee better evaluation of memory management.

Limitations

  • RMs prefer step-by-step reasoning over parallel processing.

  • Positional bias affects RMs' process-based evaluations.

Keywords

memory-centric mechanisms, long-context comprehension, long-form generation, reward models, MemoryRewardBench, memory management, large language models
