RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai
Published: March 4, 2026
Authors: 9
Word count: 3,825
Code: included

RoboMME benchmarks memory-augmented robotic policies across four cognitive memory types with 770k training timesteps.

Abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings, which limits systematic understanding, fair comparison, and measurement of progress. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

Key Takeaways

  1. RoboMME is a large-scale benchmark evaluating memory in robotic manipulation across four cognitive memory types: temporal, spatial, object, and procedural.

  2. No single memory representation consistently performs best; effectiveness is highly task-dependent, with different strengths across symbolic, perceptual, and recurrent approaches.

  3. Perceptual memory with memory-as-modulator integration achieves the best balance between performance and computational efficiency across diverse manipulation tasks.

Limitations

  • Existing robotic manipulation benchmarks rarely require true history-based reasoning because policies can succeed using only current observations.

  • Prior memory-based approaches use different backbones and evaluation protocols, making systematic comparison and generalization to new situations unclear.

Keywords

vision-language-action models, memory mechanisms, long-horizon tasks, history-dependent scenarios, standardized benchmark, temporal memory, spatial memory, object memory, procedural memory, memory-augmented VLA variants, π0.5 backbone
