
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu
arXiv ID
2601.15224
Published
January 21, 2026

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
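To make the two-stage idea concrete, here is a minimal sketch of what a training-free, structured progress-reasoning prompt might look like: stage one grounds the task by extracting milestones from a demonstration, and stage two scores a partial observation against them, with an explicit unanswerable escape hatch. The `query_vlm` callable, prompt wording, and milestone format are illustrative assumptions, not the paper's released method or code.

```python
import re
from typing import Callable, List, Optional

def estimate_progress(
    query_vlm: Callable[[List[str], str], str],  # hypothetical: (image paths, prompt) -> text reply
    demo_frames: List[str],   # frames demonstrating the full task, start to finish
    current_frame: str,       # partial observation whose progress we want to score
) -> Optional[float]:
    # Stage 1: ground the task. Ask the model to summarize the demonstration
    # as an ordered list of milestones, rather than scoring in one shot.
    milestones = query_vlm(
        demo_frames,
        "These frames demonstrate a task from start to finish. "
        "List the key milestones of the task in order.",
    )

    # Stage 2: reason about progress. Compare the current observation against
    # the milestones and commit to a percentage, allowing 'unanswerable'
    # for observations from which progress cannot be inferred.
    reply = query_vlm(
        [current_frame],
        "Task milestones:\n" + milestones + "\n\n"
        "Given this new frame, which milestones are already complete? "
        "Answer with a single progress percentage from 0 to 100, "
        "or the word 'unanswerable' if progress cannot be inferred.",
    )

    answer = reply.strip().lower()
    if "unanswerable" in answer:
        return None  # mirrors the benchmark's unanswerable cases
    # Pull the first number out of the reply and normalize to [0, 1].
    match = re.search(r"\d+(?:\.\d+)?", answer)
    return float(match.group()) / 100 if match else None
```

Separating the two stages forces the model to commit to a task decomposition before scoring, which is the structure the training-free prompting condition in the abstract is meant to enforce.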

Keywords

Vision-Language Models, progress reasoning, Progress-Bench, ProgressLM-45K, ProgressLM-3B, training-free prompting, training-based approach
