Multimodal AI

P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Yun Luo, Futing Wang, Qianjia Cheng, Fangchen Yu, Haodi Lei, Jianhao Yan, Chenxi Li, Jiacheng Chen, Yufeng Zhao, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Wenxuan Zeng, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
Published: February 10, 2026
Authors: 31
Word Count: 12,682

P1-VL bridges visual perception and reasoning to solve physics olympiad problems at gold-medal level.

Abstract

The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, which enables iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves state-of-the-art performance among open-source models. Our agent-augmented system ranks No.2 overall globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models on STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence, better aligning visual perception with abstract physical laws for machine scientific discovery.
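The abstract describes Curriculum Reinforcement Learning via "progressive difficulty expansion" but gives no implementation details. Below is a minimal, purely illustrative sketch of the idea: a sampler whose difficulty window widens stage by stage, so early post-training sees only easy problems and later stages admit harder ones. The stage count, thresholds, and function names here are assumptions for illustration, not the paper's actual training recipe.

```python
import random

def curriculum_batches(problems, stages=3, batch_size=4, seed=0):
    """Yield (stage, batch) pairs with a difficulty window that expands per stage.

    `problems` is a list of (problem, difficulty) pairs, difficulty in [0, 1].
    Stage k samples only problems with difficulty <= (k + 1) / stages, so the
    pool grows monotonically. All hyperparameters are illustrative placeholders.
    """
    rng = random.Random(seed)
    for stage in range(stages):
        cap = (stage + 1) / stages          # expanding difficulty ceiling
        pool = [p for p, d in problems if d <= cap]
        rng.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield stage, pool[i:i + batch_size]

# Toy usage: six problems tagged with difficulty scores.
probs = [(f"p{i}", i / 5) for i in range(6)]
schedule = list(curriculum_batches(probs))
```

In an RL post-training loop, each yielded batch would feed a policy-gradient update; the intuition is that restricting early updates to solvable problems keeps the reward signal dense and the training stable.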

Key Takeaways

  1. Physics problem-solving requires visual-linguistic integration because diagrams contain essential information absent from text alone.

  2. P1-VL's 235B model achieved 12 gold medals on HiPhO, becoming the leading open-source vision-language model for physics.

  3. Curriculum reinforcement learning with progressive difficulty and agentic augmentation enable advanced multimodal reasoning for complex scientific tasks.
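The "agentic augmentation" in takeaway 3 and the abstract is described only as iterative self-verification at inference. The sketch below shows the generic shape of such a loop under stated assumptions: a `model` callable that can revise its answer given feedback, and a `verifier` callable that checks a candidate solution. Both interfaces are hypothetical; the paper's actual agent design is not specified in this excerpt.

```python
def solve_with_self_verification(model, verifier, problem, max_rounds=3):
    """Generate-verify-revise loop (an illustrative sketch, not the paper's agent).

    `model(problem, feedback)` returns a candidate solution (feedback is None on
    the first round); `verifier(problem, solution)` returns (ok, feedback).
    The loop revises the answer with the verifier's feedback until it passes
    or the round budget is exhausted, then returns the last candidate.
    """
    feedback = None
    solution = None
    for _ in range(max_rounds):
        solution = model(problem, feedback)
        ok, feedback = verifier(problem, solution)
        if ok:
            break
    return solution
```

A design note: capping `max_rounds` bounds inference cost, and returning the last candidate (rather than failing) degrades gracefully when verification never succeeds.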

Limitations

  • Previous physics AI approaches treated diagrams as secondary rather than foundational to problem-solving.

  • The available text cuts off before fully explaining the Markov Decision Process formulation and the complete methodology.

Keywords

vision-language models, curriculum reinforcement learning, agentic augmentation, multimodal perception, scientific reasoning, physical consistency, HiPhO benchmark, P1-VL-235B-A22B, Gemini-3-Pro
