PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

Published: January 29, 2026
Authors: 15
Word Count: 11,324

Robust document parsing for real-world, distorted documents.

Abstract

We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR

Key Takeaways

1. Handles complex, real-world document distortions effectively.
2. Introduces new tasks like seal recognition and text spotting.
3. Achieves state-of-the-art accuracy on benchmark tests.

Limitations

Relies on high-quality annotations for training.
Extreme distortions may still pose challenges.

Keywords

Vision-Language Model, OmniDocBench, Real5-OmniDocBench, seal recognition, text spotting
