Latest Computer Vision Research Papers

Research on image recognition, object detection, image segmentation, and visual understanding using deep learning techniques.


PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang +12 more

We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-Om...

Vision-Language Model, OmniDocBench, Real5-OmniDocBench, seal recognition, text spotting
Jan 29, 2026

Masked Depth Modeling for Spatial Perception

Bin Tan, Changjiang Sun, Xiage Qin +8 more

Spatial visual perception is a fundamental requirement in physical-world applications like autonomous driving and robotic manipulation, driven by the need to interact with 3D environments. Capturing pixel-aligned metric depth using RGB-D cameras would be the most viable way, yet it usually faces obs...

depth completion, masked depth modeling, visual context, automated data curation, RGB-D cameras +2 more
Jan 25, 2026
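The masked-depth idea described above — hide regions of a depth map and train a network to inpaint them from the surrounding visual context — can be sketched as a toy masked-reconstruction loop. This is an illustrative reading of the general technique, not the paper's architecture; `TinyDepthCompleter` and all hyperparameters are hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mask_depth_patches(depth, patch=16, mask_ratio=0.6):
    """Randomly zero out square patches of a depth map (B, 1, H, W)."""
    B, _, H, W = depth.shape
    gh, gw = H // patch, W // patch
    keep = torch.rand(B, gh, gw, device=depth.device) > mask_ratio  # True = visible
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)  # (B, H, W)
    return depth * mask.unsqueeze(1), mask

class TinyDepthCompleter(nn.Module):
    """Toy conv net that inpaints masked depth conditioned on the RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, rgb, masked_depth):
        return self.net(torch.cat([rgb, masked_depth], dim=1))

rgb = torch.rand(2, 3, 64, 64)
depth = torch.rand(2, 1, 64, 64)
masked, mask = mask_depth_patches(depth)
pred = TinyDepthCompleter()(rgb, masked)
# Reconstruction loss is applied only on the hidden (masked-out) regions.
hidden = ~mask.unsqueeze(1).expand_as(depth)
loss = ((pred - depth) ** 2)[hidden].mean()
```

The key design choice in this family of methods is supervising only the masked regions, which forces the model to rely on visual context rather than copying visible depth values.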

UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders

Matthew Walmer, Saksham Suri, Anirud Aggarwal +1 more

The space of task-agnostic feature upsampling has emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-r...

feature upsampling, visual backbones, cross-attention, iterative upsampling, pixel-dense feature upsamplers +4 more
Jan 25, 2026
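The core mechanism gestured at here — mapping low-resolution backbone features to high-resolution ones via cross-attention — can be sketched minimally: pixel-level queries from the high-res image attend over the low-res feature tokens. This is a generic illustration, not UPLiFT itself (in particular, it uses global attention rather than the paper's local attenders, and the class name is hypothetical):

```python
import torch
import torch.nn as nn

class CrossAttentionUpsampler(nn.Module):
    """Queries from a high-res guidance image attend over low-res backbone features."""
    def __init__(self, feat_dim=64, guide_dim=3, heads=4):
        super().__init__()
        self.q_proj = nn.Linear(guide_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, guide, feats):
        # guide: (B, guide_dim, H, W) image; feats: (B, feat_dim, h, w) backbone output
        B, _, H, W = guide.shape
        q = self.q_proj(guide.flatten(2).transpose(1, 2))  # (B, H*W, feat_dim)
        kv = feats.flatten(2).transpose(1, 2)              # (B, h*w, feat_dim)
        up, _ = self.attn(q, kv, kv)                       # (B, H*W, feat_dim)
        return up.transpose(1, 2).reshape(B, -1, H, W)     # (B, feat_dim, H, W)

guide = torch.rand(1, 3, 32, 32)
feats = torch.rand(1, 64, 8, 8)
dense = CrossAttentionUpsampler()(guide, feats)  # pixel-dense features at image resolution
```

Restricting each query to a local window of feature tokens (as the title's "local attenders" suggests) would cut the quadratic attention cost, which is what makes such upsamplers cheap relative to running the backbone at full resolution.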

Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Hongyuan Chen, Xingyu Chen, Youjia Zhang +2 more

We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited t...

4D dynamic objects, monocular video, 3D reference mesh, canonical reference mesh, motion latent representation +3 more
Jan 20, 2026

VideoMaMa: Mask-Guided Video Matting via Generative Prior

Sangbeom Lim, Seoung Wug Oh, Jiahui Huang +3 more

Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffu...

video matting, video diffusion models, pseudo-labeling, Matting Anything in Video, SAM2 +3 more
Jan 20, 2026

Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Matthew Gwilliam, Xiao Wang, Xuefeng Hu +1 more

Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to r...

contrastive learning, image representation learning, recognition, generation, hyper-network +5 more
Jan 20, 2026

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Shuai Tan, Biao Gong, Ke Ma +5 more

Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel in single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial mi...

Unbind-Rebind framework, pose shift encoder, stochastic perturbations, location-agnostic motion representation, semantic guidance +3 more
Jan 16, 2026

Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation

Chongcong Jiang, Tianxingjian Ding, Chuhan Song +7 more

Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept-based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial pr...

foundation model, prompt-driven segmentation, medical image segmentation, SAM3, fine-tuning +7 more
Jan 15, 2026

VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation

Sicheng Yang, Zhaohu Xing, Lei Zhu

Consistency learning with feature perturbation is a widely used strategy in semi-supervised medical image segmentation. However, many existing perturbation methods rely on dropout, and thus require a careful manual tuning of the dropout rate, which is a sensitive hyperparameter and often difficult t...

vector quantization, feature perturbation, consistency learning, semi-supervised medical image segmentation, dropout +4 more
Jan 15, 2026
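The perturbation strategy this abstract contrasts with dropout — quantizing features against a codebook so the perturbation needs no tunable rate — can be sketched in a few lines. A minimal illustration of nearest-neighbor vector quantization as a feature perturbation, not the VQ-Seg method itself (the codebook here is random; in a real model it would be learned):

```python
import torch
import torch.nn.functional as F

def vq_perturb(feats, codebook):
    """Replace each feature vector with its nearest codebook entry.

    feats: (B, N, D) flattened features; codebook: (K, D). Returns (B, N, D).
    """
    d = torch.cdist(feats, codebook.unsqueeze(0).expand(feats.size(0), -1, -1))
    idx = d.argmin(dim=-1)          # (B, N) index of nearest code per vector
    return codebook[idx]            # quantized (perturbed) features, same shape

torch.manual_seed(0)
feats = torch.randn(2, 16, 8)       # e.g. flattened decoder features
codebook = torch.randn(32, 8)       # learnable in a real model
quant = vq_perturb(feats, codebook)
# Consistency learning then encourages predictions from the original and the
# quantized features to agree; here a plain MSE stands in for that loss.
consistency = F.mse_loss(quant, feats)
```

Because the perturbation strength is determined by the codebook resolution rather than a dropout rate, there is no sensitive rate hyperparameter to tune by hand, which is the motivation the abstract states.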

Alterbute: Editing Intrinsic Attributes of Objects in Images

Tal Reiss, Daniel Winter, Matan Cohen +4 more

We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors th...

diffusion-based method, intrinsic attributes, object editing, identity preservation, visual named entities +5 more
Jan 15, 2026