Latest Computer Vision Research Papers

Research on image recognition, object detection, image segmentation, and visual understanding using deep learning techniques.


PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang +12 more

We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-Om...

Vision-Language Model, OmniDocBench, Real5-OmniDocBench, seal recognition, text spotting
Jan 29, 2026

Masked Depth Modeling for Spatial Perception

Bin Tan, Changjiang Sun, Xiage Qin +8 more

Spatial visual perception is a fundamental requirement in physical-world applications like autonomous driving and robotic manipulation, driven by the need to interact with 3D environments. Capturing pixel-aligned metric depth using RGB-D cameras would be the most viable way, yet it usually faces obs...

depth completion, masked depth modeling, visual context, automated data curation, RGB-D cameras +2 more
Jan 25, 2026
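The masked-depth idea described above — hide regions of a depth map and train a network to inpaint them from the surrounding visual context — can be sketched as a toy masked-reconstruction loop. This is an illustrative reading of the general technique, not the paper's architecture; `TinyDepthCompleter` and all hyperparameters are hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mask_depth_patches(depth, patch=16, mask_ratio=0.6):
    """Randomly zero out square patches of a depth map (B, 1, H, W)."""
    B, _, H, W = depth.shape
    gh, gw = H // patch, W // patch
    keep = torch.rand(B, gh, gw, device=depth.device) > mask_ratio  # True = visible
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)  # (B, H, W)
    return depth * mask.unsqueeze(1), mask

class TinyDepthCompleter(nn.Module):
    """Toy conv net that inpaints masked depth conditioned on the RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, rgb, masked_depth):
        return self.net(torch.cat([rgb, masked_depth], dim=1))

rgb = torch.rand(2, 3, 64, 64)
depth = torch.rand(2, 1, 64, 64)
masked, mask = mask_depth_patches(depth)
pred = TinyDepthCompleter()(rgb, masked)
# Reconstruction loss is applied only on the hidden (masked-out) regions.
hidden = ~mask.unsqueeze(1).expand_as(depth)
loss = ((pred - depth) ** 2)[hidden].mean()
```

The key design choice in this family of methods is supervising only the masked regions, which forces the model to rely on visual context rather than copying visible depth values.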

UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders

Matthew Walmer, Saksham Suri, Anirud Aggarwal +1 more

The space of task-agnostic feature upsampling has emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-r...

feature upsampling, visual backbones, cross-attention, iterative upsampling, pixel-dense feature upsamplers +4 more
Jan 25, 2026
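The core mechanism gestured at here — mapping low-resolution backbone features to high-resolution ones via cross-attention — can be sketched minimally: pixel-level queries from the high-res image attend over the low-res feature tokens. This is a generic illustration, not UPLiFT itself (in particular, it uses global attention rather than the paper's local attenders, and the class name is hypothetical):

```python
import torch
import torch.nn as nn

class CrossAttentionUpsampler(nn.Module):
    """Queries from a high-res guidance image attend over low-res backbone features."""
    def __init__(self, feat_dim=64, guide_dim=3, heads=4):
        super().__init__()
        self.q_proj = nn.Linear(guide_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, guide, feats):
        # guide: (B, guide_dim, H, W) image; feats: (B, feat_dim, h, w) backbone output
        B, _, H, W = guide.shape
        q = self.q_proj(guide.flatten(2).transpose(1, 2))  # (B, H*W, feat_dim)
        kv = feats.flatten(2).transpose(1, 2)              # (B, h*w, feat_dim)
        up, _ = self.attn(q, kv, kv)                       # (B, H*W, feat_dim)
        return up.transpose(1, 2).reshape(B, -1, H, W)     # (B, feat_dim, H, W)

guide = torch.rand(1, 3, 32, 32)
feats = torch.rand(1, 64, 8, 8)
dense = CrossAttentionUpsampler()(guide, feats)  # pixel-dense features at image resolution
```

Restricting each query to a local window of feature tokens (as the title's "local attenders" suggests) would cut the quadratic attention cost, which is what makes such upsamplers cheap relative to running the backbone at full resolution.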

Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Hongyuan Chen, Xingyu Chen, Youjia Zhang +2 more

We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited t...

4D dynamic objects, monocular video, 3D reference mesh, canonical reference mesh, motion latent representation +3 more
Jan 20, 2026

VideoMaMa: Mask-Guided Video Matting via Generative Prior

Sangbeom Lim, Seoung Wug Oh, Jiahui Huang +3 more

Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffu...

video matting, video diffusion models, pseudo-labeling, Matting Anything in Video, SAM2 +3 more
Jan 20, 2026

Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Matthew Gwilliam, Xiao Wang, Xuefeng Hu +1 more

Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to r...

contrastive learning, image representation learning, recognition, generation, hyper-network +5 more
Jan 20, 2026

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Shuai Tan, Biao Gong, Ke Ma +5 more

Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel in single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial mi...

Unbind-Rebind framework, pose shift encoder, stochastic perturbations, location-agnostic motion representation, semantic guidance +3 more
Jan 16, 2026

Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation

Chongcong Jiang, Tianxingjian Ding, Chuhan Song +7 more

Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept-based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial pr...

foundation model, prompt-driven segmentation, medical image segmentation, SAM3, fine-tuning +7 more
Jan 15, 2026

VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation

Sicheng Yang, Zhaohu Xing, Lei Zhu

Consistency learning with feature perturbation is a widely used strategy in semi-supervised medical image segmentation. However, many existing perturbation methods rely on dropout, and thus require a careful manual tuning of the dropout rate, which is a sensitive hyperparameter and often difficult t...

vector quantization, feature perturbation, consistency learning, semi-supervised medical image segmentation, dropout +4 more
Jan 15, 2026
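The perturbation strategy this abstract contrasts with dropout — quantizing features against a codebook so the perturbation needs no tunable rate — can be sketched in a few lines. A minimal illustration of nearest-neighbor vector quantization as a feature perturbation, not the VQ-Seg method itself (the codebook here is random; in a real model it would be learned):

```python
import torch
import torch.nn.functional as F

def vq_perturb(feats, codebook):
    """Replace each feature vector with its nearest codebook entry.

    feats: (B, N, D) flattened features; codebook: (K, D). Returns (B, N, D).
    """
    d = torch.cdist(feats, codebook.unsqueeze(0).expand(feats.size(0), -1, -1))
    idx = d.argmin(dim=-1)          # (B, N) index of nearest code per vector
    return codebook[idx]            # quantized (perturbed) features, same shape

torch.manual_seed(0)
feats = torch.randn(2, 16, 8)       # e.g. flattened decoder features
codebook = torch.randn(32, 8)       # learnable in a real model
quant = vq_perturb(feats, codebook)
# Consistency learning then encourages predictions from the original and the
# quantized features to agree; here a plain MSE stands in for that loss.
consistency = F.mse_loss(quant, feats)
```

Because the perturbation strength is determined by the codebook resolution rather than a dropout rate, there is no sensitive rate hyperparameter to tune by hand, which is the motivation the abstract states.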

Alterbute: Editing Intrinsic Attributes of Objects in Images

Tal Reiss, Daniel Winter, Matan Cohen +4 more

We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors th...

diffusion-based method, intrinsic attributes, object editing, identity preservation, visual named entities +5 more
Jan 15, 2026