Latest Computer Vision Research Papers

Research on image recognition, object detection, image segmentation, and visual understanding using deep learning techniques.

37 Papers
Showing 20 of 37 papers

CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

Yucheng Wang, Zedong Wang, Yuetong Wu +2 more

Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioni...

diffusion models · ControlNet · OmniControl · latent-attention router · sparse top-K selection · +6 more
Mar 9, 2026 · 30
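
The "sparse top-K selection" keyword refers to the standard mixture-of-experts gating pattern: score all experts, keep only the k highest-scoring ones, and mix their outputs. A minimal sketch of that generic pattern (illustrative only, not CARE-Edit's actual router; all names here are made up):

```python
import numpy as np

def top_k_route(logits, k=2):
    """Generic sparse top-K gating: keep the k largest router logits,
    renormalize them with a softmax, and zero out all other experts."""
    idx = np.argsort(logits)[::-1][:k]          # indices of the top-k experts
    weights = np.zeros_like(logits, dtype=float)
    top = np.exp(logits[idx] - logits[idx].max())
    weights[idx] = top / top.sum()              # softmax over the selected k
    return weights

def route_token(token, experts, logits, k=2):
    """Mix the outputs of the selected experts, weighted by the gate."""
    w = top_k_route(logits, k)
    return sum(wi * e(token) for wi, e in zip(w, experts) if wi > 0)

# Toy example: four "experts" acting on a scalar feature.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x ** 2]
logits = np.array([0.1, 2.0, -1.0, 1.5])        # router scores for one token
out = route_token(3.0, experts, logits, k=2)    # only experts 1 and 3 fire
```

Because the gate is sparse, only k expert forward passes run per token, which is what makes routing cheaper than densely combining every conditioning branch.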

Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

Yehonatan Elisha, Oren Barkan, Noam Koenigstein

Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fi...

Vision Transformers · distribution shifts · spurious correlations · regularization methods · foreground-background masks · +8 more
Mar 9, 2026 · 11

ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Zihao Huang, Tianqi Liu, Zhaoxi Chen +7 more

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation a...

video diffusion models · 4D reconstruction · monocular video · optical flow · part segmentation · +5 more
Mar 4, 2026 · 20

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur +5 more

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architectu...

feedforward geometric foundation models · quadratic attention complexity · recurrent designs · dense 3D reconstruction · video streams · +8 more
Mar 3, 2026 · 44

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

Yichen Liu, Donghao Zhou, Jie Wang +9 more

Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-...

reference-based inpainting · high-fidelity preservation · Shared Enhancement Attention · Detail-Aware Loss · high-frequency maps · +1 more
Mar 2, 2026 · 26
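
The "Detail-Aware Loss" and "high-frequency maps" keywords suggest an objective that up-weights reconstruction error where the target has fine detail. A hedged sketch of that general idea, using a simple Laplacian response as the high-frequency map (this is an illustration of the concept, not the paper's exact loss):

```python
import numpy as np

def laplacian(img):
    """Simple 4-neighbour Laplacian as a stand-in high-frequency map."""
    pad = np.pad(img, 1, mode="edge")
    return (pad[:-2, 1:-1] + pad[2:, 1:-1]
            + pad[1:-1, :-2] + pad[1:-1, 2:] - 4 * img)

def detail_weighted_loss(pred, target):
    """Detail-aware objective sketch: weight per-pixel error by the
    target's high-frequency response, so fine detail (edges, textures)
    costs more to get wrong than flat regions do."""
    w = 1.0 + np.abs(laplacian(target))
    return float(np.mean(w * np.abs(pred - target)))

# Toy target with a vertical edge: the same 0.5 error is penalized more
# when it lands on the edge than in a flat region.
target = np.zeros((8, 8)); target[:, 4:] = 1.0
pred_flat = target.copy(); pred_flat[0, 0] += 0.5   # error in a flat area
pred_edge = target.copy(); pred_edge[0, 4] += 0.5   # error on the edge
```
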

VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

Yang Cao, Feize Wu, Dave Zhenyu Chen +3 more

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-...

Visual Geometry Grounded Transformer · multi-view indoor 3D object detection · sensor-geometry-free · attention-guided query generation · query-driven feature aggregation · +4 more
Mar 1, 2026 · 29

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Arnas Uselis, Andrea Dittadi, Seong Joon Oh

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of...

compositional generalization · linear representation hypothesis · orthogonal per-concept factors · linear factorization · neural representations · +4 more
Feb 27, 2026 · 14
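
The "orthogonal per-concept factors" and "linear factorization" keywords describe embeddings that decompose as a sum of components along orthogonal concept directions, so each concept can be read out independently. A toy sketch of why that property enables novel combinations to be decoded (all directions here are synthetic; this is not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two concept axes (say, "shape" and "color"), orthonormalized via QR so
# the per-concept factors do not interfere with one another.
d = 16
q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
shape_dir, color_dir = q[:, 0], q[:, 1]

def embed(shape_val, color_val):
    """Linear factorization: the embedding is a sum of per-concept
    components along orthogonal directions."""
    return shape_val * shape_dir + color_val * color_dir

def read_out(z, direction):
    """With orthogonal factors, a dot product recovers one concept's
    value without interference from the other."""
    return float(z @ direction)

z = embed(2.0, -3.0)   # a combination never constructed before
```

Because `shape_dir @ color_dir == 0`, the readout of one factor is exact regardless of the other, which is the mechanism behind generalizing to unseen concept combinations.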

VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein, Ruilong Li, Sérgio Agostinho +4 more

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-l...

3D reconstruction · feed-forward methods · computational requirements · memory requirements · Key-Value space representation · +7 more
Feb 26, 2026 · 13

From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

Xiangyan Qu, Zhenlong Yuan, Jing Tang +9 more

Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image ...

image generation · text-to-image generation · image editing · test-time scaling · diffusion models · +10 more
Feb 24, 2026 · 117

A Very Big Video Reasoning Suite

Maijunxian Wang, Ruisi Wang, Juyi Lin +53 more

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiote...

video reasoning · spatiotemporal consistency · emergent generalization · video reasoning benchmark · video reasoning dataset
Feb 23, 2026 · 308

Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Shannan Yan, Leqi Zheng, Keyu Lv +7 more

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an obje...

conditional binary segmentation · view-invariant representations · cycle-consistency training · test-time training · object correspondence · +1 more
Feb 22, 2026 · 14
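
The "cycle-consistency training" keyword names a general self-supervised objective: map a mask to the other view, map it back, and penalize disagreement with the original. A minimal sketch using toy shift functions in place of the learned cross-view predictors (illustrative only, not the paper's networks):

```python
import numpy as np

def cycle_consistency_loss(mask, ego_to_exo, exo_to_ego):
    """Cycle objective sketch: round-trip a mask through both view
    mappings and measure how far it lands from where it started."""
    round_trip = exo_to_ego(ego_to_exo(mask))
    return float(np.mean((round_trip - mask) ** 2))

# Toy "views": a horizontal shift and its exact inverse stand in for the
# learned ego->exo and exo->ego mask predictors.
shift = lambda m: np.roll(m, 2, axis=1)
unshift = lambda m: np.roll(m, -2, axis=1)

mask = np.zeros((4, 8)); mask[1:3, 2:5] = 1.0
loss = cycle_consistency_loss(mask, shift, unshift)   # consistent pair -> 0
```

The loss is zero exactly when the two mappings are mutual inverses on the mask, which is what makes it usable as a training signal without ground-truth correspondence.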

Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Hila Manor, Rinon Gal, Haggai Maron +2 more

Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-t...

Low-Rank Adaptation · LoRA · visual analogy learning · dynamic composition · transformation primitives · +2 more
Feb 17, 2026 · 11
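
The a : a' :: b : b' objective stated in the abstract has a classic vector-arithmetic reading: extract the transformation a' − a and apply it to b. A toy sketch of that reading in an embedding space (this illustrates the analogy objective only; the paper itself works with a weight basis of LoRAs, not raw embedding arithmetic):

```python
import numpy as np

def solve_analogy(a, a_prime, b):
    """Vector-arithmetic analogy: whatever transformation maps a to a'
    (here, simply a' - a) is applied to b to produce b'."""
    return b + (a_prime - a)

# Toy embeddings where the demonstrated edit is "add t" (e.g. an
# attribute direction); names and values are purely illustrative.
t = np.array([0.0, 1.0, 0.5])
a = np.array([1.0, 0.0, 0.0])
a_prime = a + t                       # a -> a' demonstrates the edit
b = np.array([0.0, 2.0, 1.0])
b_prime = solve_analogy(a, a_prime, b)   # the same edit applied to b
```
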

SAM 3D Body: Robust Full-Body Human Mesh Recovery

Xitong Yang, Devansh Kukreja, Don Pinkus +11 more

We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. ...

3D human mesh recovery · encoder-decoder architecture · parametric mesh representation · Momentum Human Rig · 2D keypoints · +9 more
Feb 17, 2026 · 8

ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah +4 more

Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct h...

Vision Language Models · Key Entity Extraction · Relation Extraction · Visual Question Answering · structured Information Extraction · +4 more
Feb 12, 2026 · 3

SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Saar Huberman, Kfir Goldberg, Or Patashnik +2 more

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centri...

semantic motion · video representation · optical flow · temporal statistics · higher-order moments · +2 more
Feb 9, 2026 · 19
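
The "third moment" in the title is an order-3 temporal statistic: the standardized third central moment (skewness) of a motion signal over time, which distinguishes motion profiles that first- and second-order statistics cannot. A hedged sketch of computing it per pixel (an illustrative computation, not the paper's exact feature pipeline):

```python
import numpy as np

def third_moment_features(flow, axis=0):
    """Standardized third central moment (skewness) of a per-pixel
    motion signal along the time axis. Symmetric motion scores near
    zero; burst-like motion scores far from zero."""
    mu = flow.mean(axis=axis, keepdims=True)
    sigma = flow.std(axis=axis, keepdims=True) + 1e-8
    z = (flow - mu) / sigma
    return (z ** 3).mean(axis=axis)

# Toy sequences of a single pixel's motion magnitude over 64 frames:
# a symmetric oscillation vs. a burst of motion followed by rest.
t = np.linspace(0, 2 * np.pi, 64)
symmetric = np.sin(t)[:, None]                  # skewness ~ 0
bursty = np.where(t < 1.0, 5.0, 0.0)[:, None]   # strongly skewed
feat_sym = third_moment_features(symmetric)
feat_burst = third_moment_features(bursty)
```

Both toy signals here have the same mean-free, bounded character, so a second-order statistic alone would struggle to separate them; the third moment does.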

LIVE: Long-horizon Interactive Video World Modeling

Junchao Huang, Ziyang Ye, Xinting Hu +5 more

Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher...

video world models · autoregressive models · error accumulation · cycle-consistency objective · diffusion loss · +3 more
Feb 3, 2026 · 9
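
The error-accumulation problem the abstract describes is easy to demonstrate: in an autoregressive rollout each prediction is fed back as the next input, so even a tiny systematic one-step error compounds with the horizon. A toy illustration of the failure mode (not LIVE's model; the dynamics and error here are made up):

```python
def rollout(x0, step, horizon, per_step_error=0.0):
    """Autoregressive rollout: each output becomes the next input, so a
    small per-step prediction error accumulates over the horizon."""
    x, traj = x0, []
    for _ in range(horizon):
        x = step(x) + per_step_error   # imperfect one-step prediction
        traj.append(x)
    return traj

step = lambda x: x                     # trivially stable toy "world"
clean = rollout(1.0, step, horizon=100)
drifted = rollout(1.0, step, horizon=100, per_step_error=0.01)
gap = [abs(a - b) for a, b in zip(drifted, clean)]
# gap grows linearly with the horizon even though each step's error is tiny
```
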

LoopViT: Scaling Visual ARC with Looped Transformers

Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu +3 more

Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. ...

vision transformers · ARC-AGI benchmark · feed-forward architecture · weight-tied recurrence · Hybrid Block · +6 more
Feb 2, 2026 · 8
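
The "weight-tied recurrence" keyword names the core looped-transformer idea: apply the same parameterized block repeatedly, so compute depth grows with loop count while parameter count stays fixed. A minimal sketch of that pattern (a generic illustration, not LoopViT's actual block):

```python
import numpy as np

def looped_forward(x, w, n_loops):
    """Weight-tied recurrence: the SAME parameters w are applied
    n_loops times, decoupling computational depth from parameter size.
    A residual path keeps iterates stable across loops."""
    for _ in range(n_loops):
        x = np.tanh(x @ w) + x      # one shared block, reused every loop
    return x

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(8, 8))    # one set of weights, reused
x = rng.normal(size=(4, 8))
shallow = looped_forward(x, w, n_loops=2)
deep = looped_forward(x, w, n_loops=12)   # more compute, same parameters
```

Varying `n_loops` at inference is what lets such models spend more iterations on harder inputs, mimicking iterative human induction.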

Interacted Planes Reveal 3D Line Mapping

Zeran Ke, Bin Tan, Gui-Song Xia +2 more

3D line mapping from multi-view RGB images provides a compact and structured visual representation of scenes. We study the problem from a physical and topological perspective: a 3D line most naturally emerges as the edge of a finite 3D planar patch. We present LiP-Map, a line-plane joint optimizatio...

3D line mapping · planar patch · line-plane joint optimization · learnable primitives · structured reconstruction · +1 more
Feb 1, 2026 · 3

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Baorui Ma, Jiahui Yang, Donglin Di +5 more

Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scala...

vision foundation models · metric depth estimation · sparse metric prompt · pretraining framework · depth completion · +9 more
Jan 29, 2026 · 3

PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Changjian Jiang, Kerui Ren, Xudong Li +7 more

Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely cou...

monocular image sequences · hybrid representation · explicit geometric primitives · neural Gaussians · decoupled manner · +7 more
Jan 29, 2026 · 20
Page 1 of 2