Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing Lyu, Chun Yuan, Fengyun Rao

Published: February 22, 2026
Authors: 10
Word Count: 8,843

Cycle-consistent mask prediction enables robots to locate objects across drastically different viewpoints.

Abstract

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
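The cycle-consistency objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `bce`, `cycle_consistency_loss`, `predict_forward`, and `predict_backward` are hypothetical, the predictors are toy stand-ins for the conditional segmentation model (the video frames they would condition on are omitted), and binary cross-entropy is an assumed choice of reconstruction loss, since the abstract does not name one.

```python
import math
from typing import Callable, List

# A mask is a 2D grid of per-pixel foreground probabilities in [0, 1].
Mask = List[List[float]]
# Hypothetical predictor type: maps a mask in one view to a mask in the other.
Predictor = Callable[[Mask], Mask]


def bce(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a predicted mask and a reference mask."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def cycle_consistency_loss(
    query_mask: Mask,
    predict_forward: Predictor,
    predict_backward: Predictor,
) -> float:
    """Source -> target -> source reconstruction loss.

    Only the query mask is needed as supervision; no target-view
    ground-truth mask appears anywhere in the computation.
    """
    target_mask = predict_forward(query_mask)       # source view -> target view
    reconstructed = predict_backward(target_mask)   # target view -> source view
    return bce(reconstructed, query_mask)


# Toy stand-ins (assumptions for illustration): the "view change" is a
# horizontal flip, so flipping back reconstructs the query exactly.
def flip(mask: Mask) -> Mask:
    return [list(reversed(row)) for row in mask]


def identity(mask: Mask) -> Mask:
    return [list(row) for row in mask]


query = [[1.0, 0.0], [1.0, 0.0]]
consistent = cycle_consistency_loss(query, flip, flip)        # near zero
inconsistent = cycle_consistency_loss(query, flip, identity)  # large
print(consistent < inconsistent)
```

Because the loss depends only on the query mask, the same quantity could in principle be minimized on a test pair at inference, which is the property the abstract relies on for test-time training.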

Key Takeaways

1. Cycle-consistent mask prediction establishes object-level correspondence across views, which lets robots locate the same object from different perspectives.
2. Conditional binary segmentation with a transformer architecture achieves state-of-the-art results on Ego-Exo4D and HANDAL-X without auxiliary modules or extra data.
3. Cycle consistency provides a self-supervisory signal for learning view-invariant object representations across egocentric and exocentric views.

Limitations

Previous methods required auxiliary modules, extra data, or complex pipelines that limited practical applicability.
Traditional correspondence approaches fail when visual appearance varies drastically between egocentric and exocentric viewpoints.

Keywords

conditional binary segmentation, view-invariant representations, cycle-consistency training, test-time training, object correspondence, visual correspondence
