Computer Vision

Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing Lyu, Chun Yuan, Fengyun Rao
Published: February 22, 2026
Authors: 10
Word Count: 8,843

Cycle-consistent mask prediction locates the same object across drastically different viewpoints, such as egocentric and exocentric video.

Abstract

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
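The cycle-consistency objective described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `predict` callable stands in for the conditional segmentation model, masks are flattened lists of floats, the soft Dice loss is one plausible choice of reconstruction loss, and `flip_predictor` is a toy "model" in which changing view simply reverses the pixel order.

```python
from typing import Callable, List

Mask = List[float]  # flattened soft mask, values in [0, 1]
# Hypothetical model interface: (source view, query mask, target view) -> predicted mask
Predictor = Callable[[str, Mask, str], Mask]


def dice_loss(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Soft Dice loss between two masks (assumed loss, not confirmed by the source)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def cycle_consistency_loss(
    predict: Predictor, src_view: str, tgt_view: str, query_mask: Mask
) -> float:
    """Predict the mask in the target view, project it back, compare with the query."""
    tgt_mask = predict(src_view, query_mask, tgt_view)   # source -> target
    recon_mask = predict(tgt_view, tgt_mask, src_view)   # target -> source
    return dice_loss(recon_mask, query_mask)             # no target ground truth needed


def flip_predictor(src_view: str, mask: Mask, tgt_view: str) -> Mask:
    """Toy model: switching views reverses the pixel order."""
    return list(reversed(mask)) if src_view != tgt_view else list(mask)


def zero_predictor(src_view: str, mask: Mask, tgt_view: str) -> Mask:
    """Toy model that always predicts an empty mask."""
    return [0.0] * len(mask)


query = [0.0, 1.0, 1.0, 0.0, 0.0]
good_loss = cycle_consistency_loss(flip_predictor, "ego", "exo", query)
bad_loss = cycle_consistency_loss(zero_predictor, "ego", "exo", query)
```

In this toy setup the cycle-consistent predictor reconstructs the query exactly and receives a loss of zero, while the empty-mask predictor is penalized, even though no target-view annotation is used anywhere.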

Key Takeaways

  1. Cycle-consistent mask prediction establishes object-level correspondence between egocentric and exocentric views of the same scene.

  2. Conditional binary segmentation with a transformer architecture achieves state-of-the-art results on Ego-Exo4D and HANDAL-X without requiring auxiliary modules or extra data.

  3. Cycle consistency provides a self-supervisory signal, requiring no ground-truth annotations, for learning view-invariant object representations across egocentric and exocentric views.
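Because the cycle loss needs no ground-truth annotations, it can in principle be minimized on each test sample before predicting, which is the idea behind test-time training. The sketch below is a deliberately tiny illustration under loud assumptions: the "model" is a single hypothetical scalar `w` that scales the mask on each hop, the gradient is estimated by central finite differences, and the step count and learning rate are arbitrary. It is not the paper's architecture or optimizer.

```python
from typing import Callable, List


def cycle_loss(w: float, query_mask: List[float]) -> float:
    """Toy cycle loss: a one-parameter 'model' scales the mask by w on each hop."""
    forward = [w * q for q in query_mask]   # source -> target prediction
    recon = [w * m for m in forward]        # target -> source reconstruction
    # Mean squared error between the reconstruction and the original query mask
    return sum((r - q) ** 2 for r, q in zip(recon, query_mask)) / len(query_mask)


def adapt_at_test_time(
    loss_fn: Callable[[float], float],
    w: float,
    steps: int = 100,
    lr: float = 0.1,
    h: float = 1e-4,
) -> float:
    """Take a few gradient steps on the self-supervised loss for one test sample."""
    for _ in range(steps):
        # Central finite-difference estimate of d(loss)/dw
        grad = (loss_fn(w + h) - loss_fn(w - h)) / (2.0 * h)
        w -= lr * grad
    return w


query = [0.0, 1.0, 1.0, 0.0, 0.0]
w_initial = 0.5  # a miscalibrated starting parameter (assumed)
w_adapted = adapt_at_test_time(lambda w: cycle_loss(w, query), w_initial)
```

Starting from a miscalibrated parameter, the adaptation loop drives the cycle loss down using only the query mask, which is the property that makes the objective usable at inference time.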

Limitations

  • Previous methods required auxiliary modules, extra data, or complex pipelines that limited practical applicability.

  • Traditional correspondence approaches fail when visual appearance varies drastically between egocentric and exocentric viewpoints.

Keywords

conditional binary segmentation, view-invariant representations, cycle-consistency training, test-time training, object correspondence, visual correspondence
