Computer Vision

SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady
Published: February 9, 2026
Authors: 5
Word Count: 6,983

A training-free motion-similarity method built on third-moment statistical features outperforms appearance-biased video models.

Abstract

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
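
As a rough illustration of the idea described above (not the authors' released code), the sketch below assumes per-frame features from some generic pre-trained image encoder; the `encode_frames` step is left as a hypothetical stub. It computes the first three temporal moments per feature dimension, with the mean, variance, and skewness (the standardized third moment) concatenated into a single motion descriptor, requiring no training.

```python
import numpy as np

def temporal_moments(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Summarize per-frame semantic features with temporal statistics.

    features: array of shape (T, D) -- one D-dimensional feature vector
    per frame, e.g. from a pre-trained image encoder (CLIP, DINO, ...;
    the choice of encoder here is an assumption, not specified above).
    Returns a (3 * D,) descriptor: per-dimension mean, variance, and
    skewness (standardized third moment) over the time axis.
    """
    mean = features.mean(axis=0)          # first moment, shape (D,)
    centered = features - mean
    var = centered.var(axis=0)            # second central moment, shape (D,)
    std = np.sqrt(var) + eps              # eps guards constant dimensions
    skew = ((centered / std) ** 3).mean(axis=0)  # third standardized moment
    return np.concatenate([mean, var, skew])
```

Because the descriptor is built from temporal statistics rather than any single frame, two clips with similar dynamics but different appearance can still land close together in descriptor space, which is the behavior the abstract attributes to the method.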

Key Takeaways

  1. Current video models are biased toward static appearance rather than understanding actual motion as temporal change.

  2. SimMotion benchmarks reveal that state-of-the-art action recognition models fail when motion similarity is isolated from appearance.

  3. SemanticMoments uses statistical moment features to capture motion without training, outperforming existing supervised and self-supervised approaches (see the retrieval sketch after this list).
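
A minimal retrieval sketch following from takeaway 3, assuming descriptors produced by a function like `temporal_moments` above. Cosine similarity is used here as one natural choice of metric; the summary does not state which similarity measure the paper actually uses.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two motion descriptors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_by_motion(query: np.ndarray, gallery: list[np.ndarray]) -> list[int]:
    """Return gallery indices sorted by descending motion similarity."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
```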

Limitations

  • SimMotion-Real benchmark contains only 40 curated examples, limiting real-world validation scope and generalizability.

  • The available source text cuts off before fully explaining the SemanticMoments method's implementation and complete performance results.

Keywords

semantic motion, video representation, optical flow, temporal statistics, higher-order moments, pre-trained semantic models, SimMotion benchmarks
