Computer Vision

SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady
Published: February 9, 2026
Authors: 5
Word Count: 6,983

A training-free motion-similarity method built on third-moment statistical features outperforms appearance-biased video models.

Abstract

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
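
As a rough illustration of the idea described above (not the authors' released code), the sketch below assumes per-frame features from some generic pre-trained image encoder; the `encode_frames` step is left as a hypothetical stub. It computes the first three temporal moments per feature dimension, with the mean, variance, and skewness (the standardized third moment) concatenated into a single motion descriptor, requiring no training.

```python
import numpy as np

def temporal_moments(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Summarize per-frame semantic features with temporal statistics.

    features: array of shape (T, D) -- one D-dimensional feature vector
    per frame, e.g. from a pre-trained image encoder (CLIP, DINO, ...;
    the choice of encoder here is an assumption, not specified above).
    Returns a (3 * D,) descriptor: per-dimension mean, variance, and
    skewness (standardized third moment) over the time axis.
    """
    mean = features.mean(axis=0)          # first moment, shape (D,)
    centered = features - mean
    var = centered.var(axis=0)            # second central moment, shape (D,)
    std = np.sqrt(var) + eps              # eps guards constant dimensions
    skew = ((centered / std) ** 3).mean(axis=0)  # third standardized moment
    return np.concatenate([mean, var, skew])
```

Because the descriptor is built from temporal statistics rather than any single frame, two clips with similar dynamics but different appearance can still land close together in descriptor space, which is the behavior the abstract attributes to the method.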

Key Takeaways

  1. Current video models are biased toward static appearance rather than understanding actual motion as temporal change.

  2. SimMotion benchmarks reveal that state-of-the-art action recognition models fail when motion similarity is isolated from appearance.

  3. SemanticMoments uses statistical moment features to capture motion without training, outperforming existing supervised and self-supervised approaches (see the retrieval sketch after this list).
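
A minimal retrieval sketch following from takeaway 3, assuming descriptors produced by a function like `temporal_moments` above. Cosine similarity is used here as one natural choice of metric; the summary does not state which similarity measure the paper actually uses.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two motion descriptors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_by_motion(query: np.ndarray, gallery: list[np.ndarray]) -> list[int]:
    """Return gallery indices sorted by descending motion similarity."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
```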

Limitations

  • SimMotion-Real benchmark contains only 40 curated examples, limiting real-world validation scope and generalizability.

  • The available source text cuts off before fully explaining the SemanticMoments method's implementation and complete performance results.

Keywords

semantic motion, video representation, optical flow, temporal statistics, higher-order moments, pre-trained semantic models, SimMotion benchmarks
