Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video
  Representation Learning

Chen, Qijun; Dang, Ronghao; Lin, Xiao; Liu, Chengju; Zhu, Minghao

Authors: Qijun Chen, Ronghao Dang, Xiao Lin, Chengju Liu, Minghao Zhu
Publication date: 1 September 2023

Abstract

As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from weak spatial and temporal alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess great motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code is available at https://github.com/ZMHH-H/FIMA.

Comment: ACM MM 2023 Camera Ready
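
The abstract names two building blocks that can be illustrated in isolation: frame difference as a cheap motion signal, and a contrastive loss that pulls a feature toward its matching key and away from the others. The sketch below is only a generic illustration under stated assumptions, not the authors' implementation (see the repository linked above for that). The function names `frame_difference`, `l2_normalize`, and `info_nce`, the temperature value, and the use of a standard InfoNCE loss are all assumptions made here; FIMA's dense pixel-level supervision, motion decoder, and foreground sampling are not reproduced.

```python
import math


def frame_difference(frames):
    """Return per-pixel differences between consecutive frames.

    frames: list of T frames, each a list of rows of grayscale floats.
    The result holds T - 1 difference maps with the same spatial size.
    """
    return [
        [[b - a for a, b in zip(row_prev, row_next)]
         for row_prev, row_next in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def l2_normalize(vector):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(query, keys, positive_index, temperature=0.1):
    """Generic InfoNCE loss for one query against a list of keys.

    Illustrative assumption: in a frame-level setting, the query could be an
    RGB feature of one frame and the keys motion features of every frame in
    the clip, with the temporally matching frame as the positive.
    """
    q = l2_normalize(query)
    logits = [
        sum(a * b for a, b in zip(q, l2_normalize(k))) / temperature
        for k in keys
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[positive_index]


# Usage: two 2x2 frames give one difference map; the loss is small when the
# query matches its positive key and large when it matches a negative.
frames = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 2.0]]]
motion = frame_difference(frames)
loss_aligned = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], positive_index=0)
loss_misaligned = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], positive_index=1)
```

In this toy setting the aligned loss is close to zero and the misaligned loss is close to 1 / temperature, which is the behavior a contrastive objective relies on to separate matching from non-matching frames.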

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2309.00297

Last time updated on 12/09/2023