We present a self-supervised approach for learning video representations
using temporal video alignment as a pretext task, while exploiting both
frame-level and video-level information. We leverage a novel combination of a
temporal alignment loss and temporal regularization terms, which serve as
supervision signals for training an encoder network. Specifically, the temporal
alignment loss (i.e., Soft-DTW) seeks the minimum cost for temporally
aligning videos in the embedding space. However, optimizing solely for this
term leads to trivial solutions, in particular one where all frames are mapped
to a small cluster in the embedding space. To overcome this problem, we propose
a temporal regularization term (i.e., Contrastive-IDM) which encourages
different frames to be mapped to different points in the embedding space.
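For concreteness, a minimal sketch of the combined objective is given below. The Soft-DTW formulation follows Cuturi and Blondel (2017); the specific hinge form of the Contrastive-IDM term, along with the margin λ, separation window σ, and weight α, are illustrative assumptions rather than the paper's exact definitions.

```latex
% Soft-DTW alignment cost between embedded videos X = (x_1,...,x_n) and
% Y = (y_1,...,y_m); Delta(X,Y) is the pairwise frame-distance matrix and
% A ranges over monotonic alignment matrices.
\[
\mathrm{dtw}_{\gamma}(X,Y) = {\min}^{\gamma}\bigl\{ \langle A, \Delta(X,Y)\rangle : A \in \mathcal{A}_{n,m} \bigr\},
\qquad
{\min}^{\gamma}\{a_1,\dots,a_k\} = -\gamma \log \sum_{i=1}^{k} e^{-a_i/\gamma}.
\]
% One plausible hinge-style Contrastive-IDM regularizer: temporally distant
% frames (|i-j| > sigma) are pushed at least a margin lambda apart, while
% temporally close frames are pulled together.
\[
I_{c}(X) = \sum_{|i-j|>\sigma} \max\bigl(0,\, \lambda - \lVert x_i - x_j \rVert^{2}\bigr)
         + \sum_{|i-j|\le\sigma} \lVert x_i - x_j \rVert^{2}.
\]
% Assumed combined objective per video pair, with trade-off weight alpha.
\[
\mathcal{L}(X,Y) = \mathrm{dtw}_{\gamma}(X,Y) + \alpha \bigl( I_{c}(X) + I_{c}(Y) \bigr).
\]
```

Intuitively, the alignment term pulls corresponding frames of the two videos together, while the hinge in the regularizer keeps temporally distant frames of the same video apart, ruling out the collapsed solution described above.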
Extensive evaluations on various tasks, including action phase classification,
action phase progression, and fine-grained frame retrieval, on three datasets,
namely Pouring, Penn Action, and IKEA ASM, show superior performance of our
approach over state-of-the-art methods for self-supervised representation
learning from videos. In addition, our method provides significant performance
gains when labeled data is scarce.