RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning
We study unsupervised video representation learning that seeks to learn both
motion and appearance features from unlabeled video only, which can be reused
for downstream tasks such as action recognition. This task, however, is
extremely challenging due to 1) the highly complex spatial-temporal information
in videos; and 2) the lack of labeled data for training. Unlike the
representation learning for static images, it is difficult to construct a
suitable self-supervised task that models both motion and appearance
features well. More recently, several attempts have been made to learn video
representation through video playback speed prediction. However, it is
non-trivial to obtain precise speed labels for the videos. More critically, the
learnt models may tend to focus on motion patterns and thus may not learn
appearance features well. In this paper, we observe that the relative playback
speed is more consistent with the motion pattern and thus provides more effective
and stable supervision for representation learning. Therefore, we propose a new
way to perceive the playback speed and exploit the relative speed between two
video clips as labels. In this way, we are able to perceive speed well and
learn better motion features. Moreover, to ensure the learning of appearance
features, we further propose an appearance-focused task, in which we require the
model to perceive the appearance difference between two video clips. We show
that optimizing the two tasks jointly consistently improves the performance on
two downstream tasks, namely action recognition and video retrieval.
Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy
without the use of labeled data for pre-training, which outperforms the
ImageNet supervised pre-trained model. Code and pre-trained models can be found
at https://github.com/PeihaoChen/RSPNet.
Comment: Accepted by AAAI-2021.
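As an illustration of the relative-speed idea, the sketch below shows one way such a pretext label could be constructed in PyTorch-style Python. The helper names, speed set, and label convention are assumptions for illustration only, not the authors' released implementation (see the repository above for that).

```python
# Minimal sketch of a relative playback speed pretext sampler (hypothetical
# helper names; not the authors' code). Assumes `video` is a tensor of frames
# with shape (num_frames, C, H, W) and is long enough for the fastest speed.
import random
import torch

def sample_clip(video, speed, clip_len=16):
    """Sample `clip_len` frames at a given playback speed by skipping frames."""
    max_start = video.shape[0] - clip_len * speed
    start = random.randint(0, max_start)
    idx = torch.arange(start, start + clip_len * speed, speed)
    return video[idx]

def relative_speed_pair(video, speeds=(1, 2, 4), clip_len=16):
    """Return two clips and a relative-speed label.

    Label convention (assumed): 0 -> first clip is faster, 1 -> second clip
    is faster, 2 -> both clips share the same speed. The network is trained
    to predict this relative label instead of the absolute speed of a clip.
    """
    s1, s2 = random.choice(speeds), random.choice(speeds)
    clip1, clip2 = sample_clip(video, s1, clip_len), sample_clip(video, s2, clip_len)
    label = 0 if s1 > s2 else 1 if s2 > s1 else 2
    return clip1, clip2, label
```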
Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation
Despite the outstanding success of self-supervised pretraining methods for
video representation learning, they generalise poorly when the unlabelled
dataset for pretraining is small or the domain difference between unlabelled
data in the source task (pretraining) and labelled data in the target task (finetuning)
is significant. To mitigate these issues, we propose a novel approach to
complement self-supervised pretraining via an auxiliary pretraining phase
based on knowledge similarity distillation (auxSKD), for better generalisation
with a significantly smaller amount of video data, e.g. Kinetics-100 rather
than Kinetics-400. Our method deploys a teacher network that iteratively
distills its knowledge to a student model by capturing the similarity
information between segments of unlabelled video data. The student model
meanwhile solves a pretext task by exploiting this prior knowledge. We also
introduce a novel pretext task, Video Segment Pace Prediction (VSPP), which
requires our model to predict the playback speed of a randomly selected segment
of the input video to provide more reliable self-supervised representations.
Our experiments show results superior to the state of the art on both the
UCF101 and HMDB51 datasets when pretraining on K100, in apples-to-apples
comparisons. Additionally, we show that our auxiliary pretraining, auxSKD, when
added as an extra pretraining phase to recent state-of-the-art self-supervised
methods (i.e. VCOP, VideoPace, and RSPNet), improves their results on UCF101
and HMDB51. Our code is available at https://github.com/Plrbear/auxSKD
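The following sketch illustrates one plausible form of a similarity-based distillation objective over video segments, assuming cosine similarities and a KL-divergence match between the teacher's and student's similarity distributions; it is an illustrative assumption, not the auxSKD code (refer to the repository above for the actual implementation).

```python
# Minimal sketch of a similarity-based distillation loss over video segments
# (an assumption of how "knowledge similarity distillation" might look; not
# the auxSKD implementation).
import torch
import torch.nn.functional as F

def similarity_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """Match the student's pairwise segment similarities to the teacher's.

    Both inputs have shape (num_segments, dim). Each row of a cosine
    similarity matrix is turned into a distribution over segments, and the
    student is trained to reproduce the teacher's distributions.
    """
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    sim_student = s @ s.t() / temperature
    sim_teacher = t @ t.t() / temperature
    log_p_student = F.log_softmax(sim_student, dim=1)
    p_teacher = F.softmax(sim_teacher, dim=1)
    # KL divergence between the teacher's and student's similarity distributions.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```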
Video Representation Learning by Recognizing Temporal Transformations
We introduce a novel self-supervised learning approach to learn
representations of videos that are responsive to changes in the motion
dynamics. Our representations can be learned from data without human annotation
and provide a substantial boost to the training of neural networks on small
labeled data sets for tasks such as action recognition, which require
accurately distinguishing the motion of objects. We promote accurate learning
of motion without human annotation by training a neural network to discriminate
a video sequence from its temporally transformed versions. To learn to
distinguish non-trivial motions, the design of the transformations is based on
two principles: 1) To define clusters of motions based on time warps of
different magnitude; 2) To ensure that the discrimination is feasible only by
observing and analyzing as many image frames as possible. Thus, we introduce
the following transformations: forward-backward playback, random frame
skipping, and uniform frame skipping. Our experiments show that networks
trained with the proposed method yield representations with improved transfer
performance for action recognition on UCF101 and HMDB51.
Comment: ECCV 2020.
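To make the pretext task concrete, the sketch below shows how the three temporal transformations could be applied to produce clips and classification labels; the clip length, skip factor, and index construction are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch of the temporal-transformation pretext task (illustrative
# parameters; not the paper's exact settings). `video` is a tensor of frames
# with shape (num_frames, C, H, W).
import torch

def temporal_transform(video, transform_id, clip_len=16, skip=4):
    """Return a clip produced by one of the temporal transformations.

    transform_id: 0 -> original playback (consecutive frames),
                  1 -> forward-backward playback,
                  2 -> random frame skipping,
                  3 -> uniform frame skipping (constant speed-up).
    """
    num_frames = video.shape[0]
    if transform_id == 0:
        idx = torch.arange(clip_len)
    elif transform_id == 1:
        half = torch.arange(clip_len // 2)
        idx = torch.cat([half, half.flip(0)])                        # forward, then backward
    elif transform_id == 2:
        idx, _ = torch.sort(torch.randperm(num_frames)[:clip_len])   # random frame skips
    else:
        idx = torch.arange(0, clip_len * skip, skip)                 # constant frame skipping
    return video[idx]

# The pretext task is a 4-way classification: the network receives a clip and
# predicts which transform_id generated it (cross-entropy over the four classes).
```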
Benchmarking self-supervised video representation learning
Self-supervised learning is an effective way for label-free model
pre-training, especially in the video domain where labeling is expensive.
Existing self-supervised works in the video domain use varying experimental
setups to demonstrate their effectiveness and comparison across approaches
becomes challenging with no standard benchmark. In this work, we first provide
a benchmark that enables a comparison of existing approaches on the same
ground. Next, we study five different aspects of self-supervised learning
important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4)
data noise, and 5) feature analysis. To facilitate this study, we focus on
seven different methods along with seven different network architectures and
perform an extensive set of experiments on five different datasets with an
evaluation of two different downstream tasks. We present several interesting
insights from this study, which span different properties of pretraining
and target datasets, pretext tasks, and model architectures, among others. We
further put some of these insights to the test and propose an approach
that requires a limited amount of training data and outperforms existing
state-of-the-art approaches that use 10x the pretraining data. We believe this
work will pave the way for researchers toward a better understanding of
self-supervised pretext tasks in video representation learning.