Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video
We propose a self-supervised visual learning method by predicting the
variable playback speeds of a video. Without semantic labels, we learn the
spatio-temporal visual representation of the video by leveraging the variations
in the visual appearance according to different playback speeds under the
assumption of temporal coherence. To learn the spatio-temporal visual
variations across the entire video, we not only predict a single playback
speed but also generate clips with various playback speeds and directions
from randomized starting points. Hence, the visual representation can be
learned from the meta information (playback speeds and directions) of the
video alone. We also propose a new layer-dependable temporal group
normalization method for 3D convolutional networks that improves
representation learning: we divide the temporal features into several
groups and normalize each one with its own corresponding parameters. We
validate the effectiveness of our method by fine-tuning on the action
recognition and video retrieval tasks on UCF-101 and HMDB-51.
Comment: Accepted by IEEE Access on May 19, 202
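The pretext task above turns clip sampling itself into supervision: each training example is a set of frame indices drawn at some playback speed and direction, and the label is that speed/direction pair. A minimal sketch in pure Python (the function and parameter names here are hypothetical, not the authors' implementation; a real pipeline would decode the indexed frames and feed them to a 3D CNN):

```python
import random

def sample_speed_clip(num_frames, clip_len, speed, forward=True, rng=None):
    """Sample frame indices for a clip played at `speed` frames per step,
    in the given direction, from a random valid starting point."""
    rng = rng or random.Random()
    span = (clip_len - 1) * speed  # temporal extent covered by the clip
    if span >= num_frames:
        raise ValueError("video too short for this speed/clip length")
    start = rng.randrange(num_frames - span)
    indices = [start + i * speed for i in range(clip_len)]
    return indices if forward else indices[::-1]

def make_pretext_example(num_frames, clip_len, speeds=(1, 2, 4, 8), rng=None):
    """Draw one (clip, label) pair; the label encodes speed and direction,
    which is all the supervision this pretext task needs."""
    rng = rng or random.Random()
    speed = rng.choice(speeds)
    forward = rng.random() < 0.5
    clip = sample_speed_clip(num_frames, clip_len, speed, forward, rng)
    label = (speeds.index(speed), int(forward))
    return clip, label
```

Randomizing the starting point (rather than always sampling from frame 0) is what exposes the network to speed variations across the entire video, as the abstract emphasizes.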
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning
We study unsupervised video representation learning that seeks to learn both
motion and appearance features from unlabeled video only, which can be reused
for downstream tasks such as action recognition. This task, however, is
extremely challenging due to 1) the highly complex spatial-temporal information
in videos; and 2) the lack of labeled data for training. Unlike the
representation learning for static images, it is difficult to construct a
suitable self-supervised task to well model both motion and appearance
features. More recently, several attempts have been made to learn video
representation through video playback speed prediction. However, it is
non-trivial to obtain precise speed labels for videos. More critically, the
learnt models may tend to focus on motion patterns and thus fail to learn
appearance features well. In this paper, we observe that the relative
playback speed is more consistent with the motion pattern and thus provides
more effective and stable supervision for representation learning. We
therefore propose a new way to perceive playback speed, exploiting the
relative speed between two video clips as the label. In this way, we can
perceive speed well and learn better motion features. Moreover, to ensure
the learning of appearance
features, we further propose an appearance-focused task, where we enforce the
model to perceive the appearance difference between two video clips. We show
that optimizing the two tasks jointly consistently improves the performance on
two downstream tasks, namely action recognition and video retrieval.
Remarkably, for action recognition on UCF101 dataset, we achieve 93.7% accuracy
without the use of labeled data for pre-training, which outperforms the
ImageNet supervised pre-trained model. Code and pre-trained models can be found
at https://github.com/PeihaoChen/RSPNet.
Comment: Accepted by AAAI-2021.
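The key move in the abstract is replacing absolute speed labels with relative ones: instead of asking "what speed is this clip?", the model compares two clips from the same video. A minimal sketch of that label construction, with hypothetical function names (not taken from the RSPNet codebase):

```python
def relative_speed_label(speed_a, speed_b):
    """Relative-speed supervision (sketch): compare the playback speeds of
    two clips from the same video instead of predicting either absolute
    speed. Returns 0 if clip A is slower, 1 if equal, 2 if faster."""
    if speed_a < speed_b:
        return 0
    if speed_a == speed_b:
        return 1
    return 2

def relative_speed_batch(speed_pairs):
    """Turn (speed_a, speed_b) pairs into 3-way classification targets."""
    return [relative_speed_label(a, b) for a, b in speed_pairs]
```

Because the label depends only on the ordering of the two sampled speeds, it stays correct even when the video's own frame rate or motion tempo makes an absolute speed label ambiguous, which is the stability argument the abstract makes.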
Video Representation Learning by Recognizing Temporal Transformations
We introduce a novel self-supervised learning approach to learn
representations of videos that are responsive to changes in the motion
dynamics. Our representations can be learned from data without human annotation
and provide a substantial boost to the training of neural networks on small
labeled data sets for tasks such as action recognition, which require
accurately distinguishing the motion of objects. We promote accurate learning
of motion without human annotation by training a neural network to discriminate
a video sequence from its temporally transformed versions. To learn to
distinguish non-trivial motions, the design of the transformations is based on
two principles: 1) To define clusters of motions based on time warps of
different magnitude; 2) To ensure that the discrimination is feasible only by
observing and analyzing as many image frames as possible. Thus, we introduce
the following transformations: forward-backward playback, random frame
skipping, and uniform frame skipping. Our experiments show that networks
trained with the proposed method yield representations with improved transfer
performance for action recognition on UCF101 and HMDB51.
Comment: ECCV 202
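The three temporal transformations named in the abstract all operate on frame indices before decoding, so they can be sketched directly on index lists. A minimal illustration (function names and the fixed skip/keep parameters are hypothetical choices for this sketch, not the paper's exact settings):

```python
import random

def forward_backward(frames, forward=True):
    """Play the clip forward or reversed (forward-backward playback)."""
    return list(frames) if forward else list(frames)[::-1]

def uniform_skip(frames, step):
    """Keep every `step`-th frame (uniform frame skipping)."""
    return list(frames)[::step]

def random_skip(frames, keep, rng=None):
    """Keep `keep` frames at randomly chosen, temporally ordered positions
    (random frame skipping, i.e. a non-uniform time warp)."""
    rng = rng or random.Random()
    kept = sorted(rng.sample(range(len(frames)), keep))
    return [frames[i] for i in kept]

def transform_clip(frames, label, rng=None):
    """Apply the transformation selected by `label`; the network is then
    trained to recover this label from the resulting clip."""
    if label == 0:
        return list(frames)                         # untransformed
    if label == 1:
        return forward_backward(frames, forward=False)
    if label == 2:
        return uniform_skip(frames, step=2)
    return random_skip(frames, keep=len(frames) // 2, rng=rng)
```

Note how each transformation changes the clip globally: distinguishing, say, uniform from random skipping cannot be done from one or two frames, which enforces the paper's second design principle of analyzing as many frames as possible.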
Benchmarking self-supervised video representation learning
Self-supervised learning is an effective way for label-free model
pre-training, especially in the video domain where labeling is expensive.
Existing self-supervised works in the video domain use varying experimental
setups to demonstrate their effectiveness and comparison across approaches
becomes challenging with no standard benchmark. In this work, we first provide
a benchmark that enables a comparison of existing approaches on the same
ground. Next, we study five different aspects of self-supervised learning
important for videos: 1) dataset size, 2) complexity, 3) data distribution,
4) data noise, and 5) feature analysis. To facilitate this study, we focus
on seven different methods along with seven different network architectures
and perform an extensive set of experiments on five different datasets with an
evaluation of two different downstream tasks. We present several interesting
insights from this study which span across different properties of pretraining
and target datasets, pretext-tasks, and model architectures among others. We
further put some of these insights to the real test and propose an approach
that requires a limited amount of training data and outperforms existing
state-of-the-art approaches that use 10x the pretraining data. We believe
this work will pave the way toward a better understanding of
self-supervised pretext tasks in video representation learning.
Self-supervised Representation Learning for Videos by Segmenting via Sampling Rate Order Prediction
Shenzhen Science and Technology Projects (Grant Number: JCYJ20200109143035495 and JCYJ20180306173210774)