Self-Supervised Video Representation Learning with Odd-One-Out Networks
We propose a new self-supervised CNN pre-training technique based on a novel
auxiliary task called "odd-one-out learning". In this task, the machine is
asked to identify the unrelated or odd element from a set of otherwise related
elements. We apply this technique to self-supervised video representation
learning, where we sample subsequences from videos and ask the network to
predict the odd video subsequence. The odd subsequence is sampled so that its
frames are in the wrong temporal order, while the even ones preserve the
correct order; consequently, no manual annotation is required to generate an
odd-one-out question. Our learning machine is implemented as a multi-stream
convolutional neural network trained end-to-end. Using odd-one-out networks,
we learn temporal representations for videos that generalize to other related
tasks such as action recognition.
On action classification, our method obtains 60.3% accuracy on the UCF101
dataset using only UCF101 data for training, approximately 10% better than
current state-of-the-art self-supervised learning methods. Similarly, on the
HMDB51 dataset we outperform state-of-the-art self-supervised methods by
12.7% on the action classification task.
Comment: Accepted at the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) 2017
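As a rough illustration of how such questions can be generated without
labels, the sketch below samples frame-index subsequences and shuffles one
of them; the function name and default sizes are illustrative assumptions,
not details from the paper:

```python
import random

def make_odd_one_out_question(num_frames, num_elements=6, subseq_len=6):
    """Sample frame-index subsequences for one odd-one-out question.

    All but one subsequence keep the correct temporal order; the "odd"
    one has its frame indices shuffled. The returned label is the index
    of the odd element, so the question needs no manual annotation.
    """
    # Sample a random starting point for each subsequence.
    questions = []
    for _ in range(num_elements):
        start = random.randrange(num_frames - subseq_len + 1)
        questions.append(list(range(start, start + subseq_len)))

    # Pick one element and destroy its temporal order.
    odd_index = random.randrange(num_elements)
    ordered = questions[odd_index][:]
    while questions[odd_index] == ordered:   # ensure the order actually changes
        random.shuffle(questions[odd_index])

    return questions, odd_index              # frame indices + supervision label
```

The indices can then be used to gather the actual frames for each stream of
the multi-stream network, with `odd_index` as the classification target.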
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without
human-annotated labels. While previous efforts tackle this problem by
designing novel self-supervised tasks using video data, the learned features
are merely frame-based, making them unsuitable for the many video analysis
tasks where spatio-temporal features prevail. In this paper we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzle-style tasks that are hard even for
humans to solve, the proposed approach is consistent with inherent human
visual habits and is therefore easy to answer. We conduct extensive
experiments with C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
https://github.com/laura-wang/video_repres_mas
Comment: CVPR 2019
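To give a feel for such statistical targets, here is a loose numpy sketch,
not the paper's implementation: the paper derives motion statistics from
optical flow, whereas this sketch substitutes frame differencing and a crude
brightness-constancy estimate, and all names and bin counts are assumptions:

```python
import numpy as np

def motion_appearance_targets(clip, grid=4):
    """Derive coarse self-supervision targets from a clip of shape (T, H, W, 3)."""
    T, H, W, _ = clip.shape
    gray = clip.astype(np.float32).mean(axis=-1)   # (T, H, W) intensity
    diff = np.abs(np.diff(gray, axis=0))           # per-pixel motion magnitude

    # Target 1: the fastest-moving block on a grid x grid partition.
    bh, bw = H // grid, W // grid
    motion = diff.sum(axis=0)[:grid * bh, :grid * bw]
    block_energy = motion.reshape(grid, bh, grid, bw).sum(axis=(1, 3))
    fast_block = int(np.argmax(block_energy))

    # Target 2: dominant motion direction, quantised into 8 bins
    # (crude estimate: flow ~ -I_t * grad(I) under brightness constancy).
    dy, dx = np.gradient(gray[0])
    dt = gray[1] - gray[0]
    angle = np.arctan2(-(dy * dt).sum(), -(dx * dt).sum())
    direction_bin = int((angle + np.pi) // (2 * np.pi / 8)) % 8

    # Target 3: dominant color, coarsely quantised per RGB channel (0..2).
    dominant_color = tuple(int(c) // 86 for c in clip.reshape(-1, 3).mean(axis=0))

    return fast_block, direction_bin, dominant_color
```

A network such as C3D can then be trained to regress these targets from the
raw clip, which is the general shape of the supervision described above.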
Self-supervised learning of a facial attribute embedding from video
We propose a self-supervised framework for learning facial attributes by
simply watching videos of a human face speaking, laughing, and moving over
time. To perform this task, we introduce a network, Facial Attributes-Net
(FAb-Net), that is trained to embed multiple frames from the same video
face-track into a common low-dimensional space. With this approach, we make
three contributions: first, we show that the network can leverage information
from multiple source frames by predicting confidence/attention masks for each
frame; second, we demonstrate that using a curriculum learning regime improves
the learned embedding; finally, we demonstrate that the network learns a
meaningful face embedding that encodes information about head pose, facial
landmarks and facial expression, i.e. facial attributes, without having been
supervised with any labelled data. Our method is comparable or superior to
state-of-the-art self-supervised methods on these tasks and approaches the
performance of supervised methods.
Comment: To appear in BMVC 2018. Supplementary material can be found at
http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
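A minimal PyTorch sketch of the confidence-weighted fusion idea follows. It
is an assumption-laden simplification: the encoder is a placeholder rather
than the FAb-Net architecture, and the paper predicts spatial attention
masks, not the single per-frame scalar used here:

```python
import torch
import torch.nn as nn

class ConfidenceAggregator(nn.Module):
    """Fuse embeddings of multiple source frames via predicted confidences."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                # placeholder conv encoder
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.confidence = nn.Linear(embed_dim, 1)    # one score per frame

    def forward(self, frames):                       # frames: (B, N, 3, H, W)
        B, N = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(B, N, -1)  # (B, N, D)
        w = torch.softmax(self.confidence(z), dim=1)           # (B, N, 1)
        return (w * z).sum(dim=1)                    # confidence-weighted embedding
```

The weighting lets the network down-rank uninformative source frames, which
is the intuition behind the confidence/attention masks described above.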