Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
Comment: Under submission as a conference paper
Perfect match: Improved cross-modal embeddings for audio-visual synchronisation
This paper proposes a new strategy for learning powerful cross-modal
embeddings for audio-to-video synchronization. Here, we set up the problem as
one of cross-modal retrieval, where the objective is to find the most relevant
audio segment given a short video clip. The method builds on the recent
advances in learning representations from cross-modal self-supervision.
The main contributions of this paper are as follows: (1) we propose a new
learning strategy where the embeddings are learnt via a multi-way matching
problem, as opposed to a binary classification (matching or non-matching)
problem as proposed by recent papers; (2) we demonstrate that performance of
this method far exceeds the existing baselines on the synchronization task; (3)
we use the learnt embeddings for visual speech recognition in self-supervision,
and show that the performance matches the representations learnt end-to-end in
a fully-supervised manner.
Comment: Preprint. Work in progress
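The difference between binary and multi-way matching in contribution (1) can be illustrated with a minimal sketch. It assumes that embeddings are compared by Euclidean distance and that the multi-way objective is a softmax cross-entropy over the negative distances to N candidate audio segments. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def multiway_matching_loss(video_emb, audio_embs, target_index):
    """Cross-entropy over N candidate audio segments for one video clip.

    Assumption: the logit for each candidate is its negative distance
    to the video embedding, so closer candidates score higher.
    """
    logits = [-euclidean(video_emb, a) for a in audio_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

def retrieve(video_emb, audio_embs):
    """Index of the candidate audio segment closest to the video clip."""
    return min(range(len(audio_embs)),
               key=lambda i: euclidean(video_emb, audio_embs[i]))

# Toy example: candidate 1 is the (hypothetical) in-sync audio segment.
video = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
loss_correct = multiway_matching_loss(video, candidates, 1)
loss_wrong = multiway_matching_loss(video, candidates, 2)
best = retrieve(video, candidates)
```

A binary formulation would score each (video, audio) pair in isolation; the multi-way form above instead ranks all candidates against each other in a single loss term, which mirrors the retrieval setting described in the abstract.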
Self-supervised learning of a facial attribute embedding from video
We propose a self-supervised framework for learning facial attributes by
simply watching videos of a human face speaking, laughing, and moving over
time. To perform this task, we introduce a network, Facial Attributes-Net
(FAb-Net), that is trained to embed multiple frames from the same video
face-track into a common low-dimensional space. With this approach, we make
three contributions: first, we show that the network can leverage information
from multiple source frames by predicting confidence/attention masks for each
frame; second, we demonstrate that using a curriculum learning regime improves
the learned embedding; finally, we demonstrate that the network learns a
meaningful face embedding that encodes information about head pose, facial
landmarks and facial expression, i.e. facial attributes, without having been
supervised with any labelled data. We are comparable or superior to
state-of-the-art self-supervised methods on these tasks and approach the
performance of supervised methods.
Comment: To appear in BMVC 2018. Supplementary material can be found at
http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
Cross Pixel Optical Flow Similarity for Self-Supervised Learning
We propose a novel method for learning convolutional neural image
representations without manual supervision. We use motion cues in the form of
optical flow to supervise representations of static images. The obvious
approach of training a network to predict flow from a single image can be
needlessly difficult due to intrinsic ambiguities in this prediction task. We
instead propose a much simpler learning goal: embed pixels such that the
similarity between their embeddings matches that between their optical flow
vectors. At test time, the learned deep network can be used without access to
video or flow information and transferred to tasks such as image
classification, detection, and segmentation. Our method, which significantly
simplifies previous attempts at using motion for self-supervision, achieves
state-of-the-art results in self-supervision using motion cues, competitive
results for self-supervision in general, and is overall state of the art in
self-supervised pretraining for semantic image segmentation, as demonstrated on
standard benchmarks.
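The learning goal stated above, matching the similarity between pixel embeddings to the similarity between their optical flow vectors, can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: similarity is taken to be cosine similarity, each row of similarities is normalised with a softmax, and the loss is the cross-entropy between the flow-derived and embedding-derived distributions. All names and toy values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_rows(vectors):
    """Row i is a distribution over pixels j based on similarity to pixel i."""
    return [softmax([cosine(vi, vj) for vj in vectors]) for vi in vectors]

def flow_similarity_loss(embeddings, flows):
    """Mean cross-entropy between flow similarities (target) and
    embedding similarities (prediction), one term per pixel."""
    target = similarity_rows(flows)
    predicted = similarity_rows(embeddings)
    total = 0.0
    for p_row, q_row in zip(target, predicted):
        total -= sum(p * math.log(q) for p, q in zip(p_row, q_row))
    return total / len(target)

# Toy example with four pixels: pixels 0,1 move right, pixels 2,3 move up.
flows = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
emb_good = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # groups 0,1 and 2,3
emb_bad = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # groups 0,2 and 1,3
loss_good = flow_similarity_loss(emb_good, flows)
loss_bad = flow_similarity_loss(emb_bad, flows)
```

In this toy case the embedding that groups pixels the same way as the flow receives the lower loss. Flow is only needed to compute the training target, which is consistent with the abstract's point that the network can be used at test time without video or flow.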
Time-Contrastive Networks: Self-Supervised Learning from Video
We propose a self-supervised approach for learning representations and
robotic behaviors entirely from unlabeled videos recorded from multiple
viewpoints, and study how this representation can be used in two robotic
imitation settings: imitating object interactions from videos of humans, and
imitating human poses. Imitation of human behavior requires a
viewpoint-invariant representation that captures the relationships between
end-effectors (hands or robot grippers) and the environment, object attributes,
and body pose. We train our representations using a metric learning loss, where
multiple simultaneous viewpoints of the same observation are attracted in the
embedding space, while being repelled from temporal neighbors which are often
visually similar but functionally different. In other words, the model
simultaneously learns to recognize what is common between different-looking
images, and what is different between similar-looking images. This signal
causes our model to discover attributes that do not change across viewpoint,
but do change across time, while ignoring nuisance variables such as
occlusions, motion blur, lighting and background. We demonstrate that this
representation can be used by a robot to directly mimic human poses without an
explicit correspondence, and that it can be used as a reward function within a
reinforcement learning algorithm. While representations are learned from an
unlabeled collection of task-related videos, robot behaviors such as pouring
are learned by watching a single 3rd-person demonstration by a human. Reward
functions obtained by following the human demonstrations under the learned
representation enable efficient reinforcement learning that is practical for
real-world robotic systems. Video results, open-source code and dataset are
available at https://sermanet.github.io/imitat
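The metric learning signal described above, attracting simultaneous viewpoints and repelling temporal neighbours, can be sketched as a triplet loss. This is a minimal sketch assuming a squared-Euclidean triplet loss with a fixed margin; the margin value, function names and toy embeddings are hypothetical and are not taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def make_triplet(view1, view2, t, offset):
    """Build (anchor, positive, negative) from two synchronised views.

    view1 and view2 are lists of embeddings indexed by frame time.
    The positive is the other viewpoint at the same time t; the negative
    is the same viewpoint at a nearby but different time t + offset.
    """
    return view1[t], view2[t], view1[t + offset]

def time_contrastive_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

# Toy embeddings in which the first coordinate tracks time and the second
# coordinate carries a small viewpoint-dependent offset.
view1 = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
view2 = [[0.0, 0.1], [1.0, 0.1], [2.0, 0.1]]
a, p, n = make_triplet(view1, view2, t=0, offset=2)
loss_good = time_contrastive_loss(a, p, n)

# A poor embedding: the two views of the same moment are far apart,
# while the temporal neighbour is almost identical to the anchor.
loss_bad = time_contrastive_loss([0.0, 0.0], [3.0, 0.0], [0.1, 0.0])
```

The first triplet already satisfies the margin and contributes zero loss, while the second is penalised, which is the behaviour that pushes the embedding to be invariant to viewpoint but sensitive to time.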
Objects that Sound
In this paper our objectives are, first, networks that can embed audio and
visual inputs into a common space that is suitable for cross-modal retrieval;
and second, a network that can localize the object that sounds in an image,
given the audio signal. We achieve both these objectives by training from
unlabelled video using only audio-visual correspondence (AVC) as the objective
function. This is a form of cross-modal self-supervision from video.
To this end, we design new network architectures that can be trained for
cross-modal retrieval and localizing the sound source in an image, by using the
AVC task. We make the following contributions: (i) show that audio and visual
embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and
between-mode retrieval; (ii) explore various architectures for the AVC task,
including those for the visual stream that ingest a single image, or multiple
images, or a single image and multi-frame optical flow; (iii) show that the
semantic object that sounds within an image can be localized (using only the
sound, no motion or flow information); and (iv) give a cautionary tale on how
to avoid undesirable shortcuts in the data preparation.
Comment: Appears in: European Conference on Computer Vision (ECCV) 201
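The AVC objective, deciding whether an image and an audio clip come from the same moment of a video, can be sketched with embeddings in a common space. This is an illustration under stated assumptions rather than the architecture from the paper: the correspondence probability is modelled as a logistic function of the Euclidean distance between the two embeddings, and the `scale` and `bias` parameters (which would be learnt in practice) are hypothetical, as are all names and toy values.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correspondence_probability(image_emb, audio_emb, scale=2.0, bias=2.0):
    """Probability that the image and audio correspond.

    Assumption: a logistic of (bias - scale * distance), so a small
    distance in the common space means a high probability.
    """
    d = euclidean(image_emb, audio_emb)
    return 1.0 / (1.0 + math.exp(scale * d - bias))

def avc_loss(image_emb, audio_emb, corresponds, eps=1e-12):
    """Binary cross-entropy for one (image, audio) pair."""
    p = correspondence_probability(image_emb, audio_emb)
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if corresponds else -math.log(1.0 - p)

def nearest(query, gallery):
    """Retrieval in the common space: index of the closest gallery item."""
    return min(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))

# Toy example: one image embedding and two audio embeddings.
image = [1.0, 0.0]
audio_match = [1.0, 0.1]
audio_mismatch = [-1.0, 0.0]
p_match = correspondence_probability(image, audio_match)
p_mismatch = correspondence_probability(image, audio_mismatch)
retrieved = nearest(image, [audio_mismatch, audio_match])
```

Because the decision depends only on a distance in the common space, the same distance can be reused for both within-mode and between-mode retrieval, as `nearest` does here for an image query over an audio gallery.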