Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker identity
without access to manually annotated data. To do so, we develop a
self-supervised learning objective that exploits the natural cross-modal
synchrony between faces and audio in video. The key idea behind our approach is
to tease apart, without annotation, the representations of linguistic content
and speaker identity. We construct a two-stream architecture which: (1) shares
low-level features common to both representations; and (2) provides a natural
mechanism for explicitly disentangling these factors, offering the potential
for greater generalisation to novel combinations of content and identity and
ultimately producing speaker identity representations that are more robust. We
train our method on a large-scale audio-visual dataset of talking heads 'in the
wild', and demonstrate its efficacy by evaluating the learned speaker
representations for standard speaker recognition performance.
Comment: ICASSP 2020. The first three authors contributed equally to this work.
Video Representation Learning by Recognizing Temporal Transformations
We introduce a novel self-supervised learning approach to learn
representations of videos that are responsive to changes in the motion
dynamics. Our representations can be learned from data without human annotation
and provide a substantial boost to the training of neural networks on small
labeled data sets for tasks such as action recognition, which require
accurately distinguishing the motion of objects. We promote accurate learning
of motion without human annotation by training a neural network to discriminate
a video sequence from its temporally transformed versions. To learn to
distinguish non-trivial motions, the design of the transformations is based on
two principles: 1) To define clusters of motions based on time warps of
different magnitude; 2) To ensure that the discrimination is feasible only by
observing and analyzing as many image frames as possible. Thus, we introduce
the following transformations: forward-backward playback, random frame
skipping, and uniform frame skipping. Our experiments show that networks
trained with the proposed method yield representations with improved transfer
performance for action recognition on UCF101 and HMDB51.
Comment: ECCV 202
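The three transformations named in the abstract above can be sketched as frame-index samplers. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how clips are sampled, so the function names, the parameters (`clip_len`, `step`, `max_step`, `start`), the class labels, and the exact forward-backward pattern are all assumptions made here.

```python
import random

# Hypothetical class labels for the discrimination task (not from the paper).
ORIGINAL, UNIFORM_SKIP, RANDOM_SKIP, FORWARD_BACKWARD = range(4)


def uniform_skip(num_frames, clip_len, step, start=0):
    """Indices of a clip that keeps every `step`-th frame (constant speed-up)."""
    last = start + (clip_len - 1) * step
    if last >= num_frames:
        raise ValueError("video too short for this clip length and step")
    return [start + i * step for i in range(clip_len)]


def random_skip(num_frames, clip_len, max_step, rng, start=0):
    """Indices of a clip whose frame-to-frame gap varies randomly in [1, max_step]."""
    # Check the worst case up front so sampling can never run past the video.
    if start + (clip_len - 1) * max_step >= num_frames:
        raise ValueError("video too short for this clip length and max_step")
    indices = [start]
    for _ in range(clip_len - 1):
        indices.append(indices[-1] + rng.randint(1, max_step))
    return indices


def forward_backward(num_frames, clip_len, start=0):
    """Indices that play forward, then reverse direction (assumed pattern)."""
    n_forward = clip_len // 2 + 1
    if start + n_forward - 1 >= num_frames:
        raise ValueError("video too short for this clip length")
    forward = list(range(start, start + n_forward))
    # Walk back over the forward frames, skipping the turning-point frame.
    backward = forward[-2::-1][: clip_len - n_forward]
    return (forward + backward)[:clip_len]


def make_sample(num_frames, clip_len, rng, step=2, max_step=4):
    """Pick one transformation at random; return (frame_indices, label)."""
    label = rng.choice([ORIGINAL, UNIFORM_SKIP, RANDOM_SKIP, FORWARD_BACKWARD])
    if label == ORIGINAL:
        return uniform_skip(num_frames, clip_len, 1), label
    if label == UNIFORM_SKIP:
        return uniform_skip(num_frames, clip_len, step), label
    if label == RANDOM_SKIP:
        return random_skip(num_frames, clip_len, max_step, rng), label
    return forward_backward(num_frames, clip_len), label
```

In a setup like this, a network would receive the frames selected by `make_sample` and be trained to predict the label. Working with indices rather than pixel data keeps the sketch independent of any particular video-loading library.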