Self-supervised learning of a facial attribute embedding from video
We propose a self-supervised framework for learning facial attributes by
simply watching videos of a human face speaking, laughing, and moving over
time. To perform this task, we introduce a network, Facial Attributes-Net
(FAb-Net), that is trained to embed multiple frames from the same video
face-track into a common low-dimensional space. With this approach, we make
three contributions: first, we show that the network can leverage information
from multiple source frames by predicting confidence/attention masks for each
frame; second, we demonstrate that using a curriculum learning regime improves
the learned embedding; finally, we demonstrate that the network learns a
meaningful face embedding that encodes information about head pose, facial
landmarks and facial expression, i.e. facial attributes, without having been
supervised with any labelled data. Our method is comparable or superior to
state-of-the-art self-supervised methods on these tasks and approach the
performance of supervised methods.
Comment: To appear in BMVC 2018. Supplementary material can be found at
http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
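The first contribution above, combining several source frames through per-frame confidence masks, can be illustrated with a small sketch. This is not the authors' implementation: the function name `fuse_frames`, the flat-list image representation, and the choice of a per-pixel softmax over confidences are all assumptions made for illustration.

```python
import math


def fuse_frames(frames, confidences):
    """Fuse per-frame predictions with a per-pixel softmax over confidences.

    frames:      list of N predictions, each a flat list of pixel values.
    confidences: list of N confidence maps with the same shape as frames.
    Both the representation and the softmax weighting are illustrative
    assumptions, not the scheme confirmed by the abstract.
    """
    n_pixels = len(frames[0])
    fused = []
    for p in range(n_pixels):
        scores = [conf[p] for conf in confidences]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of this pixel across all source frames.
        fused.append(sum(w / total * f[p] for w, f in zip(weights, frames)))
    return fused


# Equal confidences reduce to a plain average of the two frames.
print(fuse_frames([[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]))
```

With this weighting, a frame that is much more confident at a given pixel dominates the fused value there, while equally confident frames are simply averaged.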
Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, which is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 201
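The idea of a curriculum for hard negative mining can be sketched as follows. The abstract does not specify the schedule, so everything here is an assumption: the helper name `pick_negative`, the difficulty parameter `tau`, and the rule of ranking negatives by squared Euclidean distance to the anchor are hypothetical choices for illustration only.

```python
def pick_negative(anchor, negatives, tau):
    """Pick a negative whose difficulty is controlled by tau in [0, 1].

    tau = 0 returns the easiest negative (farthest from the anchor),
    tau = 1 returns the hardest (closest to the anchor). Raising tau
    during training gives a simple curriculum. This rule is an
    illustrative assumption, not the paper's confirmed schedule.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank negatives from easiest (largest distance) to hardest.
    ranked = sorted(negatives, key=lambda n: sq_dist(anchor, n), reverse=True)
    index = min(int(tau * len(ranked)), len(ranked) - 1)
    return ranked[index]


# A voice embedding as anchor and three non-matching face embeddings.
anchor = [0.0, 0.0]
negatives = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
for tau in (0.0, 0.5, 1.0):
    print(tau, pick_negative(anchor, negatives, tau))
```

Starting with a low `tau` and increasing it over epochs would expose the model to progressively harder negatives, which is the general intent of a curriculum.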