Self-supervised learning of a facial attribute embedding from video
We propose a self-supervised framework for learning facial attributes by
simply watching videos of a human face speaking, laughing, and moving over
time. To perform this task, we introduce a network, Facial Attributes-Net
(FAb-Net), that is trained to embed multiple frames from the same video
face-track into a common low-dimensional space. With this approach, we make
three contributions: first, we show that the network can leverage information
from multiple source frames by predicting confidence/attention masks for each
frame; second, we demonstrate that using a curriculum learning regime improves
the learned embedding; finally, we demonstrate that the network learns a
meaningful face embedding that encodes information about head pose, facial
landmarks and facial expression, i.e. facial attributes, without having been
supervised with any labelled data. Our method is comparable or superior to
state-of-the-art self-supervised methods on these tasks and approaches the
performance of supervised methods.
Comment: To appear in BMVC 2018. Supplementary material can be found at
http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
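
A minimal sketch (PyTorch) of the multi-frame attention idea this abstract
describes: each source frame is encoded to an embedding and a scalar
confidence, and the per-frame embeddings are fused by a softmax-weighted
average. All names (FrameEncoder, embed_dim) and the tiny backbone are
illustrative assumptions, not the authors' FAb-Net code.

import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    # Hypothetical encoder; a real implementation would use a deeper backbone.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.embed = nn.Linear(128, embed_dim)   # low-dimensional face embedding
        self.conf = nn.Linear(128, 1)            # per-frame confidence logit

    def forward(self, frames):                   # frames: (B, N, 3, H, W)
        b, n = frames.shape[:2]
        h = self.features(frames.flatten(0, 1))  # encode all frames: (B*N, 128)
        z = self.embed(h).view(b, n, -1)         # per-frame embeddings (B, N, D)
        w = torch.softmax(self.conf(h).view(b, n, 1), dim=1)  # attention weights
        return (w * z).sum(dim=1)                # confidence-weighted fusion (B, D)
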
Cross-Task Representation Learning for Anatomical Landmark Detection
Recently, there has been an increasing demand for automatically detecting
anatomical landmarks, which provide rich structural information to facilitate
subsequent medical image analysis. Current methods for this task often
leverage the power of deep neural networks, but a major challenge in
fine-tuning such models for medical applications arises from the insufficient
number of labeled samples. To address this, we propose to regularize the knowledge
transfer across source and target tasks through cross-task representation
learning. The proposed method is demonstrated for extracting facial anatomical
landmarks which facilitate the diagnosis of fetal alcohol syndrome. The source
and target tasks in this work are face recognition and landmark detection,
respectively. The main idea of the proposed method is to retain the feature
representations of the source model on the target task data, and to leverage
them as an additional source of supervisory signals for regularizing the target
model learning, thereby improving its performance with limited training
samples. Concretely, we present two approaches to the proposed representation
learning, constraining either the final or the intermediate features of the
target model. Experimental results on a clinical face image dataset demonstrate
that the proposed approach works well with little labeled data and outperforms
competing approaches.
Comment: MICCAI-MLMI 2020
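
A minimal sketch (PyTorch) of the feature-retention regularizer described
above, assuming the target (landmark) model also exposes its intermediate
features and that an L2 penalty ties them to the frozen source
(face-recognition) features; the function and variable names are hypothetical.

import torch
import torch.nn.functional as F

def regularized_loss(landmark_model, source_model, images, heatmaps, lam=0.1):
    # Assumed interface: the landmark model returns (predictions, features).
    pred, target_feats = landmark_model(images)
    task_loss = F.mse_loss(pred, heatmaps)       # standard heatmap regression loss
    with torch.no_grad():
        source_feats = source_model(images)      # frozen face-recognition features
    reg = F.mse_loss(target_feats, source_feats) # keep target features close
    return task_loss + lam * reg                 # extra supervisory signal
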
Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity-sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, which is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 2018
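
A minimal sketch (PyTorch) of curriculum hard negative mining for a
cross-modal triplet loss, under the assumption that row i of each batch is a
matching (face, voice) pair and that the negative's rank is annealed from
easy toward hard as training proceeds; all names are illustrative, not the
authors' schedule.

import torch
import torch.nn.functional as F

def triplet_loss_with_curriculum(face_emb, voice_emb, hardness, margin=0.2):
    # face_emb, voice_emb: (B, D) L2-normalised embeddings; row i is one pair.
    sim = face_emb @ voice_emb.t()                      # (B, B) cosine similarities
    pos = sim.diag()                                    # matching-pair similarity
    b = sim.shape[0]
    neg_sim = sim - 1e9 * torch.eye(b, device=sim.device)  # mask out positives
    ranked, _ = neg_sim.sort(dim=1, descending=True)    # most similar negatives first
    k = max(0, int((1.0 - hardness) * (b - 2)))         # hardness in [0, 1]
    neg = ranked[:, k]                                  # harder negatives as k -> 0
    return F.relu(neg - pos + margin).mean()            # standard triplet hinge
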
BRULÈ: Barycenter-Regularized Unsupervised Landmark Extraction
Unsupervised retrieval of image features is vital for many computer vision
tasks where the annotation is missing or scarce. In this work, we propose a new
unsupervised approach to detect the landmarks in images, validating it on the
popular task of human face key-points extraction. The method is based on the
idea of auto-encoding the desired landmarks in the latent space while
discarding non-essential information (and thereby preserving
interpretability). The interpretable latent-space representation (a
bottleneck containing nothing but the desired key-points) is achieved by a new
two-step regularization approach. The first regularization step evaluates
transport distance from a given set of landmarks to their average (the
Wasserstein barycenter). The second regularization step controls
deviations from the barycenter by applying random geometric deformations
synchronously to the initial image and to the encoded landmarks. We demonstrate
the effectiveness of the approach both in unsupervised and semi-supervised
training scenarios using the 300-W, CelebA, and MAFL datasets. The proposed
regularization paradigm is shown to prevent overfitting, and the detection
quality is shown to improve beyond that of state-of-the-art face models.
Comment: 10 main pages with 6 figures and 1 table, 14 pages total with 6
supplementary figures. I.B. and N.B. contributed equally. D.V.D. is the
corresponding author.
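
A minimal sketch (PyTorch) of the second regularization step (synchronous
random deformations), assuming the detector outputs K landmark coordinates in
the [-1, 1] convention of F.affine_grid, where theta maps warped-image
coordinates back to original-image coordinates; the names are illustrative
and the Wasserstein-barycenter step is omitted.

import torch
import torch.nn.functional as F

def equivariance_loss(detector, images, theta):
    # theta: (B, 2, 3) random affine parameters in the F.affine_grid
    # convention, i.e. mapping warped coordinates to original coordinates.
    grid = F.affine_grid(theta, list(images.shape), align_corners=False)
    warped = F.grid_sample(images, grid, align_corners=False)
    pts_orig = detector(images)                 # (B, K, 2), coords in [-1, 1]
    pts_warp = detector(warped)                 # landmarks on the deformed image
    ones = torch.ones(*pts_warp.shape[:2], 1, device=pts_warp.device)
    pts_h = torch.cat([pts_warp, ones], dim=-1) # homogeneous coords (B, K, 3)
    mapped = torch.einsum('bij,bkj->bki', theta, pts_h)  # back to original frame
    return F.mse_loss(mapped, pts_orig)         # landmarks must move with the image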