Unsupervised retrieval of image features is vital for many computer vision
tasks where annotations are missing or scarce. In this work, we propose a new
unsupervised approach to detecting landmarks in images and validate it on the
popular task of human face key-point extraction. The method is based on the
idea of auto-encoding the desired landmarks in the latent space while discarding
non-essential information, thereby preserving interpretability. The
interpretable latent-space representation (a bottleneck containing nothing but
the desired key-points) is achieved by a new
two-step regularization approach. The first regularization step evaluates
the transport distance from a given set of landmarks to an average
configuration (the Wasserstein barycenter). The second regularization step controls
deviations from the barycenter by applying random geometric deformations
synchronously to the initial image and to the encoded landmarks. We demonstrate
the effectiveness of the approach both in unsupervised and semi-supervised
training scenarios using the 300-W, CelebA, and MAFL datasets. The proposed
regularization paradigm is shown to prevent overfitting, and the detection
quality is shown to exceed that of state-of-the-art face models.

Comment: 10 main pages with 6 figures and 1 table, 14 pages total with 6
supplementary figures. I.B. and N.B. contributed equally. D.V.D. is the
corresponding author.
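
The two regularization steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`transport_cost`, `warp`, `equivariance_penalty`), the fixed `reference` shape standing in for the barycenter, the brute-force matching, and the representation of an "image" as a short list of 2-D points are all assumptions made for illustration. For equal-size point sets with uniform weights, the squared 2-Wasserstein distance reduces to the best one-to-one matching, which is what the first function computes.

```python
import math
from itertools import permutations


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def transport_cost(points, reference):
    """Squared 2-Wasserstein distance between two equal-size, uniformly
    weighted point sets, found by brute force over all matchings.
    Only practical for a handful of points (assumption: toy scale)."""
    n = len(points)
    assert n == len(reference)
    best = min(
        sum(sq_dist(points[i], reference[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n


def warp(points, angle, scale, shift):
    """Rotate, scale, and translate a list of 2-D points."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (scale * (c * x - s * y) + shift[0], scale * (s * x + c * y) + shift[1])
        for x, y in points
    ]


def equivariance_penalty(encode, image, angle, scale, shift):
    """Mean squared gap between landmarks encoded from the warped input
    and the warped landmarks encoded from the original input."""
    from_warped = encode(warp(image, angle, scale, shift))
    warped_after = warp(encode(image), angle, scale, shift)
    total = sum(sq_dist(p, q) for p, q in zip(from_warped, warped_after))
    return total / len(warped_after)


# Hypothetical stand-in for the barycenter (an "average" landmark shape).
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Toy "image": here simply the true landmark positions.
image = [(0.2, 0.1), (1.1, 0.0), (0.1, 0.9)]


def collapsed_encoder(img):
    """Ignores the input and always predicts the reference shape."""
    return list(reference)


def identity_encoder(img):
    """Reads the landmarks directly off the toy image."""
    return list(img)


deformation = (0.3, 1.0, (0.5, 0.0))  # angle (radians), scale, shift

for name, enc in [("collapsed", collapsed_encoder), ("identity", identity_encoder)]:
    print(
        name,
        "transport:", transport_cost(enc(image), reference),
        "equivariance:", equivariance_penalty(enc, image, *deformation),
    )
```

In this toy setting the collapsed encoder has zero transport cost to the reference shape but a positive equivariance penalty, while the identity encoder shows the opposite pattern. That is one way to read the abstract's statement that the second step controls deviations from the barycenter: the two terms pull in different directions, and a real model would balance them.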