26 research outputs found

    Feature extraction on faces : from landmark localization to depth estimation

    Get PDF
    Le sujet de cette thèse porte sur les algorithmes d'apprentissage qui extraient les caractéristiques importantes des visages. Les caractéristiques d’intérêt principal sont des points clés; La localisation en deux dimensions (2D) ou en trois dimensions (3D) de traits importants du visage telles que le centre des yeux, le bout du nez et les coins de la bouche. Les points clés sont utilisés pour résoudre des tâches complexes qui ne peuvent pas être résolues directement ou qui requièrent du guidage pour l’obtention de performances améliorées, telles que la reconnaissance de poses ou de gestes, le suivi ou la vérification du visage. L'application des modèles présentés dans cette thèse concerne les images du visage; cependant, les algorithmes proposés sont plus généraux et peuvent être appliqués aux points clés de d'autres objets, tels que les mains, le corps ou des objets fabriqués par l'homme. Cette thèse est écrite par article et explore différentes techniques pour résoudre plusieurs aspects de la localisation de points clés. Dans le premier article, nous démêlons l'identité et l'expression d'un visage donné pour apprendre une distribution à priori sur l'ensemble des points clés. Cette distribution à priori est ensuite combinée avec un classifieur discriminant qui apprend une distribution de probabilité indépendante par point clé. Le modèle combiné est capable d'expliquer les différences dans les expressions pour une même représentation d'identité. Dans le deuxième article, nous proposons une architecture qui vise à conserver les caractéristiques d’images pour effectuer des tâches qui nécessitent une haute précision au niveau des pixels, telles que la localisation de points clés ou la segmentation d’images. L’architecture proposée extrait progressivement les caractéristiques les plus grossières dans les étapes d'encodage pour obtenir des informations plus globales sur l’image. Ensuite, il étend les caractéristiques grossières pour revenir à la résolution de l'image originale en recombinant les caractéristiques du chemin d'encodage. Le modèle, appelé Réseaux de Recombinaison, a obtenu l’état de l’art sur plusieurs jeux de données, tout en accélérant le temps d’apprentissage. Dans le troisième article, nous visons à améliorer la localisation des points clés lorsque peu d'images comportent des étiquettes sur des points clés. En particulier, nous exploitons une forme plus faible d’étiquettes qui sont plus faciles à acquérir ou plus abondantes tel que l'émotion ou la pose de la tête. Pour ce faire, nous proposons une architecture permettant la rétropropagation du gradient des étiquettes les plus faibles à travers des points clés, ainsi entraînant le réseau de localisation des points clés. Nous proposons également une composante de coût non supervisée qui permet des prédictions de points clés équivariantes en fonction des transformations appliquées à l'image, sans avoir les vraies étiquettes des points clés. Ces techniques ont considérablement amélioré les performances tout en réduisant le pourcentage d'images étiquetées par points clés. Finalement, dans le dernier article, nous proposons un algorithme d'apprentissage permettant d'estimer la profondeur des points clés sans aucune supervision de la profondeur. Nous y parvenons en faisant correspondre les points clés de deux visages en les transformant l'un vers l'autre. Cette transformation nécessite une estimation de la profondeur sur un visage, ainsi que une transformation affine qui transforme le premier visage au deuxième. Nous démontrons que notre formulation ne nécessite que la profondeur et que les paramètres affines peuvent être estimés avec un solution analytique impliquant les points clés augmentés par profondeur. Même en l'absence de supervision directe de la profondeur, la technique proposée extrait des valeurs de profondeur raisonnables qui diffèrent des vraies valeurs de profondeur par un facteur d'échelle et de décalage. Nous démontrons des applications d'estimation de profondeur pour la tâche de rotation de visage, ainsi que celle d'échange de visage.This thesis focuses on learning algorithms that extract important features from faces. The features of main interest are landmarks; the two dimensional (2D) or three dimensional (3D) locations of important facial features such as eye centers, nose tip, and mouth corners. Landmarks are used to solve complex tasks that cannot be solved directly or require guidance for enhanced performance, such as pose or gesture recognition, tracking, or face verification. The application of the models presented in this thesis is on facial images; however, the algorithms proposed are more general and can be applied to the landmarks of other forms of objects, such as hands, full body or man-made objects. This thesis is written by article and explores different techniques to solve various aspects of landmark localization. In the first article, we disentangle identity and expression of a given face to learn a prior distribution over the joint set of landmarks. This prior is then merged with a discriminative classifier that learns an independent probability distribution per landmark. The merged model is capable of explaining differences in expressions for the same identity representation. In the second article, we propose an architecture that aims at uncovering image features to do tasks that require high pixel-level accuracy, such as landmark localization or image segmentation. The proposed architecture gradually extracts coarser features in its encoding steps to get more global information over the image and then it expands the coarse features back to the image resolution by recombining the features of the encoding path. The model, termed Recombinator Networks, obtained state-of-the-art on several datasets, while also speeding up training. In the third article, we aim at improving landmark localization when only a few images with labelled landmarks are available. In particular, we leverage a weaker form of data labels that are easier to acquire or more abundantly available such as emotion or head pose. To do so, we propose an architecture to backpropagate gradients of the weaker labels through landmarks, effectively training the landmark localization network. We also propose an unsupervised loss component which makes equivariant landmark predictions with respect to transformations applied to the image without having ground truth landmark labels. These techniques improved performance considerably when we have a low percentage of labelled images with landmarks. Finally, in the last article, we propose a learning algorithm to estimate the depth of the landmarks without any depth supervision. We do so by matching landmarks of two faces through transforming one to another. This transformation requires estimation of depth on one face and an affine transformation that maps the first face to the second one. Our formulation, which only requires depth estimation and affine parameters, can be estimated as a closed form solution of the 2D landmarks and the estimated depth. Even without direct depth supervision, the proposed technique extracts reasonable depth values that differ from the ground truth depth values by a scale and a shift. We demonstrate applications of the estimated depth in face rotation and face replacement tasks

    Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation

    Full text link
    Deep neural networks with alternating convolutional, max-pooling and decimation layers are widely used in state of the art architectures for computer vision. Max-pooling purposefully discards precise spatial information in order to create features that are more robust, and typically organized as lower resolution spatial feature maps. On some tasks, such as whole-image classification, max-pooling derived features are well suited; however, for tasks requiring precise localization, such as pixel level prediction and segmentation, max-pooling destroys exactly the information required to perform well. Precise localization may be preserved by shallow convnets without pooling but at the expense of robustness. Can we have our max-pooled multi-layered cake and eat it too? Several papers have proposed summation and concatenation based methods for combining upsampled coarse, abstract features with finer features to produce robust pixel level predictions. Here we introduce another model --- dubbed Recombinator Networks --- where coarse features inform finer features early in their formation such that finer features can make use of several layers of computation in deciding how to use coarse features. The model is trained once, end-to-end and performs better than summation-based architectures, reducing the error from the previous state of the art on two facial keypoint datasets, AFW and AFLW, by 30\% and beating the current state-of-the-art on 300W without using extra data. We improve performance even further by adding a denoising prediction model based on a novel convnet formulation.Comment: accepted in CVPR 201

    Under Uncertainty Trust Estimation in Multi-Valued Settings

    Get PDF
    Social networking sites have developed considerably during the past couple of years. However, few websites exploit the potentials of combining the social networking sites with online markets. This, in turn, would help users to distinguish and engage into interaction with other unknown, yet trustworthy, users in the market. In this thesis, we develop a model to estimate the trust of unknown agents in a multi-agent system where agents engage into business-oriented interactions with each other. The proposed trust model estimates the degree of trustworthiness of an unknown target agent through the information acquired from a group of advisor agents, who had direct interactions with the target agent. This problem is addressed when: (1) the trust of both advisor and target agents is subject to some uncertainty; (2) the advisor agents are self-interested and provide misleading accounts of their past experiences with the target agents; and (3) the outcome of each interaction between the agents is multi-valued. We use possibility distributions to model trust with respect to its uncertainties thanks to its potential capability of modeling uncertainty arisen from both variability and ignorance. Moreover, we propose trust estimation models to approximate the degree of trustworthiness of an unknown target agent in the two following problems: (1) in the first problem, the advisor agents are assumed to be unknown and have an unknown level of trustworthiness; and (2) in the second problem, however, some interactions are carried out with the advisor agents and their trust distributions are modeled. In addition, a certainty metric is proposed in the possibilistic domain, measuring the confidence of an agent in the reports of its advisors which considers the consistency in the advisors’ reported information and the advisors’ degree of trustworthiness. Finally, we validate the proposed approaches through extensive experiments in various settings

    Improving Facial Analysis and Performance Driven Animation through Disentangling Identity and Expression

    Full text link
    We present techniques for improving performance driven facial animation, emotion recognition, and facial key-point or landmark prediction using learned identity invariant representations. Established approaches to these problems can work well if sufficient examples and labels for a particular identity are available and factors of variation are highly controlled. However, labeled examples of facial expressions, emotions and key-points for new individuals are difficult and costly to obtain. In this paper we improve the ability of techniques to generalize to new and unseen individuals by explicitly modeling previously seen variations related to identity and expression. We use a weakly-supervised approach in which identity labels are used to learn the different factors of variation linked to identity separately from factors related to expression. We show how probabilistic modeling of these sources of variation allows one to learn identity-invariant representations for expressions which can then be used to identity-normalize various procedures for facial expression analysis and animation control. We also show how to extend the widely used techniques of active appearance models and constrained local models through replacing the underlying point distribution models which are typically constructed using principal component analysis with identity-expression factorized representations. We present a wide variety of experiments in which we consistently improve performance on emotion recognition, markerless performance-driven facial animation and facial key-point tracking.Comment: to appear in Image and Vision Computing Journal (IMAVIS

    Improving Landmark Localization with Semi-Supervised Learning

    Full text link
    We present two techniques to improve landmark localization in images from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification or regression tasks related to the landmarks are more abundantly available. First, we propose the framework of sequential multitasking and explore it here through an architecture for landmark localization where training with class labels acts as an auxiliary signal to guide the landmark localization on unlabeled data. A key aspect of our approach is that errors can be backpropagated through a complete landmark localization model. Second, we propose and explore an unsupervised learning technique for landmark localization based on having a model predict equivariant landmarks with respect to transformations applied to the image. We show that these techniques, improve landmark prediction considerably and can learn effective detectors even when only a small fraction of the dataset has landmark labels. We present results on two toy datasets and four real datasets, with hands and faces, and report new state-of-the-art on two datasets in the wild, e.g. with only 5\% of labeled images we outperform previous state-of-the-art trained on the AFLW dataset.Comment: Published as a conference paper in CVPR 201

    Detecting Road Obstacles by Erasing Them

    Full text link
    Vehicles can encounter a myriad of obstacles on the road, and it is not feasible to record them all beforehand to train a detector. Our method selects image patches and inpaints them with the surrounding road texture, which tends to remove obstacles from those patches. It them uses a network trained to recognize discrepancies between the original patch and the inpainted one, which signals an erased obstacle. We also contribute a new dataset for monocular road obstacle detection, and show that our approach outperforms the state-of-the-art methods on both our new dataset and the standard Fishyscapes Lost & Found benchmark

    Perspective Aware Road Obstacle Detection

    Full text link
    While road obstacle detection techniques have become increasingly effective, they typically ignore the fact that, in practice, the apparent size of the obstacles decreases as their distance to the vehicle increases. In this paper, we account for this by computing a scale map encoding the apparent size of a hypothetical object at every image location. We then leverage this perspective map to (i) generate training data by injecting synthetic objects onto the road in a more realistic fashion than existing methods; and (ii) incorporate perspective information in the decoding part of the detection network to guide the obstacle detector. Our results on standard benchmarks show that, together, these two strategies significantly boost the obstacle detection performance, allowing our approach to consistently outperform state-of-the-art methods in terms of instance-level obstacle detection

    Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation

    Full text link
    In the presence of annotated data, deep human pose estimation networks yield impressive performance. Nevertheless, annotating new data is extremely time-consuming, particularly in real-world conditions. Here, we address this by leveraging contrastive self-supervised (CSS) learning to extract rich latent vectors from single-view videos. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally-distant ones as negative pairs as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and away features, yields a rich latent space, well-suited for human pose estimation. Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques
    corecore