177 research outputs found

    AnchorFace: An Anchor-based Facial Landmark Detector Across Large Poses

    Full text link
    Facial landmark localization aims to detect predefined points on human faces, and the field has advanced rapidly with the recent development of neural-network-based methods. However, it remains a challenging task when dealing with faces in unconstrained scenarios, especially under large pose variations. In this paper, we target the problem of facial landmark localization across large poses and address this task with a split-and-aggregate strategy. To split the search space, we propose a set of anchor templates as references for regression, which effectively handles the large variation in face poses. Based on the prediction of each anchor template, we then aggregate the results, which reduces the landmark uncertainty caused by large poses. Overall, our proposed approach, named AnchorFace, obtains state-of-the-art results with extremely efficient inference speed on four challenging benchmarks, i.e., the AFLW, 300W, Menpo, and WFLW datasets. Code will be available at https://github.com/nothingelse92/AnchorFace.
    Comment: To appear in AAAI 202
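
    As a hedged aside, the split-and-aggregate strategy described above lends itself to a compact sketch. The snippet below is illustrative only, not the released AnchorFace code: a set of pose-specific anchor templates is refined by regressed offsets, and the per-anchor predictions are fused by a confidence-weighted average. All names and shapes are assumptions.

```python
import numpy as np

def aggregate_anchor_predictions(anchors, offsets, scores):
    """Fuse per-anchor landmark regressions into a single estimate.

    anchors: (K, N, 2) anchor landmark templates (the "split" step)
    offsets: (K, N, 2) per-anchor offsets regressed by a network
    scores:  (K,)      per-anchor confidence logits
    Returns: (N, 2)    aggregated landmarks (the "aggregate" step)
    """
    preds = anchors + offsets               # each template is refined toward the face
    w = np.exp(scores - scores.max())
    w = w / w.sum()                         # softmax over the K anchors
    return np.tensordot(w, preds, axes=1)   # confidence-weighted average

# toy usage: 3 pose anchors, 5 landmarks
K, N = 3, 5
anchors = np.random.rand(K, N, 2)
offsets = 0.05 * np.random.randn(K, N, 2)
scores = np.array([0.2, 1.5, -0.3])
print(aggregate_anchor_predictions(anchors, offsets, scores).shape)  # (5, 2)
```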

    Fast facial landmark detection and applications: a survey of the literature

    Get PDF
    Dense facial landmark detection is one of the key elements of the face processing pipeline. It is used in virtual face reenactment, emotion recognition, driver status tracking, etc. Early approaches were suitable for facial landmark detection in controlled environments only, which is clearly insufficient. Neural networks have shown an astonishing qualitative improvement on the in-the-wild face landmark detection problem and are now being studied by many researchers in the field. Numerous bright ideas have been proposed, often complementary to each other. However, exploring the whole volume of novel approaches is quite challenging. Therefore, we present this survey, in which we summarize state-of-the-art algorithms into categories and provide a comparison of recently introduced in-the-wild datasets (e.g., 300W, AFLW, COFW, WFLW) that contain images with large pose variation and face occlusion, taken in unconstrained conditions. In addition to quality, applications require fast inference, preferably on mobile devices. Hence, we include information about algorithm inference speed on both desktop and mobile hardware, which is rarely studied. Importantly, we highlight problems of the algorithms, their applications and vulnerabilities, and briefly touch on established methods. We hope that the reader will find many novel ideas, see how the algorithms are used in applications, and be enabled to carry out further research.
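
    Since the survey emphasizes measuring inference speed on desktop and mobile hardware, a simple latency harness gives a sense of the kind of measurement involved. The sketch below is illustrative only: warm-up runs followed by a median over timed runs, with the `predict` callable standing in for any landmark detector.

```python
import time
import statistics
import numpy as np

def benchmark_latency(predict, image, warmup=10, runs=100):
    """Median single-image latency in milliseconds for a detector.

    predict: any callable mapping an image to landmarks (a stand-in here)
    """
    for _ in range(warmup):          # warm caches and lazy initialization
        predict(image)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        predict(image)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# toy usage with a dummy "detector"
dummy = lambda img: img.mean(axis=(0, 1))
img = np.zeros((256, 256, 3), dtype=np.float32)
print(f"{benchmark_latency(dummy, img):.3f} ms")
```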

    Feature extraction on faces: from landmark localization to depth estimation

    Get PDF
    This thesis focuses on learning algorithms that extract important features from faces. The features of main interest are landmarks: the two-dimensional (2D) or three-dimensional (3D) locations of important facial features such as the eye centers, nose tip, and mouth corners. Landmarks are used to solve complex tasks that cannot be solved directly or that require guidance for enhanced performance, such as pose or gesture recognition, tracking, or face verification. The models presented in this thesis are applied to facial images; however, the proposed algorithms are more general and can be applied to the landmarks of other kinds of objects, such as hands, full bodies, or man-made objects. This thesis is organized by articles and explores different techniques to solve various aspects of landmark localization. In the first article, we disentangle the identity and expression of a given face to learn a prior distribution over the joint set of landmarks. This prior is then merged with a discriminative classifier that learns an independent probability distribution per landmark. The merged model is capable of explaining differences in expression for the same identity representation. In the second article, we propose an architecture aimed at preserving image features for tasks that require high pixel-level accuracy, such as landmark localization or image segmentation. The proposed architecture gradually extracts coarser features in its encoding steps to gather more global information over the image, and then expands the coarse features back to the image resolution by recombining them with the features of the encoding path. The model, termed Recombinator Networks, obtained state-of-the-art results on several datasets while also speeding up training. In the third article, we aim at improving landmark localization when only a few images with labelled landmarks are available. In particular, we leverage a weaker form of data label that is easier to acquire or more abundantly available, such as emotion or head pose. To do so, we propose an architecture that backpropagates gradients of the weaker labels through the landmarks, effectively training the landmark localization network. We also propose an unsupervised loss component that makes landmark predictions equivariant with respect to transformations applied to the image, without requiring ground-truth landmark labels. These techniques improved performance considerably when only a low percentage of images is labelled with landmarks. Finally, in the last article, we propose a learning algorithm to estimate the depth of the landmarks without any depth supervision. We do so by matching the landmarks of two faces, transforming one into the other. This transformation requires estimating the depth on one face together with an affine transformation that maps the first face to the second. We show that our formulation requires estimating only the depth, and that the affine parameters can be obtained in closed form from the depth-augmented 2D landmarks. Even without direct depth supervision, the proposed technique extracts reasonable depth values, which differ from the ground-truth depth values by a scale and a shift. We demonstrate applications of the estimated depth in face rotation and face replacement tasks.
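
    The unsupervised equivariance loss from the third article can be sketched as follows, under explicit assumptions: `detect` is a stand-in landmark model, the same affine transform is applied to both the image pixels and the predicted coordinates, and the loss penalizes any disagreement between the two paths.

```python
import numpy as np

def apply_affine(points, A):
    """Apply a 2x3 affine transform A to (N, 2) landmark coordinates."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 3)
    return homo @ A.T                                                   # (N, 2)

def equivariance_loss(detect, image, warp_image, A):
    """Unlabeled-image loss: landmarks of a warped image should match
    the warped landmarks of the original image.

    detect:     callable image -> (N, 2) landmarks (assumed model)
    warp_image: callable applying the same affine A to the pixels
    """
    lm_of_warped = detect(warp_image(image, A))   # predict on transformed image
    warped_lm = apply_affine(detect(image), A)    # transform the prediction
    return np.mean((lm_of_warped - warped_lm) ** 2)

# toy check: identity transform, dummy detector -> loss is 0.0
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
detect = lambda img: np.array([[10.0, 20.0], [30.0, 40.0]])
print(equivariance_loss(detect, np.zeros((64, 64)), lambda im, A: im, A))
```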

    Graph-based Facial Affect Analysis: A Review of Methods, Applications and Challenges

    Full text link
    Facial affect analysis (FAA) using visual signals is important in human-computer interaction. Early methods focus on extracting appearance and geometry features associated with human affects while ignoring the latent semantic information among individual facial changes, leading to limited performance and generalization. Recent work attempts to establish graph-based representations to model these semantic relationships and to develop frameworks that leverage them for various FAA tasks. In this paper, we provide a comprehensive review of graph-based FAA, including the evolution of algorithms and their applications. First, the FAA background knowledge is introduced, with particular attention to the role of the graph. We then discuss approaches that are widely used for graph-based affective representation in the literature and show a trend in graph construction. For relational reasoning in graph-based FAA, existing studies are categorized according to their use of traditional methods or deep models, with special emphasis on the latest graph neural networks. Performance comparisons of state-of-the-art graph-based FAA methods are also summarized. Finally, we discuss the challenges and potential directions. As far as we know, this is the first survey of graph-based FAA methods. Our findings can serve as a reference for future research in this field.
    Comment: 20 pages, 12 figures, 5 tables
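
    As a hedged illustration of graph-based affective representation (a common construction, not a specific method from the survey), one can take facial landmarks as nodes and connect each to its k nearest spatial neighbors:

```python
import numpy as np

def knn_adjacency(landmarks, k=3):
    """Build a symmetric adjacency matrix over facial landmarks,
    connecting each landmark to its k nearest spatial neighbors.

    landmarks: (N, 2) coordinates. Returns an (N, N) {0,1} matrix.
    """
    n = len(landmarks)
    d = np.linalg.norm(landmarks[:, None] - landmarks[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude self-loops
    nearest = np.argsort(d, axis=1)[:, :k]  # indices of the k closest nodes
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, nearest.ravel()] = 1
    return np.maximum(adj, adj.T)           # symmetrize: undirected graph

# toy usage on four points
pts = np.array([[0, 0], [1, 0], [0, 1], [5, 5]], dtype=float)
print(knn_adjacency(pts, k=1))
```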

    Deep face recognition in the wild

    Get PDF
    Face recognition has attracted particular interest in biometric recognition, with wide applications in security, entertainment, health, and marketing. Recent years have witnessed the rapid development of face recognition techniques in both academic and industrial fields with the advent of (a) large amounts of available annotated training data, (b) Convolutional Neural Network (CNN) based deep architectures, (c) affordable, powerful computation resources, and (d) advanced loss functions. Despite the significant improvement and success, there are still challenges to be tackled. This thesis contributes to in-the-wild face recognition from three perspectives: network design, model compression, and model explanation. Firstly, although facial landmarks capture pose, expression, and shape information, they are used only as a pre-processing step in the current face recognition pipeline, without considering their potential for improving the model's representation. Thus, we propose the "FAN-Face" framework, which gradually integrates features from different layers of a facial landmark localization network into different layers of the recognition network. This operation breaks the align-and-crop data pre-processing routine but achieves a simple, orthogonal improvement to deep face recognition. We attribute this success to the coarse-to-fine shape-related information stored in the alignment network, which helps establish correspondence for face matching. Secondly, motivated by the success of knowledge distillation for model compression in object classification, we examine current knowledge distillation methods for training lightweight face recognition models. Taking into account the classification problem at hand, we advocate a direct feature matching approach in which the pre-trained classifier of the teacher validates the feature representation from the student network. In addition, as the teacher network trained on the labeled dataset alone is capable of capturing rich relational information among labels in both class space and feature space, we make a first attempt to use unlabeled data to further enhance the model's performance within the knowledge distillation framework. Finally, to increase the interpretability of the "black box" deep face recognition model, we develop a new structure with dynamic convolution that provides clustering of faces in terms of facial attributes. In particular, we propose to cluster the routing weights of dynamic convolution experts to learn facial attributes in an unsupervised manner without forfeiting face recognition accuracy. We also introduce group convolution into dynamic convolution to increase the expert granularity. We further confirm that the routing vector benefits feature-based face reconstruction via the deep inversion technique.
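
    The direct feature matching idea from the second contribution can be sketched as below. This is an interpretation of the abstract, not the thesis code, and all module and tensor names are assumptions: the student's features are matched to the teacher's, and the teacher's frozen classifier scores the student features against the ground-truth identities.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_feat, teacher_feat, teacher_classifier, labels, alpha=1.0):
    """Sketch of direct feature-matching distillation:
    (1) match student features to teacher features;
    (2) let the teacher's frozen classifier "validate" the student
        features by classifying them against the true identities.
    """
    match = F.mse_loss(student_feat, teacher_feat)   # feature matching term
    logits = teacher_classifier(student_feat)        # frozen teacher head
    validate = F.cross_entropy(logits, labels)       # teacher validates student
    return match + alpha * validate

# the teacher's classifier stays frozen during student training, e.g.:
# for p in teacher_classifier.parameters():
#     p.requires_grad_(False)
```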

    Computational Methods for Measurement of Visual Attention from Videos towards Large-Scale Behavioral Analysis

    Get PDF
    Visual attention is one of the most important aspects of human social behavior, visual navigation, and interaction with the world, revealing information about social, cognitive, and affective states. Although monitor-based and wearable eye trackers are widely available, they are not sufficient to support the large-scale collection of naturalistic gaze data in face-to-face social interactions or during interactions with 3D environments. Wearable eye trackers are burdensome to participants and bring issues of calibration, compliance, cost, and battery life. The ability to automatically measure attention from ordinary videos would deliver scalable, dense, and objective measurements for use in practice. This thesis investigates several computational methods to measure visual attention from videos using computer vision, and their use for quantifying visual social cues such as eye contact and joint attention. Specifically, three methods are investigated. First, I present methods for detecting looks to the camera in first-person view and their use for eye contact detection. Experimental results show that the presented method achieves the first human-expert-level detection of eye contact. Second, I develop a method for tracking heads in 3D space to measure attentional shifts. Lastly, I propose spatiotemporal deep neural networks for detecting time-varying attention targets in video and present their application to the detection of shared attention and joint attention. The method achieves state-of-the-art results on different benchmark datasets for attention measurement, as well as the first empirical result on clinically relevant gaze shift classification. The presented approaches have the benefit of linking gaze estimation to the broader tasks of action recognition and dynamic visual scene understanding, and bear potential as a useful tool for understanding attention in contexts such as human social interactions, skill assessment, and human-robot interaction.
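
    Frame-level eye contact detection is typically followed by some temporal grouping of per-frame scores into episodes; the sketch below is illustrative only (it is not from the thesis), with the frame classifier assumed to already produce per-frame probabilities.

```python
import numpy as np

def eye_contact_episodes(frame_scores, threshold=0.5, min_len=5):
    """Group per-frame eye-contact probabilities into contiguous episodes.

    frame_scores: (T,) probability of camera-directed gaze per frame
                  (from an assumed frame-level classifier)
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    on = frame_scores >= threshold
    episodes, start = [], None
    for t, flag in enumerate(on):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            if t - start >= min_len:      # drop blips shorter than min_len
                episodes.append((start, t))
            start = None
    if start is not None and len(on) - start >= min_len:
        episodes.append((start, len(on)))
    return episodes

# toy usage: one 5-frame episode survives the minimum-length filter
scores = np.array([0.1, 0.8, 0.9, 0.9, 0.95, 0.9, 0.2])
print(eye_contact_episodes(scores, min_len=3))  # [(1, 6)]
```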