
    Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression

    Heatmap regression with a deep network has become one of the mainstream approaches to localizing facial landmarks. However, the loss function for heatmap regression is rarely studied. In this paper, we analyze the properties an ideal loss function should have for heatmap regression in face alignment. We then propose a novel loss function, named Adaptive Wing loss, that adapts its shape to different types of ground-truth heatmap pixels: it imposes a heavier penalty on foreground pixels and a lighter one on background pixels. To address the imbalance between foreground and background pixels, we also propose a Weighted Loss Map, which assigns high weights to foreground and difficult background pixels so that training focuses on the pixels crucial to landmark localization. To further improve face alignment accuracy, we introduce boundary prediction and CoordConv with boundary coordinates. Extensive experiments on different benchmarks, including COFW, 300W and WFLW, show our approach outperforms the state-of-the-art by a significant margin on various evaluation metrics. The Adaptive Wing loss also helps other heatmap regression tasks. Code will be made publicly available at https://github.com/protossw512/AdaptiveWingLoss
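
    The piecewise form of this loss is compact enough to sketch directly. A minimal PyTorch sketch, assuming the defaults reported by the authors (alpha=2.1, omega=14, epsilon=1, theta=0.5); the constants A and C are chosen so the two branches join continuously at theta:

        import torch

        def adaptive_wing_loss(pred, target, alpha=2.1, omega=14.0,
                               epsilon=1.0, theta=0.5):
            # target lies in [0, 1]; the exponent (alpha - target) adapts the
            # loss shape per pixel, steeper on foreground than on background.
            delta = (target - pred).abs()
            power = alpha - target
            # Log-shaped branch for small errors.
            loss_small = omega * torch.log1p((delta / epsilon) ** power)
            # Linear branch for large errors; A and C enforce continuity at theta.
            a = (omega * (1.0 / (1.0 + (theta / epsilon) ** power)) * power
                 * ((theta / epsilon) ** (power - 1.0)) / epsilon)
            c = theta * a - omega * torch.log1p((theta / epsilon) ** power)
            loss_large = a * delta - c
            return torch.where(delta < theta, loss_small, loss_large).mean()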

    PFLD: A Practical Facial Landmark Detector

    Being accurate, efficient, and compact is essential to a facial landmark detector for practical use. To simultaneously address these three concerns, this paper investigates a neat model with promising detection accuracy in wild environments (e.g., unconstrained pose, expression, lighting, and occlusion conditions) and super real-time speed on a mobile device. More concretely, we customize an end-to-end single-stage network associated with acceleration techniques. During the training phase, for each sample, rotation information is estimated for geometrically regularizing landmark localization; it is then NOT involved in the testing phase. A novel loss is designed to, besides imposing the geometric regularization, mitigate the issue of data imbalance by adjusting the weights of samples according to different states, such as large pose, extreme lighting, and occlusion, in the training set. Extensive experiments demonstrate the efficacy of our design and reveal its superior performance over state-of-the-art alternatives on widely adopted challenging benchmarks, i.e., 300W (including iBUG, LFPW, AFW, HELEN, and XM2VTS) and AFLW. Our model can be merely 2.1 MB in size and reach over 140 fps per face on a mobile phone (Qualcomm ARM 845 processor) with high precision, making it attractive for large-scale or real-time applications. We have made our practical system based on the PFLD 0.25X model publicly available at http://sites.google.com/view/xjguo/fld to encourage comparisons and improvements from the community.
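
    A rough Python sketch of the weighting scheme described above: the landmark L2 error of each sample is scaled by a head-pose deviation term and by a per-sample state weight. All names here are hypothetical and the exact formulation in the paper may differ:

        import torch

        def pfld_style_loss(pred_lmk, gt_lmk, pred_euler, gt_euler, state_weight):
            # pred_lmk, gt_lmk: (B, N, 2); pred_euler, gt_euler: (B, 3) angles;
            # state_weight: (B,) weights for rare states (large pose, occlusion, ...).
            angle_penalty = torch.sum(1.0 - torch.cos(pred_euler - gt_euler), dim=1)
            l2 = torch.sum((pred_lmk - gt_lmk) ** 2, dim=(1, 2))
            return torch.mean(state_weight * angle_penalty * l2)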

    Robust Facial Landmarks Localization with Applications in Facial Biometrics

    Localization of regions of interest in images and videos is a well-studied problem in the computer vision community. Usually localization tasks imply localizing objects in a given image, such as detection and segmentation of objects in images. However, the regions of interest can be limited to a single pixel, as in the tasks of facial landmark localization or human pose estimation. This dissertation studies robust facial landmark detection algorithms for faces in the wild using learning methods based on Convolutional Neural Networks.

    Detection of specific keypoints on face images is an integral pre-processing step in facial biometrics and numerous other applications, including face verification and identification. Detecting keypoints allows face images to be aligned to a canonical coordinate system using geometric transforms, such as similarity or affine transformations, mitigating the adverse effects of rotation and scaling. This challenging problem has become more attractive in recent years as a result of advances in deep learning and the release of more unconstrained datasets. The research community is pushing boundaries to achieve ever better performance on unconstrained images, where the images are diverse in pose, expression and lighting conditions.

    Over the years, researchers have developed various hand-crafted techniques to extract meaningful features from faces, most of them being appearance- and geometry-based features. However, these features do not perform well on data collected in unconstrained settings due to large variations in appearance and other nuisance factors. Convolutional Neural Networks (CNNs) have become prominent because of their ability to extract discriminative features. Unlike hand-crafted features, DCNNs perform feature extraction and feature classification from the data itself in an end-to-end fashion. This enables DCNNs to be robust to variations present in the data and at the same time improves their discriminative ability.

    In this dissertation, we discuss three different methods for facial keypoint detection based on Convolutional Neural Networks. The methods are generic and can be extended to the related problem of keypoint detection for human pose estimation. The first method, called Cascaded Local Deep Descriptor Regression, uses deep features extracted around local points to learn linear regressors that incrementally correct the initial estimate of the keypoints. In the second method, called KEPLER, we develop efficient heatmap CNNs to directly learn the non-linear mapping between the input and target spaces. We also apply different regularization techniques to tackle the effects of imbalanced data and vanishing gradients. In the third method, we model the spatial correlation between different keypoints using Pose Conditioned Convolution Deconvolution Networks (PCD-CNN), while at the same time making the model pose agnostic by disentangling pose from the face image. Next, we show an application of facial landmark localization used to align face images for the task of apparent age estimation of humans from unconstrained images.

    In the fourth part of this dissertation we discuss the impact of good-quality landmarks on the task of face verification. Previously proposed methods perform with reasonable accuracy on high-resolution, good-quality images, but fail when the input image suffers from degradation. To this end, we propose a semi-supervised method which aims at predicting landmarks in low-quality images. This method learns to predict landmarks in low-resolution images by learning to model the learning process of high-resolution images. In this algorithm, we use Generative Adversarial Networks, which first learn to model the distribution of real low-resolution images, after which another CNN learns to model the distribution of heatmaps on those images. Additionally, we propose another high-quality facial landmark detection method, which is currently state of the art.

    Finally, we discuss the extension of ideas developed for facial keypoint localization to the task of human pose estimation, which is one of the important cues for human activity recognition. As in PCD-CNN, the parts of the human body can also be modelled in a tree structure, where the relationships between these parts are learnt through convolutions while being conditioned on the 3D pose and orientation. Another interesting avenue for research is extending facial landmark localization to naturally degraded images.
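
    The canonical alignment step mentioned above is easy to illustrate. A minimal sketch using scikit-image, assuming five detected landmarks (eye centres, nose tip, mouth corners); the 112x112 template coordinates are illustrative placeholders, not values from the dissertation:

        import numpy as np
        from skimage.transform import SimilarityTransform, warp

        # Hypothetical canonical positions of five landmarks in a 112x112 crop.
        TEMPLATE = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                             [41.5, 92.4], [70.7, 92.2]])

        def align_face(image, landmarks):
            # Fit a similarity transform (rotation, scale, translation) that maps
            # the detected landmarks onto the template, then warp the image.
            tform = SimilarityTransform()
            tform.estimate(np.asarray(landmarks, dtype=np.float64), TEMPLATE)
            return warp(image, tform.inverse, output_shape=(112, 112))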

    Face Alignment using a 3D Deeply-initialized Ensemble of Regression Trees

    Face alignment algorithms locate a set of landmark points in images of faces taken in unrestricted situations. State-of-the-art approaches typically fail or lose accuracy in the presence of occlusions, strong deformations, large pose variations and ambiguous configurations. In this paper we present 3DDE, a robust and efficient face alignment algorithm based on a coarse-to-fine cascade of ensembles of regression trees. It is initialized by robustly fitting a 3D face model to the probability maps produced by a convolutional neural network. With this initialization we address self-occlusions and large face rotations. Further, the regressor implicitly imposes a prior face shape on the solution, addressing occlusions and ambiguous face configurations. Its coarse-to-fine structure tackles the combinatorial explosion of part deformations. In the experiments performed, 3DDE improves the state-of-the-art on the 300W, COFW, AFLW and WFLW data sets. Finally, we perform cross-dataset experiments that reveal the existence of a significant data set bias in these benchmarks.
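
    The cascade follows the standard cascaded-regression recipe: each stage predicts a shape increment from features indexed by the current estimate. A generic Python sketch, with extract_features and the per-stage regressors left as placeholders (the paper uses tree ensembles on top of a CNN-derived 3D initialization):

        import numpy as np

        def cascaded_alignment(image, init_shape, regressors, extract_features):
            # init_shape: (N, 2) landmarks, e.g. projected from the fitted 3D model.
            shape = np.asarray(init_shape, dtype=np.float64).copy()
            for regressor in regressors:                # coarse-to-fine stages
                feats = extract_features(image, shape)  # shape-indexed features
                shape += regressor.predict(feats).reshape(-1, 2)
            return shape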

    High-Resolution Representations for Labeling Pixels and Regions

    High-resolution representation learning plays an essential role in many vision problems, e.g., pose estimation and semantic segmentation. The high-resolution network (HRNet) [SunXLW19], recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel, and produces strong high-resolution representations by repeatedly conducting fusions across the parallel convolutions. In this paper, we conduct a further study on high-resolution representations by introducing a simple yet effective modification and apply it to a wide range of vision tasks. We augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions, rather than only the representation from the high-resolution convolution as done in [SunXLW19]. This simple modification leads to stronger representations, evidenced by superior results. We show top results in semantic segmentation on Cityscapes, LIP, and PASCAL-Context, and facial landmark detection on AFLW, COFW, 300W, and WFLW. In addition, we build a multi-level representation from the high-resolution representation and apply it to the Faster R-CNN object detection framework and its extended frameworks. The proposed approach achieves superior results to existing single-model networks on COCO object detection. The code and models are publicly available at https://github.com/HRNet
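
    The modification is small enough to show directly. A PyTorch sketch of the aggregation step, assuming branches is a list of feature maps from the parallel streams, ordered from highest to lowest resolution:

        import torch
        import torch.nn.functional as F

        def aggregate_branches(branches):
            # Upsample every branch to the highest resolution and concatenate,
            # instead of keeping only the high-resolution stream.
            h, w = branches[0].shape[-2:]
            upsampled = [branches[0]] + [
                F.interpolate(b, size=(h, w), mode='bilinear', align_corners=False)
                for b in branches[1:]
            ]
            return torch.cat(upsampled, dim=1)  # channel-wise concatenation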

    DeCaFA: Deep Convolutional Cascade for Face Alignment In The Wild

    Face alignment is an active computer vision domain that consists in localizing a number of facial landmarks that vary across datasets. State-of-the-art face alignment methods either perform end-to-end regression or refine the shape in a cascaded manner, starting from an initial guess. In this paper, we introduce DeCaFA, an end-to-end deep convolutional cascade architecture for face alignment. DeCaFA uses fully-convolutional stages to keep full spatial resolution throughout the cascade. Between cascade stages, DeCaFA uses multiple chained transfer layers with spatial softmax to produce landmark-wise attention maps for each of several landmark alignment tasks. Weighted intermediate supervision, as well as efficient feature fusion between the stages, allows the network to learn to progressively refine the attention maps in an end-to-end manner. We show experimentally that DeCaFA significantly outperforms existing approaches on the 300W, CelebA and WFLW databases. In addition, we show that DeCaFA can learn fine alignment with reasonable accuracy from very few images using coarsely annotated data.
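
    A sketch of the spatial softmax mentioned above, together with the soft-argmax commonly paired with it to read landmark coordinates out of attention maps; pairing the two is an assumption for illustration, not a claim about DeCaFA's exact head:

        import torch

        def spatial_softmax(logits):
            # Per-landmark logit maps (B, N, H, W) -> attention maps summing to 1.
            b, n, h, w = logits.shape
            attn = torch.softmax(logits.view(b, n, -1), dim=-1)
            return attn.view(b, n, h, w)

        def soft_argmax(attn):
            # Expected (x, y) location under each attention map.
            b, n, h, w = attn.shape
            ys = torch.arange(h, dtype=attn.dtype).view(1, 1, h, 1)
            xs = torch.arange(w, dtype=attn.dtype).view(1, 1, 1, w)
            x = (attn * xs).sum(dim=(2, 3))
            y = (attn * ys).sum(dim=(2, 3))
            return torch.stack([x, y], dim=-1)  # (B, N, 2)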

    Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment

    Facial landmark localisation in images captured in-the-wild is an important and challenging problem. The current state-of-the-art revolves around certain kinds of Deep Convolutional Neural Networks (DCNNs), such as stacked U-Nets and Hourglass networks. In this work, we propose stacked dense U-Nets for this task. We design a novel scale aggregation network topology and a channel aggregation building block to improve the model's capacity without increasing computational complexity or model size. With the assistance of deformable convolutions inside the stacked dense U-Nets and a coherent loss for outside data transformation, our model obtains the ability to be spatially invariant to arbitrary input face images. Extensive experiments on many in-the-wild datasets validate the robustness of the proposed method under extreme poses, exaggerated expressions and heavy occlusions. Finally, we show that accurate 3D face alignment can assist pose-invariant face recognition, where we achieve a new state-of-the-art accuracy on CFP-FP.
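
    The stacking pattern shared by these architectures is easy to sketch. A generic PyTorch skeleton with intermediate supervision; the paper's dense U-Net stages and channel aggregation blocks would plug in as the stages, which this sketch does not reproduce:

        import torch.nn as nn

        class StackedStages(nn.Module):
            # Generic stack of U-Net-style stages: every stage emits heatmaps
            # that receive supervision and passes its features to the next stage.
            def __init__(self, stages, heads):
                super().__init__()
                self.stages = nn.ModuleList(stages)
                self.heads = nn.ModuleList(heads)  # e.g. 1x1 convs -> heatmaps

            def forward(self, x):
                all_heatmaps = []
                for stage, head in zip(self.stages, self.heads):
                    x = stage(x)
                    all_heatmaps.append(head(x))
                return all_heatmaps  # training loss sums over all stages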

    LPRNet: Lightweight Deep Network by Low-rank Pointwise Residual Convolution

    Deep learning has become popular in recent years primarily due to powerful computing devices such as GPUs. However, deploying these deep models on end-user devices, smartphones, or embedded systems with limited resources is challenging. To reduce computation and memory costs, we propose a novel lightweight deep learning module based on low-rank pointwise residual (LPR) convolution, called LPRNet. Essentially, LPR applies low-rank approximation to pointwise convolution to further reduce the module size, while keeping depthwise convolutions as the residual module to rectify the LPR module. This is critical when the low-rankness undermines the convolution process. We embody our design by replacing modules of identical input-output dimension in MobileNet and ShuffleNetV2. Experiments on visual recognition tasks, including image classification and face alignment on popular benchmarks, show that our LPRNet achieves competitive performance with a significant reduction in FLOPs and memory cost compared to state-of-the-art deep models focusing on model compression.
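
    A PyTorch sketch of the idea as the abstract describes it: the pointwise (1x1) convolution is factorized through a small rank r, and a depthwise convolution forms the residual path that compensates where the low-rank approximation is too lossy. The block structure and kernel size below are assumptions, not the paper's exact design:

        import torch.nn as nn

        class LPRConv(nn.Module):
            def __init__(self, channels, r):
                super().__init__()
                # C -> r -> C factorization of the pointwise convolution.
                self.low_rank = nn.Sequential(
                    nn.Conv2d(channels, r, kernel_size=1, bias=False),
                    nn.Conv2d(r, channels, kernel_size=1, bias=False),
                )
                # Depthwise residual that rectifies the low-rank path.
                self.residual = nn.Conv2d(channels, channels, kernel_size=3,
                                          padding=1, groups=channels, bias=False)

            def forward(self, x):
                return self.low_rank(x) + self.residual(x)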

    Facial landmark detection via attention-adaptive deep network

    Facial landmark detection is a key component of the face recognition pipeline, as well as of facial attribute analysis and face verification. Recently, convolutional neural network-based face alignment methods have achieved significant improvement, but occlusion remains a major hurdle to achieving good accuracy. In this paper, we introduce the attentioned distillation module into our previous Occlusion-adaptive Deep Network (ODN) model to improve performance. In this model, the occlusion probability of each position in the high-level features is inferred by a distillation module; it can be learnt automatically in the process of estimating the relationship between facial appearance and facial shape. The occlusion probability serves as an adaptive weight on the high-level features to reduce the impact of occlusion and obtain a clean feature representation. Nevertheless, the clean feature representation cannot represent the holistic face due to missing semantic features. To obtain an exhaustive and complete feature representation, we leverage a low-rank learning module to recover the lost features. Since facial geometric characteristics are conducive to the low-rank module recovering lost features, the role of the geometry-aware module is to excavate geometric relationships between different facial components, and the role of the attentioned distillation module is to obtain a rich feature representation and model occlusion. To further improve the feature representation, we use channel-wise attention and spatial attention. Experimental results show that our method performs better than existing methods.
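
    The core weighting mechanism is simple to sketch. A minimal PyTorch module, assuming the occlusion probability is predicted per spatial position and used to down-weight occluded features; this is a simplification of the paper's distillation module, not its exact design:

        import torch
        import torch.nn as nn

        class OcclusionGate(nn.Module):
            def __init__(self, channels):
                super().__init__()
                self.head = nn.Conv2d(channels, 1, kernel_size=1)

            def forward(self, feats):
                occ = torch.sigmoid(self.head(feats))  # (B, 1, H, W) occlusion prob.
                return feats * (1.0 - occ)             # clean feature representation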

    HIH: Towards More Accurate Face Alignment via Heatmap in Heatmap

    Recently, heatmap regression models have become the mainstream in locating facial landmarks. To keep computation affordable and reduce memory usage, the whole procedure involves downsampling from the raw image to the output heatmap. But how much impact does the quantization error introduced by downsampling have? The problem has hardly been systematically investigated in previous work. This work fills that gap: we are the first to quantitatively analyze the negative effect. Our statistical results show that the NME caused by quantization error alone is larger than one third of the state-of-the-art NME, a serious obstacle to further breakthroughs in face alignment. To compensate for the quantization effect, we propose a novel method, called Heatmap In Heatmap (HIH), which leverages two categories of heatmaps as the label representation to encode coordinates: the full range of one heatmap corresponds to a single pixel of the other. We also adapt solutions from other fields to face alignment for comparison. Extensive experiments on various benchmarks show the feasibility of HIH and its superior performance over other solutions. Moreover, the mean error reaches 4.18 on WFLW, which surpasses the SOTA by a large margin. Our source code is made publicly available in the supplementary material.
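
    A sketch of how such a two-level label could be decoded at inference, assuming one coarse heatmap per landmark for the integer pixel and one fine heatmap whose full extent spans that single pixel; the paper's exact encoding and decoding may differ:

        import torch

        def decode_hih(coarse, fine):
            # coarse: (B, N, H, W) picks the integer pixel; fine: (B, N, Hf, Wf)
            # refines within that pixel, since its range maps to one coarse cell.
            b, n, h, w = coarse.shape
            idx = coarse.view(b, n, -1).argmax(dim=-1)
            cx, cy = (idx % w).float(), (idx // w).float()
            _, _, hf, wf = fine.shape
            idx_f = fine.view(b, n, -1).argmax(dim=-1)
            ox = (idx_f % wf).float() / wf   # fractional offset in [0, 1)
            oy = (idx_f // wf).float() / hf
            return torch.stack([cx + ox, cy + oy], dim=-1)  # (B, N, 2)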