Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression
Heatmap regression with a deep network has become one of the mainstream
approaches to localize facial landmarks. However, the loss function for heatmap
regression is rarely studied. In this paper, we analyze the ideal loss function
properties for heatmap regression in face alignment problems. Then we propose a
novel loss function, named Adaptive Wing loss, that is able to adapt its shape
to different types of ground truth heatmap pixels. This adaptability penalizes
errors more on foreground pixels and less on background pixels. To address the
imbalance between foreground and background pixels, we also propose a Weighted
Loss Map, which assigns high weights to foreground and difficult background
pixels to help the training process focus more on pixels that are crucial to
landmark localization. To further improve face alignment accuracy, we introduce
boundary prediction and CoordConv with boundary coordinates. Extensive
experiments on different benchmarks, including COFW, 300W and WFLW, show our
approach outperforms the state-of-the-art by a significant margin on various
evaluation metrics. Besides, the Adaptive Wing loss also helps other heatmap
regression tasks. Code will be made publicly available at
https://github.com/protossw512/AdaptiveWingLoss.
Comment: [v2] Camera-ready version for ICCV 2019. [v3] Corrected AUC(fr10%) on
table
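The adaptive behaviour described in this abstract can be illustrated with a small per-pixel sketch. The abstract itself does not give the formula; the piecewise form and the default constants below (alpha, omega, epsilon, theta) are assumptions based on the published paper as recalled here, and should be checked against the authors' repository before use. The function name is hypothetical.

```python
import math

def adaptive_wing_loss(y_true, y_pred, alpha=2.1, omega=14.0, epsilon=1.0, theta=0.5):
    """Per-pixel loss for a ground-truth heatmap value y_true in [0, 1].

    Assumed form: a log-shaped branch for small errors and a linear branch
    for large errors. The exponent (alpha - y_true) depends on the ground
    truth, which is what makes the loss shape adaptive.
    """
    diff = abs(y_true - y_pred)
    power = alpha - y_true
    if diff < theta:
        return omega * math.log(1.0 + (diff / epsilon) ** power)
    # A and C are chosen so that value and slope are continuous at diff == theta.
    ratio = theta / epsilon
    A = omega * (1.0 / (1.0 + ratio ** power)) * power * ratio ** (power - 1.0) / epsilon
    C = theta * A - omega * math.log(1.0 + ratio ** power)
    return A * diff - C
```

With these assumed defaults, the same small error costs more on a foreground pixel (y_true near 1) than on a background pixel (y_true near 0), which matches the behaviour the abstract describes.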
PFLD: A Practical Facial Landmark Detector
Being accurate, efficient, and compact is essential to a facial landmark
detector for practical use. To simultaneously consider the three concerns, this
paper investigates a neat model with promising detection accuracy under wild
environments (e.g., unconstrained pose, expression, lighting, and occlusion
conditions) and super real-time speed on a mobile device. More concretely, we
customize an end-to-end single stage network associated with acceleration
techniques. During the training phase, for each sample, rotation information is
estimated for geometrically regularizing landmark localization, which is then
NOT involved in the testing phase. A novel loss is designed to, besides
considering the geometrical regularization, mitigate the issue of data
imbalance by adjusting weights of samples to different states, such as large
pose, extreme lighting, and occlusion, in the training set. Extensive
experiments are conducted to demonstrate the efficacy of our design and reveal
its superior performance over state-of-the-art alternatives on widely-adopted
challenging benchmarks, i.e., 300W (including iBUG, LFPW, AFW, HELEN, and
XM2VTS) and AFLW. Our model can be as small as 2.1Mb and reach over 140 fps
per face on a mobile phone (Qualcomm ARM 845 processor) with high precision,
making it attractive for large-scale or real-time applications. We have made
our practical system based on PFLD 0.25X model publicly available at
http://sites.google.com/view/xjguo/fld to encourage comparisons and
improvements from the community
ROBUST FACIAL LANDMARKS LOCALIZATION WITH APPLICATIONS IN FACIAL BIOMETRICS
Localization of regions of interest in images and videos is a well-studied problem
in the computer vision community. Localization tasks usually involve locating
objects in a given image, as in object detection and segmentation.
However, the region of interest can be as small as a single pixel, as in
facial landmark localization or human pose estimation. This dissertation studies robust
facial landmark detection algorithms for faces in the wild using learning methods
based on Convolutional Neural Networks.
Detection of specific keypoints on face images is an integral pre-processing step
in facial biometrics and numerous other applications, including face verification and
identification. Detected keypoints allow face images to be aligned to a canonical coordinate
system using geometric transforms such as similarity or affine transformations,
mitigating the adverse effects of rotation and scaling. This challenging problem has
become more attractive in recent years as a result of advances in deep learning and
the release of more unconstrained datasets. The research community is pushing the boundaries
to achieve better and better performance on unconstrained images, which
are diverse in pose, expression and lighting conditions.
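As an illustration of aligning a face to a canonical coordinate system, the following is a minimal sketch of a least-squares 2D similarity fit between detected landmarks and canonical landmark positions. It is a generic textbook construction using complex arithmetic, not the dissertation's own code; the function names and the example coordinates are hypothetical.

```python
def fit_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src landmarks onto dst.

    Points are (x, y) tuples. Treating each point as a complex number z,
    the transform is w = a * z + b, where a encodes scale and rotation
    and b encodes translation.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    n = len(zs)
    mz = sum(zs) / n
    mw = sum(ws) / n
    num = sum((w - mw) * (z - mz).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - mz) ** 2 for z in zs)
    a = num / den
    b = mw - a * mz
    return a, b

def apply_similarity(a, b, points):
    """Apply w = a * z + b to a list of (x, y) points."""
    out = []
    for x, y in points:
        w = a * complex(x, y) + b
        out.append((w.real, w.imag))
    return out
```

In use, `src` would hold the detected landmarks and `dst` the canonical template positions; the same `a` and `b` can then drive an image warp, which is outside the scope of this sketch.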
Over the years, researchers have developed various hand-crafted techniques
to extract meaningful features from images, most of them appearance-based and
geometry-based features. However, these features do not perform well on data collected
in unconstrained settings due to large variations in appearance and other nuisance
factors. Convolutional Neural Networks (CNNs) have become prominent because
of their ability to extract discriminative features. Unlike hand-crafted features,
deep CNNs (DCNNs) learn feature extraction and feature classification from the data itself in
an end-to-end fashion. This makes DCNNs robust to variations present
in the data while also improving their discriminative ability.
In this dissertation, we discuss three different methods for facial keypoint detection
based on Convolutional Neural Networks. The methods are generic and can be
extended to the related problem of keypoint detection for human pose estimation. The
first method, called Cascaded Local Deep Descriptor Regression, uses deep features extracted
around local points to learn linear regressors that incrementally correct the
initial estimate of the keypoints. In the second method, called KEPLER, we develop
efficient Heatmap CNNs to directly learn the non-linear mapping between the input
and target spaces. We also apply different regularization techniques to tackle the
effects of imbalanced data and vanishing gradients. In the third method, we model
the spatial correlation between different keypoints using Pose Conditioned Convolution
Deconvolution Networks (PCD-CNN) while at the same time making the model pose
agnostic by disentangling pose from the face image. Next, we show an application
of facial landmark localization: aligning face images for the task of apparent
age estimation of humans from unconstrained images.
In the fourth part of this dissertation, we discuss the impact of good-quality
landmarks on the task of face verification. Previously proposed methods perform
with reasonable accuracy on high-resolution, good-quality images, but fail when
the input image suffers from degradation. To this end, we propose a semi-supervised
method that aims at predicting landmarks in low-quality images. This method
learns to predict landmarks in low-resolution images by learning to model the learning
process of high-resolution images. In this algorithm, we use Generative Adversarial
Networks, which first learn to model the distribution of real low-resolution images,
after which another CNN learns to model the distribution of heatmaps on the images.
Additionally, we propose another high-quality facial landmark detection method,
which is currently state of the art.
Finally, we discuss the extension of ideas developed for facial keypoint
localization to the task of human pose estimation, which is one of the important
cues for Human Activity Recognition. As in PCD-CNN, the parts of the human body
can also be modelled in a tree structure, where the relationships between these parts are
learnt through convolutions while being conditioned on the 3D pose and orientation.
Another interesting avenue for research is extending facial landmark localization to
naturally degraded images
Face Alignment using a 3D Deeply-initialized Ensemble of Regression Trees
Face alignment algorithms locate a set of landmark points in images of faces
taken in unrestricted situations. State-of-the-art approaches typically fail or
lose accuracy in the presence of occlusions, strong deformations, large pose
variations and ambiguous configurations. In this paper we present 3DDE, a
robust and efficient face alignment algorithm based on a coarse-to-fine cascade
of ensembles of regression trees. It is initialized by robustly fitting a 3D
face model to the probability maps produced by a convolutional neural network.
With this initialization we address self-occlusions and large face rotations.
Further, the regressor implicitly imposes a prior face shape on the solution,
addressing occlusions and ambiguous face configurations. Its coarse-to-fine
structure tackles the combinatorial explosion of parts deformation. In the
experiments performed, 3DDE improves the state-of-the-art in 300W, COFW, AFLW
and WFLW data sets. Finally, we perform cross-dataset experiments that reveal
the existence of a significant data set bias in these benchmarks.
Comment: Accepted Version to Computer Vision and Image Understanding
High-Resolution Representations for Labeling Pixels and Regions
High-resolution representation learning plays an essential role in many
vision problems, e.g., pose estimation and semantic segmentation. The
high-resolution network (HRNet) [SunXLW19], recently developed for human
pose estimation, maintains high-resolution representations through the whole
process by connecting high-to-low resolution convolutions in parallel
and produces strong high-resolution representations by repeatedly conducting
fusions across parallel convolutions.
In this paper, we conduct a further study on high-resolution representations
by introducing a simple yet effective modification and apply it to a wide range
of vision tasks. We augment the high-resolution representation by aggregating
the (upsampled) representations from all the parallel convolutions rather than
only the representation from the high-resolution convolution as done
in [SunXLW19]. This simple modification leads to stronger representations,
evidenced by superior results. We show top results in semantic segmentation on
Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW,
COFW, 300W, and WFLW. In addition, we build a multi-level representation from
the high-resolution representation and apply it to the Faster R-CNN object
detection framework and the extended frameworks. The proposed approach achieves
superior results to existing single-model networks on COCO object detection.
The code and models are publicly available at
https://github.com/HRNet
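The modification described in this abstract, aggregating the upsampled representations from all parallel branches, can be sketched as follows. This is only an illustration of the idea on nested lists: nearest-neighbour upsampling is a simplifying assumption (the abstract does not state the interpolation used), and the function names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D map (list of rows) by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def aggregate_branches(branches):
    """Concatenate channels from all branches at the highest resolution.

    branches[0] is the high-resolution branch; each branch is a list of
    2D channel maps, and lower-resolution branches are assumed to be
    smaller by an integer factor.
    """
    target_h = len(branches[0][0])
    fused = []
    for branch in branches:
        factor = target_h // len(branch[0])
        for channel in branch:
            fused.append(upsample_nearest(channel, factor))
    return fused
```

Using only `branches[0]` would correspond to the earlier design the abstract contrasts against; the aggregated version keeps the channels of every branch.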
DeCaFA: Deep Convolutional Cascade for Face Alignment In The Wild
Face alignment is an active computer vision domain that consists of
localizing a number of facial landmarks that vary across datasets.
State-of-the-art face alignment methods either perform end-to-end
regression or refine the shape in a cascaded manner, starting from an
initial guess. In this paper, we introduce DeCaFA, an end-to-end deep
convolutional cascade architecture for face alignment. DeCaFA uses
fully-convolutional stages to keep full spatial resolution throughout the
cascade. Between each cascade stage, DeCaFA uses multiple chained transfer
layers with spatial softmax to produce landmark-wise attention maps for each of
several landmark alignment tasks. Weighted intermediate supervision, as well as
efficient feature fusion between the stages, allows the network to progressively
refine the attention maps in an end-to-end manner. We show experimentally that
DeCaFA significantly outperforms existing approaches on 300W, CelebA and WFLW
databases. In addition, we show that DeCaFA can learn fine alignment with
reasonable accuracy from very few images using coarsely annotated data
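The spatial softmax mentioned in this abstract turns a raw score map into a landmark-wise attention map that sums to one. A minimal sketch on a single score map is given below; the expected-location readout is an illustrative addition (often called soft-argmax) and is not claimed to be part of DeCaFA. Function names are hypothetical.

```python
import math

def spatial_softmax(score_map):
    """Softmax over all spatial positions of a 2D score map (list of rows)."""
    peak = max(max(row) for row in score_map)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [[math.exp(v - peak) for v in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def expected_location(attention):
    """Attention-weighted mean (x, y) position, in pixel indices."""
    x = sum(p * j for row in attention for j, p in enumerate(row))
    y = sum(p * i for i, row in enumerate(attention) for p in row)
    return x, y
```

A sharply peaked score map yields an attention map concentrated on one pixel, so the expected location lands close to that pixel.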
Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment
Facial landmark localisation in images captured in-the-wild is an important
and challenging problem. The current state-of-the-art revolves around certain
kinds of Deep Convolutional Neural Networks (DCNNs) such as stacked U-Nets and
Hourglass networks. In this work, we innovatively propose stacked dense U-Nets
for this task. We design a novel scale aggregation network topology structure
and a channel aggregation building block to improve the model's capacity
without sacrificing the computational complexity and model size. With the
assistance of deformable convolutions inside the stacked dense U-Nets and
coherent loss for outside data transformation, our model obtains the ability to
be spatially invariant to arbitrary input face images. Extensive experiments on
many in-the-wild datasets validate the robustness of the proposed method under
extreme poses, exaggerated expressions and heavy occlusions. Finally, we show
that accurate 3D face alignment can assist pose-invariant face recognition
where we achieve a new state-of-the-art accuracy on CFP-FP
LPRNet: Lightweight Deep Network by Low-rank Pointwise Residual Convolution
Deep learning has become popular in recent years primarily due to
powerful computing devices such as GPUs. However, deploying these deep models to
end-user devices, smart phones, or embedded systems with limited resources is
challenging. To reduce the computation and memory costs, we propose a novel
lightweight deep learning module by low-rank pointwise residual (LPR)
convolution, called LPRNet. Essentially, LPR aims at using low-rank
approximation in pointwise convolution to further reduce the module size, while
keeping depthwise convolutions as the residual module to rectify the LPR
module. This is critical when the low-rankness undermines the convolution
process. We embody our design by replacing modules of identical input-output
dimension in MobileNet and ShuffleNetv2. Experiments on visual recognition
tasks including image classification and face alignment on popular benchmarks
show that our LPRNet achieves competitive performance with a significant
reduction in FLOPs and memory cost compared to the state-of-the-art deep models
focusing on model compression
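To see why a low-rank approximation shrinks a pointwise (1x1) convolution, the weight counts can be compared directly. The sketch below assumes a generic two-factor decomposition of the channel-mixing matrix; the abstract does not specify how LPRNet structures its factors, and the channel counts and rank are hypothetical. Biases and the depthwise residual branch are ignored.

```python
def pointwise_params(c_in, c_out):
    """Weights in a full 1x1 convolution mapping c_in channels to c_out."""
    return c_in * c_out

def low_rank_pointwise_params(c_in, c_out, rank):
    """Weights when the 1x1 convolution is factored as (c_in -> rank) then (rank -> c_out)."""
    return c_in * rank + rank * c_out

def reduction_factor(c_in, c_out, rank):
    """How many times fewer weights the factored form needs."""
    return pointwise_params(c_in, c_out) / low_rank_pointwise_params(c_in, c_out, rank)
```

For 256 input and output channels and a rank of 32, the factored form uses 16384 weights instead of 65536, a fourfold reduction under these assumptions.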
Facial landmark detection via attention-adaptive deep network
Facial landmark detection is a key component of the face recognition pipeline, as well as of facial attribute analysis and face verification. Convolutional neural network-based face alignment methods have recently achieved significant improvements, but occlusion remains a major obstacle to good accuracy. In this paper, we introduce an attentioned distillation module into our previous Occlusion-adaptive Deep Network (ODN) model to improve performance. In this model, the occlusion probability of each position in the high-level features is inferred by a distillation module, which is learnt automatically in the process of estimating the relationship between facial appearance and facial shape. The occlusion probability serves as an adaptive weight on the high-level features to reduce the impact of occlusion and obtain a clean feature representation. Nevertheless, the clean feature representation cannot represent the holistic face because of the missing semantic features. To obtain an exhaustive and complete feature representation, we leverage a low-rank learning module to recover the lost features. Because facial geometric characteristics help the low-rank module recover lost features, a geometry-aware module is used to extract geometric relationships between different facial components. The attentioned distillation module, in turn, provides a rich feature representation and models occlusion; to improve the feature representation, we use channel-wise attention and spatial attention. Experimental results show that our method performs better than existing methods
HIH: Towards More Accurate Face Alignment via Heatmap in Heatmap
Recently, heatmap regression models have become the mainstream approach to locating
facial landmarks. To keep computation affordable and reduce memory usage, the
whole procedure involves downsampling from the raw image to the output heatmap.
However, how much impact does the quantization error introduced by downsampling
have? The problem has hardly been investigated systematically in previous work.
This work fills the gap, and we are the first to quantitatively analyze the
negative impact. The statistical results show that the NME generated by quantization
error is even larger than 1/3 of the SOTA item, which is a serious obstacle to
making a new breakthrough in face alignment. To compensate for the
quantization effect, we propose a novel method, called Heatmap In Heatmap (HIH),
which leverages two categories of heatmaps as label representation to encode
coordinates. In HIH, the range of one heatmap represents a pixel of the
other category of heatmap. We also combine face alignment with
solutions from other fields to make a comparison. Extensive experiments on
various benchmarks show the feasibility of HIH and its superior performance
over other solutions. Moreover, the mean error reaches 4.18 on WFLW, which
exceeds SOTA by a large margin. Our source code is made publicly available in the
supplementary material
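The quantization error that this abstract analyzes can be reproduced with a one-dimensional sketch: a landmark coordinate is snapped to the nearest pixel of a downsampled heatmap and then mapped back to image coordinates. This illustrates only the problem, not the HIH method itself; the stride of 4 and the function names are hypothetical.

```python
def quantized_decode(x, stride):
    """Snap coordinate x to the nearest heatmap pixel, then map back to image scale."""
    cell = round(x / stride)  # integer heatmap index (ties round to the even index)
    return cell * stride

def worst_case_error(stride, steps=400):
    """Largest round-trip error over evenly spaced coordinates in [0, stride]."""
    xs = [i * stride / steps for i in range(steps + 1)]
    return max(abs(x - quantized_decode(x, stride)) for x in xs)
```

With a stride of 4, a landmark at x = 101.7 decodes to 100, an error of 1.7 pixels, and the worst case over the sampled range is half the stride.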