Occlusion-Adaptive Deep Network for Robust Facial Expression Recognition
Recognizing the expressions of partially occluded faces is a challenging computer vision problem. Previous expression recognition methods have either overlooked this issue or addressed it under restrictive assumptions. Motivated by the fact that the human visual system is adept at ignoring occlusions and focusing on non-occluded facial areas, we propose a landmark-guided attention branch that finds and discards corrupted features from occluded regions so that they are not used for recognition. An attention map is first generated to indicate whether a specific facial part is occluded and to guide our model to attend to non-occluded regions. To further improve robustness, we propose a facial region branch that partitions the feature maps into non-overlapping facial blocks and tasks each block with predicting the expression independently. This results in more diverse and discriminative features, enabling the expression recognition system to recover even when the face is partially occluded. Owing to the synergistic effects of the two branches, our occlusion-adaptive deep network significantly outperforms state-of-the-art methods on two challenging in-the-wild benchmark datasets and three real-world occluded expression datasets.
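The facial region branch described above, which partitions feature maps into non-overlapping blocks that each predict the expression independently, can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's architecture: the block size, the per-block linear classifier, the class count, and fusion by simple averaging are all assumptions.

```python
import numpy as np

def region_branch_logits(feature_map, block_size, rng=np.random.default_rng(0)):
    """Partition an HxWxC feature map into non-overlapping blocks and let
    each block predict expression logits independently (hedged sketch)."""
    H, W, C = feature_map.shape
    n_classes = 7  # typical number of basic expression categories (assumption)
    logits = []
    for i in range(0, H, block_size):
        for j in range(0, W, block_size):
            block = feature_map[i:i + block_size, j:j + block_size, :]
            pooled = block.mean(axis=(0, 1))             # average-pool each block
            W_cls = rng.standard_normal((C, n_classes))  # hypothetical per-block classifier
            logits.append(pooled @ W_cls)
    # Fuse the independent per-block predictions (here: simple averaging).
    return np.mean(logits, axis=0)
```

Because each block carries its own classifier, an occluded block can be wrong without dragging down the non-occluded blocks, which is the intuition the abstract appeals to.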
Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression
Heatmap regression with a deep network has become one of the mainstream approaches to localizing facial landmarks. However, the loss function for heatmap regression is rarely studied. In this paper, we analyze the ideal loss function properties for heatmap regression in face alignment problems. We then propose a novel loss function, named Adaptive Wing loss, that is able to adapt its shape to different types of ground-truth heatmap pixels. This adaptability penalizes foreground pixels more heavily and background pixels less. To address the imbalance between foreground and background pixels, we also propose a Weighted Loss Map, which assigns high weights to foreground and difficult background pixels to help the training process focus on the pixels that are crucial to landmark localization. To further improve face alignment accuracy, we introduce boundary prediction and CoordConv with boundary coordinates. Extensive experiments on different benchmarks, including COFW, 300W and WFLW, show that our approach outperforms the state-of-the-art by a significant margin on various evaluation metrics. Besides, the Adaptive Wing loss also helps other heatmap regression tasks. Code will be made publicly available at https://github.com/protossw512/AdaptiveWingLoss.
Comment: [v2] Camera-ready version for ICCV 2019. [v3] Corrected AUC(fr10%) in table
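The shape-adaptive behavior described in the abstract can be sketched as follows. This is a hedged NumPy rendition of the piecewise loss as commonly stated for this paper: the ground-truth value y enters the exponent, so foreground pixels (y near 1) are penalized more strongly than background pixels (y near 0), and the constants A and C make the nonlinear and linear branches meet continuously at the threshold θ. The default hyperparameter values are assumptions; consult the paper and repository for the authoritative formulation.

```python
import numpy as np

def adaptive_wing_loss(pred, target, omega=14.0, theta=0.5, epsilon=1.0, alpha=2.1):
    """Adaptive Wing loss (sketch): the exponent (alpha - y) adapts the
    loss shape to the ground-truth heatmap value y."""
    delta = np.abs(target - pred)
    expo = alpha - target  # shape-adapting exponent
    # Linear-branch slope A and offset C, chosen so both branches
    # take the same value at delta == theta (continuity).
    A = omega * (1.0 / (1.0 + (theta / epsilon) ** expo)) * expo * \
        ((theta / epsilon) ** (expo - 1.0)) / epsilon
    C = theta * A - omega * np.log1p((theta / epsilon) ** expo)
    return np.where(delta < theta,
                    omega * np.log1p((delta / epsilon) ** expo),
                    A * delta - C)
```

At equal error, a foreground pixel (target near 1) incurs a larger loss than a background pixel (target near 0), which is exactly the "penalize foreground more, background less" property the abstract claims.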
Joint Multi-view Face Alignment in the Wild
The de facto algorithm for facial landmark estimation involves running a face detector and subsequently fitting a deformable model on the resulting bounding box. This encompasses two basic problems: i) the detection and deformable fitting steps are performed independently, and the detector might not provide the best-suited initialisation for the fitting step; ii) the face appearance varies hugely across different poses, which makes deformable face fitting very challenging, so that distinct models have to be used (e.g., one for profile and one for frontal faces). In this work, we propose the first, to the best of our knowledge, joint multi-view convolutional network to handle large pose variations across faces in-the-wild, and to elegantly bridge the face detection and facial landmark localisation tasks. Existing joint face detection and landmark localisation methods focus only on a very small set of landmarks. By contrast, our method can detect and align a large number of landmarks for semi-frontal (68 landmarks) and profile (39 landmarks) faces. We evaluate our model on a plethora of datasets, including standard static image datasets such as IBUG, 300W, COFW, and the latest Menpo Benchmark for both semi-frontal and profile faces. Significant improvement over state-of-the-art methods on deformable face tracking is witnessed on the 300VW benchmark. We also demonstrate state-of-the-art results for face detection on the FDDB and MALF datasets.
Comment: submitted to IEEE Transactions on Image Processing
Pose-adaptive Hierarchical Attention Network for Facial Expression Recognition
Multi-view facial expression recognition (FER) is a challenging task because the appearance of an expression varies with pose. To alleviate the influence of pose, recent methods either perform pose normalization or learn separate FER classifiers for each pose. However, these methods usually have two stages and rely on the good performance of pose estimators. Different from existing methods, we propose a pose-adaptive hierarchical attention network (PhaNet) that can jointly recognize facial expressions and poses in unconstrained environments. Specifically, PhaNet discovers the regions most relevant to the facial expression through an attention mechanism at hierarchical scales, and the most informative scales are then selected to learn pose-invariant and expression-discriminative representations. PhaNet is end-to-end trainable by minimizing the hierarchical attention losses, the FER loss and the pose loss with dynamically learned loss weights. We validate the effectiveness of the proposed PhaNet on three multi-view datasets (BU-3DFE, Multi-PIE, and KDEF) and two in-the-wild FER datasets (AffectNet and SFEW). Extensive experiments demonstrate that our framework outperforms the state-of-the-art under both within-dataset and cross-dataset settings, achieving average accuracies of 84.92%, 93.53%, 88.5%, 54.82% and 31.25%, respectively.
Comment: 12 pages, 15 figures
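The "dynamically learned loss weights" mentioned above can be illustrated with homoscedastic-uncertainty weighting (Kendall et al.), one common way to learn per-task weights jointly with the network. Whether PhaNet uses this exact scheme is an assumption; the sketch below is only an illustration of the general idea, not the paper's implementation.

```python
import numpy as np

def weighted_multitask_loss(losses, log_vars):
    """Combine per-task losses with learnable weights exp(-s_i), plus a
    regularizer s_i that keeps weights from collapsing to zero. The
    log-variances s_i would be trained jointly with the network."""
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))
```

With all s_i at zero this reduces to a plain sum of the task losses; during training, tasks with noisier losses tend to receive larger s_i and hence smaller effective weights.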
Deep Learning Architectures for Face Recognition in Video Surveillance
Face recognition (FR) systems for video surveillance (VS) applications attempt to accurately detect the presence of target individuals over a distributed network of cameras. In video-based FR systems, facial models of target individuals are designed a priori during enrollment using a limited number of reference still images or video data. These facial models are typically not representative of the faces observed during operations due to large variations in illumination, pose, scale, occlusion, blur, and camera inter-operability. Specifically, in still-to-video FR applications, a single high-quality reference still image captured with a still camera under controlled conditions is employed to generate a facial model to be matched later against lower-quality faces captured with video cameras under uncontrolled conditions. Current video-based FR systems can perform well in controlled scenarios, while their performance is not satisfactory in uncontrolled scenarios, mainly because of the differences between the source (enrollment) and the target (operational) domains. Most of the efforts in this area have been directed toward the design of robust video-based FR systems for unconstrained surveillance environments. This chapter presents an overview of recent advances in the still-to-video FR scenario through deep convolutional neural networks (CNNs). In particular, deep learning architectures proposed in the literature based on the triplet-loss function (e.g., cross-correlation matching CNN, trunk-branch ensemble CNN and HaarNet) and supervised autoencoders (e.g., canonical face representation CNN) are reviewed and compared in terms of accuracy and computational complexity.
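The triplet-loss function underlying several of the reviewed architectures pulls an anchor embedding toward a positive example (same identity) and pushes it away from a negative (different identity) by at least a margin. A minimal sketch follows; the margin value and the use of Euclidean distance are conventional choices, not specific to the reviewed models.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: encourage
    d(anchor, positive) + margin <= d(anchor, negative)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is pushed sufficiently far beyond the positive, so training focuses on triplets that still violate the margin.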
Unsupervised Eyeglasses Removal in the Wild
Eyeglasses removal is challenging: it requires removing different kinds of eyeglasses, e.g., rimless glasses, full-rim glasses and sunglasses, and recovering plausible eyes. Due to the large visual variation, conventional methods lack scalability. Most existing works focus on frontal face images in controlled environments, such as the laboratory, and need to design specific systems for different eyeglass types. To address this limitation, we propose a unified eyeglasses removal model called Eyeglasses Removal Generative Adversarial Network (ERGAN), which can handle different types of glasses in the wild. The proposed method does not depend on dense annotation of eyeglasses locations but benefits from large-scale face images with weak annotations. Specifically, we study two relevant tasks simultaneously, i.e., removing and wearing eyeglasses. Given two facial images, with and without eyeglasses, the proposed model learns to swap the eye areas of the two faces. The generation mechanism focuses on the eye area and avoids the difficulty of generating an entirely new face. In the experiments, we show that the proposed method achieves competitive removal quality in terms of realism and diversity. Furthermore, we evaluate ERGAN on several downstream tasks, such as face verification and facial expression recognition. The experiments show that our method can serve as a pre-processing step for these tasks.
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
Robust Facial Landmark Localization Based on Texture and Pose Correlated Initialization
Robust facial landmark localization remains a challenging task when faces are partially occluded. Recently, cascaded pose regression has attracted increasing attention due to its superior performance in facial landmark localization and occlusion detection. However, such an approach is sensitive to initialization, and an improper initialization can severely degrade performance. In this paper, we propose Robust Initialization for Cascaded Pose Regression (RICPR), which provides texture- and pose-correlated initial shapes for the testing face. By examining the correlation of local binary pattern histograms between the testing face and the training faces, the shapes of the training faces that are most correlated with the testing face are selected as the texture-correlated initialization. To make the initialization more robust to various poses, we estimate the rough pose of the testing face from five fiducial landmarks located by multitask cascaded convolutional networks. The pose-correlated initial shapes are then constructed from the mean face shape and the rough testing-face pose. Finally, the texture-correlated and pose-correlated initial shapes are joined together as the robust initialization. We evaluate RICPR on the challenging COFW dataset. The experimental results demonstrate that the proposed scheme achieves better performance than state-of-the-art methods in facial landmark localization and occlusion detection.
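The texture-correlated selection step described above can be sketched as follows: given LBP histograms for the test face and all training faces, pick the shapes of the most correlated training faces. This is a hedged illustration; the number of selected shapes k and the use of Pearson correlation are assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_correlated_shapes(test_hist, train_hists, train_shapes, k=3):
    """Return the k training-face shapes whose LBP histograms correlate
    best with the test face's histogram (sketch)."""
    test_hist = np.asarray(test_hist, dtype=float)
    corrs = [np.corrcoef(test_hist, np.asarray(h, dtype=float))[0, 1]
             for h in train_hists]
    best = np.argsort(corrs)[::-1][:k]  # indices of the most correlated faces
    return [train_shapes[i] for i in best]
```

The selected shapes would then be merged with the pose-correlated shapes to form the robust initialization for the cascaded regressor.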
Face Recognition in Low Quality Images: A Survey
Low-resolution face recognition (LRFR) has received increasing attention over the past few years. Its applications lie widely in real-world environments where high-resolution or high-quality images are hard to capture. One of the biggest demands for LRFR technologies is video surveillance. As the number of surveillance cameras in cities increases, the captured videos will need to be processed automatically. However, those videos or images are usually captured with large standoff distances, arbitrary illumination conditions, and diverse viewing angles. Faces in these images are generally small in size. Several studies have addressed this problem with techniques such as super-resolution, deblurring, or learning a relationship between different resolution domains. In this paper, we provide a comprehensive review of approaches to low-resolution face recognition from the past five years. First, a general problem definition is given. Then, a systematic analysis of the works on this topic is presented by category. In addition to describing the methods, we also focus on datasets and experimental settings. We further address related work on unconstrained low-resolution face recognition and compare it with results that use synthetic low-resolution data. Finally, we summarize the general limitations and speculate on priorities for future effort.
Comment: There are some mistakes in this paper which may mislead readers, and we will not have a new version in the short term. We will resubmit once it is corrected.
Deep Appearance Models: A Deep Boltzmann Machine Approach for Face Modeling
The "interpretation through synthesis" approach to analyzing face images, particularly the Active Appearance Models (AAMs) method, has become one of the most successful face modeling approaches of the last two decades. AAM models have the ability to represent face images through synthesis using a controllable parameterized Principal Component Analysis (PCA) model. However, the accuracy and robustness of the faces synthesized by AAMs depend heavily on the training sets and, inherently, on the generalizability of the PCA subspaces. This paper presents a novel Deep Appearance Models (DAMs) approach, an efficient replacement for AAMs, to accurately capture both the shape and texture of face images under large variations. In this approach, three crucial components represented in hierarchical layers are modeled using Deep Boltzmann Machines (DBMs) to robustly capture the variations of facial shapes and appearances. DAMs are therefore superior to AAMs in inferring a representation for new face images under various challenging conditions. The proposed approach is evaluated in various applications to demonstrate its robustness and capabilities, i.e., facial super-resolution reconstruction, facial off-angle reconstruction or face frontalization, facial occlusion removal, and age estimation, using challenging face databases, i.e., Labeled Face Parts in the Wild (LFPW), Helen and FG-NET. Compared to AAMs and other deep learning based approaches, the proposed DAMs achieve competitive results in those applications, demonstrating their advantages in handling occlusions, facial representation, and reconstruction.