8,723 research outputs found
DeCaFA: Deep Convolutional Cascade for Face Alignment In The Wild
Face Alignment is an active computer vision domain that consists of
localizing a number of facial landmarks that vary across datasets.
State-of-the-art face alignment methods either perform end-to-end regression
or refine the shape in a cascaded manner, starting from an initial guess. In
this paper, we introduce DeCaFA, an end-to-end deep
convolutional cascade architecture for face alignment. DeCaFA uses
fully-convolutional stages to keep full spatial resolution throughout the
cascade. Between each cascade stage, DeCaFA uses multiple chained transfer
layers with spatial softmax to produce landmark-wise attention maps for each of
several landmark alignment tasks. Weighted intermediate supervision and
efficient feature fusion between the stages allow the network to progressively
refine the attention maps in an end-to-end manner. We show experimentally that
DeCaFA significantly outperforms existing approaches on 300W, CelebA and WFLW
databases. In addition, we show that DeCaFA can learn fine alignment with
reasonable accuracy from very few images using coarsely annotated data.
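The spatial-softmax attention described above can be sketched in a few lines. This is a minimal numpy illustration (not the authors' code; the convolutional stages producing the raw heatmaps are omitted): each landmark heatmap is normalized into an attention map that sums to one, and a soft-argmax yields differentiable landmark coordinates.

```python
import numpy as np

def spatial_softmax(heatmaps):
    """Convert raw landmark heatmaps (L, H, W) into attention maps
    that each sum to 1 over the spatial dimensions."""
    L, H, W = heatmaps.shape
    flat = heatmaps.reshape(L, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(flat)
    return (exp / exp.sum(axis=1, keepdims=True)).reshape(L, H, W)

def soft_argmax(attn):
    """Expected (x, y) coordinate under each attention map."""
    L, H, W = attn.shape
    ys, xs = np.mgrid[0:H, 0:W]
    x = (attn * xs).reshape(L, -1).sum(axis=1)
    y = (attn * ys).reshape(L, -1).sum(axis=1)
    return np.stack([x, y], axis=1)

# Toy example: one landmark with a peak at (row=2, col=3) on a 5x5 map.
hm = np.zeros((1, 5, 5))
hm[0, 2, 3] = 10.0
attn = spatial_softmax(hm)
coords = soft_argmax(attn)   # close to (x=3, y=2)
```

Because both steps are differentiable, attention maps like these can be refined end-to-end across cascade stages, which is the property the abstract relies on.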
Deep Multi-Center Learning for Face Alignment
Facial landmarks are highly correlated with each other since a certain
landmark can be estimated by its neighboring landmarks. Most of the existing
deep learning methods only use one fully-connected layer called shape
prediction layer to estimate the locations of facial landmarks. In this paper,
we propose a novel deep learning framework named Multi-Center Learning with
multiple shape prediction layers for face alignment. In particular, each shape
prediction layer emphasizes the detection of a certain cluster of
semantically relevant landmarks. Challenging landmarks are focused on first,
and each cluster of landmarks is then optimized separately.
Moreover, to reduce the model complexity, we propose a model assembling method
to integrate multiple shape prediction layers into one shape prediction layer.
Extensive experiments demonstrate that our method is effective for handling
complex occlusions and appearance variations with real-time performance. The
code for our method is available at
https://github.com/ZhiwenShao/MCNet-Extension.
Comment: This paper has been accepted by Neurocomputing.
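The model-assembling idea above can be illustrated with plain linear layers. A minimal numpy sketch with hypothetical cluster assignments (not the paper's implementation): the weight columns of each specialised shape prediction layer that correspond to its own landmark cluster are copied into a single weight matrix, so inference needs only one layer and one matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 16, 6                          # feature dim, number of landmarks
clusters = [[0, 1], [2, 3], [4, 5]]   # hypothetical landmark clusters

# One shape prediction layer per cluster, each predicting all 2N coordinates.
layers = [rng.normal(size=(D, 2 * N)) for _ in clusters]

# Assembling: build a single layer by taking, for each landmark, the
# weight columns from the layer specialised on its cluster.
W = np.zeros((D, 2 * N))
for Wk, members in zip(layers, clusters):
    for n in members:
        W[:, 2 * n:2 * n + 2] = Wk[:, 2 * n:2 * n + 2]

feat = rng.normal(size=(D,))
assembled = feat @ W                  # one layer, one matmul

# Same result as querying each specialised layer for its own cluster:
per_cluster = np.zeros(2 * N)
for Wk, members in zip(layers, clusters):
    out = feat @ Wk
    for n in members:
        per_cluster[2 * n:2 * n + 2] = out[2 * n:2 * n + 2]
```

The assembled layer is exactly equivalent to the ensemble of specialised layers on their own clusters, which is how the model complexity reduction comes for free.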
Facial Landmark Machines: A Backbone-Branches Architecture with Progressive Representation Learning
Facial landmark localization plays a critical role in face recognition and
analysis. In this paper, we propose a novel cascaded backbone-branches fully
convolutional neural network~(BB-FCN) for rapidly and accurately localizing
facial landmarks in unconstrained and cluttered settings. Our proposed BB-FCN
generates facial landmark response maps directly from raw images without any
preprocessing. BB-FCN follows a coarse-to-fine cascaded pipeline, which
consists of a backbone network for roughly detecting the locations of all
facial landmarks and one branch network for each type of detected landmark for
further refining their locations. Furthermore, to facilitate the facial
landmark localization under unconstrained settings, we propose a large-scale
benchmark named SYSU16K, which contains 16000 faces with large variations in
pose, expression, illumination and resolution. Extensive experimental
evaluations demonstrate that our proposed BB-FCN can significantly outperform
the state-of-the-art under both constrained (i.e., within detected facial
regions only) and unconstrained settings. We further confirm that high-quality
facial landmarks localized with our proposed network can also improve the
precision and recall of face detection.
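The coarse-to-fine pipeline above can be sketched as a two-step localisation. This is a minimal numpy illustration in which the backbone and branch networks are abstracted away as precomputed response maps: the backbone's argmax gives a rough estimate, and a branch stage re-localises inside a small window around it.

```python
import numpy as np

def coarse_location(resp):
    """Rough landmark estimate: argmax of a backbone response map."""
    return np.unravel_index(np.argmax(resp), resp.shape)

def refine(branch_resp, coarse_yx, half=2):
    """Branch-stage sketch: re-localise inside a small window around
    the coarse estimate, using the branch's response map."""
    y, x = coarse_yx
    H, W = branch_resp.shape
    y0, y1 = max(0, y - half), min(H, y + half + 1)
    x0, x1 = max(0, x - half), min(W, x + half + 1)
    win = branch_resp[y0:y1, x0:x1]
    dy, dx = np.unravel_index(np.argmax(win), win.shape)
    return (y0 + dy, x0 + dx)

# Toy maps: the backbone peak is one pixel off the true location (7, 9);
# the branch map peaks at the true location and corrects it.
coarse = np.zeros((16, 16)); coarse[6, 9] = 1.0
fine = np.zeros((16, 16));   fine[7, 9] = 1.0
yx = coarse_location(coarse)   # (6, 9)
refined = refine(fine, yx)     # (7, 9)
```

Restricting the refinement to a window is what keeps the branch stage cheap while still correcting the backbone's quantisation error.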
Learning Deep Representation for Face Alignment with Auxiliary Attributes
In this study, we show that the landmark detection (face alignment) task is not
a single and independent problem. Instead, its robustness can be greatly
improved with auxiliary information. Specifically, we jointly optimize landmark
detection together with the recognition of heterogeneous but subtly correlated
facial attributes, such as gender, expression, and appearance attributes. This
is non-trivial since different attribute inference tasks have different
learning difficulties and convergence rates. To address this problem, we
formulate a novel tasks-constrained deep model, which not only learns the
inter-task correlation but also employs dynamic task coefficients to facilitate
the optimization convergence when learning multiple complex tasks. Extensive
evaluations show that the proposed task-constrained learning (i) outperforms
existing face alignment methods, especially in dealing with faces with severe
occlusion and pose variation, and (ii) reduces model complexity drastically
compared to state-of-the-art methods based on cascaded deep models.
Comment: to be published in the IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI).
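The tasks-constrained objective with dynamic task coefficients above can be sketched as follows. This is a hypothetical numpy illustration, not the paper's exact scheme: the joint loss is a coefficient-weighted sum of the landmark loss and auxiliary attribute losses, and an auxiliary task's coefficient is decayed once its validation error plateaus, so tasks with different convergence rates stop interfering.

```python
import numpy as np

def multitask_loss(landmark_pred, landmark_gt, aux_losses, lambdas):
    """Joint objective sketch: landmark regression (L2) plus a weighted
    sum of auxiliary attribute losses with per-task coefficients."""
    main = np.mean((landmark_pred - landmark_gt) ** 2)
    return main + sum(lam * a for lam, a in zip(lambdas, aux_losses))

def update_coefficient(lam, val_errs, window=3, decay=0.5):
    """Hypothetical dynamic update: shrink a task's coefficient once its
    recent validation error stops improving (plateau detection)."""
    if len(val_errs) >= 2 * window:
        recent = np.mean(val_errs[-window:])
        earlier = np.mean(val_errs[-2 * window:-window])
        if recent >= earlier:          # no longer improving
            lam *= decay
    return lam

# An auxiliary task whose error has plateaued gets down-weighted;
# one that is still improving keeps its coefficient.
lam_plateau = update_coefficient(1.0, [0.5] * 6)                    # 0.5
lam_improving = update_coefficient(1.0, [0.9, 0.8, 0.7, 0.6, 0.5, 0.4])  # 1.0
```

The design intent is the one stated in the abstract: auxiliary tasks regularise the landmark detector early on, then fade out as they converge rather than dragging the shared representation.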
Joint Multi-view Face Alignment in the Wild
The de facto algorithm for facial landmark estimation involves running a face
detector with a subsequent deformable model fitting on the bounding box. This
encompasses two basic problems: i) the detection and deformable fitting steps
are performed independently, even though the detector might not provide the
best-suited initialisation for the fitting step; ii) the face appearance varies
hugely across different poses, which makes deformable face fitting very
challenging, so distinct models have to be used (e.g., one for profile and
one for frontal faces). In this work, we propose the first, to the best of our
knowledge, joint multi-view convolutional network to handle large pose
variations across faces in-the-wild, and elegantly bridge face detection and
facial landmark localisation tasks. Existing joint face detection and landmark
localisation methods focus only on a very small set of landmarks. By contrast,
our method can detect and align a large number of landmarks for semi-frontal
(68 landmarks) and profile (39 landmarks) faces. We evaluate our model on a
plethora of datasets including standard static image datasets such as IBUG,
300W, COFW, and the latest Menpo Benchmark for both semi-frontal and profile
faces. Significant improvement over state-of-the-art deformable face tracking
methods is observed on the 300VW benchmark. We also demonstrate
state-of-the-art results for face detection on the FDDB and MALF datasets.
Comment: submitted to IEEE Transactions on Image Processing.
High-Resolution Representations for Labeling Pixels and Regions
High-resolution representation learning plays an essential role in many
vision problems, e.g., pose estimation and semantic segmentation. The
high-resolution network (HRNet) [SunXLW19], recently developed for human
pose estimation, maintains high-resolution representations through the whole
process by connecting high-to-low resolution convolutions in parallel
and produces strong high-resolution representations by repeatedly conducting
fusions across parallel convolutions.
In this paper, we conduct a further study on high-resolution representations
by introducing a simple yet effective modification and apply it to a wide range
of vision tasks. We augment the high-resolution representation by aggregating
the (upsampled) representations from all the parallel convolutions rather than
only the representation from the high-resolution convolution as done
in [SunXLW19]. This simple modification leads to stronger representations,
evidenced by superior results. We show top results in semantic segmentation on
Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW,
COFW, 300W, and WFLW. In addition, we build a multi-level representation from
the high-resolution representation and apply it to the Faster R-CNN object
detection framework and the extended frameworks. The proposed approach achieves
superior results to existing single-model networks on COCO object detection.
The code and models are publicly available at
https://github.com/HRNet.
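The aggregation modification described above can be sketched with numpy. A minimal illustration with hypothetical channel counts (nearest-neighbour upsampling is used here for simplicity; the actual network uses learned fusions): every lower-resolution branch is upsampled to the highest resolution and concatenated along channels, rather than keeping only the high-resolution branch.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def aggregate_branches(branches):
    """HRNetV2-style head sketch: upsample every lower-resolution branch
    to the highest resolution and concatenate along the channel axis."""
    _, H, W = branches[0].shape
    ups = [upsample_nearest(b, H // b.shape[1]) for b in branches]
    return np.concatenate(ups, axis=0)

# Four parallel branches at strides 1, 2, 4, 8 (hypothetical channel counts).
branches = [np.ones((c, 32 // s, 32 // s))
            for c, s in [(18, 1), (36, 2), (72, 4), (144, 8)]]
rep = aggregate_branches(branches)   # shape (270, 32, 32)
```

The concatenated map keeps information from all resolutions at full spatial size, which is why it helps dense-prediction tasks such as segmentation and landmark detection.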
Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250 Hz
The reconstruction of dense 3D models of face geometry and appearance from a
single image is highly challenging and ill-posed. To constrain the problem,
many approaches rely on strong priors, such as parametric face models learned
from limited 3D scan data. However, prior models restrict generalization of the
true diversity in facial geometry, skin reflectance and illumination. To
alleviate this problem, we present the first approach that jointly learns 1) a
regressor for face shape, expression, reflectance and illumination on the basis
of 2) a concurrently learned parametric face model. Our multi-level face model
combines the advantage of 3D Morphable Models for regularization with the
out-of-space generalization of a learned corrective space. We train end-to-end
on in-the-wild images without dense annotations by fusing a convolutional
encoder with a differentiable expert-designed renderer and a self-supervised
training loss, both defined at multiple detail levels. Our approach compares
favorably to the state-of-the-art in terms of reconstruction quality, better
generalizes to real-world faces, and runs at over 250 Hz.
Comment: CVPR 2018 (Oral). Project webpage:
https://gvv.mpi-inf.mpg.de/projects/FML
Self-supervised CNN for Unconstrained 3D Facial Performance Capture from an RGB-D Camera
We present a novel method for real-time 3D facial performance capture with
consumer-level RGB-D sensors. Our capturing system is targeted at robust and
stable 3D face capturing in the wild, in which the RGB-D facial data contain
noise, imperfection and occlusion, and often exhibit high variability in
motion, pose, expression and lighting conditions, thus posing great challenges.
The technical contribution is a self-supervised deep learning framework, which
is trained directly from raw RGB-D data. The key novelties include: (1)
learning both the core tensor and the parameters for refining our parametric
face model; (2) using vertex displacement and UV map for learning surface
detail; (3) designing the loss function by incorporating temporal coherence and
same identity constraints based on pairs of RGB-D images and utilizing sparse
norms, in addition to the conventional terms for photo-consistency, feature
similarity, regularization as well as geometry consistency; and (4) augmenting
the training data set in new ways. The method is demonstrated in a live setup
that runs in real-time on a smartphone and an RGB-D sensor. Extensive
experiments show that our method is robust to severe occlusion, fast motion,
large rotation, exaggerated facial expressions and diverse lighting
conditions.
Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition
Occlusion and pose variations, which can change facial appearance
significantly, are two major obstacles for automatic Facial Expression
Recognition (FER). Though automatic FER has made substantial progress in the
past few decades, the occlusion-robust and pose-invariant aspects of FER have
received relatively little attention, especially in real-world scenarios. This
paper addresses the real-world pose and occlusion robust FER problem with
three-fold contributions. First, to stimulate the research of FER under
real-world occlusions and variant poses, we build several in-the-wild facial
expression datasets with manual annotations for the community. Second, we
propose a novel Region Attention Network (RAN), to adaptively capture the
importance of facial regions for occlusion and pose variant FER. The RAN
aggregates and embeds a varying number of region features produced by a backbone
convolutional neural network into a compact fixed-length representation. Last,
inspired by the fact that facial expressions are mainly defined by facial
action units, we propose a region biased loss to encourage high attention
weights for the most important regions. We validate our RAN and region biased
loss on both our built test datasets and four popular datasets: FERPlus,
AffectNet, RAF-DB, and SFEW. Extensive experiments show that our RAN and region
biased loss largely improve the performance of FER with occlusion and variant
pose. Our method also achieves state-of-the-art results on FERPlus, AffectNet,
RAF-DB, and SFEW. Code and the collected test data will be publicly available.
Comment: The test set and the code of this paper will be available at
https://github.com/kaiwang960112/Challenge-condition-FER-dataset
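The region biased loss above can be sketched as a margin constraint. This is a minimal illustration assuming one plausible form (the largest cropped-region attention weight should exceed the whole-face attention weight by a margin); the exact formulation in the paper may differ.

```python
def region_biased_loss(region_weights, face_weight, margin=0.02):
    """Margin-loss sketch: penalise attention distributions in which no
    cropped region clearly dominates the whole-face weight."""
    mu_max = max(region_weights)                  # most attended region
    return max(0.0, margin - (mu_max - face_weight))

# A region already dominates -> no penalty; a near-tie -> small penalty
# that pushes the top region's attention weight up during training.
no_penalty = region_biased_loss([0.5, 0.3], 0.4)   # 0.0
penalty = region_biased_loss([0.41], 0.40)         # ~0.01
```

The hinge shape means the loss only fires while the constraint is violated, so it biases rather than dictates the attention weights.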
Facial Landmark Detection with Tweaked Convolutional Neural Networks
We present a novel convolutional neural network (CNN) design for facial
landmark coordinate regression. We examine the intermediate features of a
standard CNN trained for landmark detection and show that features extracted
from later, more specialized layers capture rough landmark locations. This
provides a natural means of applying differential treatment midway through the
network, tweaking processing based on facial alignment. The resulting Tweaked
CNN model (TCNN) harnesses the robustness of CNNs for landmark detection, in an
appearance-sensitive manner without training multi-part or multi-scale models.
Our results on standard face landmark detection and face verification
benchmarks show that TCNN surpasses previously published results by wide
margins.
Comment: First two authors had joint first authorship / equal contribution.