54 research outputs found
SfSNet: Learning Shape, Reflectance and Illuminance of Faces in the Wild
We present SfSNet, an end-to-end learning framework for producing an accurate
decomposition of an unconstrained human face image into shape, reflectance and
illuminance. SfSNet is designed to reflect a physical lambertian rendering
model. SfSNet learns from a mixture of labeled synthetic and unlabeled real
world images. This allows the network to capture low frequency variations from
synthetic and high frequency details from real images through the photometric
reconstruction loss. SfSNet consists of a new decomposition architecture with
residual blocks that learns a complete separation of albedo and normal. This is
used along with the original image to predict lighting. SfSNet produces
significantly better quantitative and qualitative results than state-of-the-art
methods for inverse rendering and independent normal and illumination
estimation.Comment: Accepted to CVPR 2018 (Spotlight
Gender recognition from facial images: Two or three dimensions?
© 2016 Optical Society of America. This paper seeks to compare encoded features from both two-dimensional (2D) and three-dimensional (3D) face images in order to achieve automatic gender recognition with high accuracy and robustness. The Fisher vector encoding method is employed to produce 2D, 3D, and fused features with escalated discriminative power. For 3D face analysis, a two-source photometric stereo (PS) method is introduced that enables 3D surface reconstructions with accurate details as well as desirable efficiency. Moreover, a 2D + 3D imaging device, taking the two-source PS method as its core, has been developed that can simultaneously gather color images for 2D evaluations and PS images for 3D analysis. This system inherits the superior reconstruction accuracy from the standard (three or more light) PS method but simplifies the reconstruction algorithm as well as the hardware design by only requiring two light sources. It also offers great potential for facilitating human computer interaction by being accurate, cheap, efficient, and nonintrusive. Ten types of low-level 2D and 3D features have been experimented with and encoded for Fisher vector gender recognition. Evaluations of the Fisher vector encoding method have been performed on the FERET database, Color FERET database, LFW database, and FRGCv2 database, yielding 97.7%, 98.0%, 92.5%, and 96.7% accuracy, respectively. In addition, the comparison of 2D and 3D features has been drawn from a self-collected dataset, which is constructed with the aid of the 2D + 3D imaging device in a series of data capture experiments. With a variety of experiments and evaluations, it can be proved that the Fisher vector encoding method outperforms most state-of-the-art gender recognition methods. It has also been observed that 3D features reconstructed by the two-source PS method are able to further boost the Fisher vector gender recognition performance, i.e., up to a 6% increase on the self-collected database
Disentangling the modes of variation in unlabelled data
Statistical methods are of paramount importance in discovering the modes of variation in visual data. The Principal Component Analysis (PCA) is probably the most prominent method for extracting a single mode of variation in the data. However, in practice, visual data exhibit several modes of variations. For instance, the appearance of faces varies in identity, expression, pose etc. To extract these modes of variations from visual data, several supervised methods, such as the TensorFaces relying on multilinear (tensor) decomposition (e.g., Higher Order SVD) have been developed. The main drawbacks of such methods is that they require both labels regarding the modes of variations and the same number of samples under all modes of variations (e.g., the same face under different expressions, poses etc.). Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. In this paper, we propose a novel general multilinear matrix decomposition method that discovers the multilinear structure of possibly incomplete sets of visual data in unsupervised setting (i.e., without the presence of labels). We also propose extensions of the method with sparsity and low-rank constraints in order to handle noisy data, captured in unconstrained conditions. Besides that, a graph-regularised variant of the method is also developed in order to exploit available geometric or label information for some modes of variations. We demonstrate the applicability of the proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the wild and with occlusion removal), expression transfer, and estimation of surface normals from images captured in the wild
Face recognition in 2D and 2.5D using ridgelets and photometric stereo
A new technique for face recognition - Ridgefaces - is presented. The method combines the well-known Fisherface method with the ridgelet transform and high-speed Photometric Stereo (PS). The paper first derives ridgelet projections for 2D/2.5D face images before the Fisherface approach is used to reduce the dimensionality and increase the spread of the resulting feature vectors. The ridgelet transform is attractive because it is efficient at extracting highly discriminating low-frequency directional features. Best recognition is obtained when Ridgefaces is performed on surface normals acquired from PS, although good results are also found using standard 2D images and PS-derived albedo maps. © 2012 Elsevier Ltd. All rights reserved
Disentangling the modes of variation in unlabelled data
Statistical methods are of paramount importance in discovering the modes of variation in visual data. The Principal Component Analysis (PCA) is probably the most prominent method for extracting a single mode of variation in the data. However, in practice, visual data exhibit several modes of variations. For instance, the appearance of faces varies in identity, expression, pose etc. To extract these modes of variations from visual data, several supervised methods, such as the TensorFaces relying on multilinear (tensor) decomposition (e.g., Higher Order SVD) have been developed. The main drawbacks of such methods is that they require both labels regarding the modes of variations and the same number of samples under all modes of variations (e.g., the same face under different expressions, poses etc.). Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. In this paper, we propose a novel general multilinear matrix decomposition method that discovers the multilinear structure of possibly incomplete sets of visual data in unsupervised setting (i.e., without the presence of labels). We also propose extensions of the method with sparsity and low-rank constraints in order to handle noisy data, captured in unconstrained conditions. Besides that, a graph-regularised variant of the method is also developed in order to exploit available geometric or label information for some modes of variations. We demonstrate the applicability of the proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the wild and with occlusion removal), expression transfer, and estimation of surface normals from images captured in the wild
3D face recognition using photometric stereo
Automatic face recognition has been an active research area for the last four decades. This thesis explores innovative bio-inspired concepts aimed at improved face recognition using surface normals. New directions in salient data representation are explored using data captured via a photometric stereo method from the University of the West of England’s “Photoface” device. Accuracy assessments demonstrate the advantage of the capture format and the synergy offered by near infrared light sources in achieving more accurate results than under conventional visible light. Two 3D face databases have been created as part of the thesis – the publicly available Photoface database which contains 3187 images of 453 subjects and the 3DE-VISIR dataset which contains 363 images of 115 people with different expressions captured simultaneously under near infrared and visible light. The Photoface database is believed to be the ?rst to capture naturalistic 3D face models. Subsets of these databases are then used to show the results of experiments inspired by the human visual system. Experimental results show that optimal recognition rates are achieved using surprisingly low resolution of only 10x10 pixels on surface normal data, which corresponds to the spatial frequency range of optimal human performance. Motivated by the observed increase in recognition speed and accuracy that occurs in humans when faces are caricatured, novel interpretations of caricaturing using outlying data and pixel locations with high variance show that performance remains disproportionately high when up to 90% of the data has been discarded. These direct methods of dimensionality reduction have useful implications for the storage and processing requirements for commercial face recognition systems. The novel variance approach is extended to recognise positive expressions with 90% accuracy which has useful implications for human-computer interaction as well as ensuring that a subject has the correct expression prior to recognition. Furthermore, the subject recognition rate is improved by removing those pixels which encode expression. Finally, preliminary work into feature detection on surface normals by extending Haar-like features is presented which is also shown to be useful for correcting the pose of the head as part of a fully operational device. The system operates with an accuracy of 98.65% at a false acceptance rate of only 0.01 on front facing heads with neutral expressions. The work has shown how new avenues of enquiry inspired by our observation of the human visual system can offer useful advantages towards achieving more robust autonomous computer-based facial recognition
Multilinear methods for disentangling variations with applications to facial analysis
Several factors contribute to the appearance of an object in a visual scene, including pose,
illumination, and deformation, among others. Each factor accounts for a source of variability
in the data. It is assumed that the multiplicative interactions of these factors emulate the
entangled variability, giving rise to the rich structure of visual object appearance. Disentangling
such unobserved factors from visual data is a challenging task, especially when the data have
been captured in uncontrolled recording conditions (also referred to as “in-the-wild”) and label
information is not available. The work presented in this thesis focuses on disentangling the
variations contained in visual data, in particular applied to 2D and 3D faces. The motivation
behind this work lies in recent developments in the field, such as (i) the creation of large, visual
databases for face analysis, with (ii) the need of extracting information without the use of labels
and (iii) the need to deploy systems under demanding, real-world conditions.
In the first part of this thesis, we present a method to synthesise plausible 3D expressions
that preserve the identity of a target subject. This method is supervised as the model uses
labels, in this case 3D facial meshes of people performing a defined set of facial expressions, to
learn. The ability to synthesise an entire facial rig from a single neutral expression has a large
range of applications both in computer graphics and computer vision, ranging from the ecient
and cost-e↵ective creation of CG characters to scalable data generation for machine learning
purposes. Unlike previous methods based on multilinear models, the proposed approach is
capable to extrapolate well outside the sample pool, which allows it to accurately reproduce
the identity of the target subject and create artefact-free expression shapes while requiring
only a small input dataset. We introduce global-local multilinear models that leverage the
strengths of expression-specific and identity-specific local models combined with coarse motion
estimations from a global model. The expression-specific and identity-specific local models
are built from di↵erent slices of the patch-wise local multilinear model. Experimental results
show that we achieve high-quality, identity-preserving facial expression synthesis results that
outperform existing methods both quantitatively and qualitatively.
In the second part of this thesis, we investigate how the modes of variations from visual data
can be extracted. Our assumption is that visual data has an underlying structure consisting of
factors of variation and their interactions. Finding this structure and the factors is important
as it would not only help us to better understand visual data but once obtained we can edit the factors for use in various applications. Shape from Shading and expression transfer are just two
of the potential applications. To extract the factors of variation, several supervised methods
have been proposed but they require both labels regarding the modes of variations and the same
number of samples under all modes of variations. Therefore, their applicability is limited to
well-organised data, usually captured in well-controlled conditions. We propose a novel general
multilinear matrix decomposition method that discovers the multilinear structure of possibly
incomplete sets of visual data in unsupervised setting. We demonstrate the applicability of the
proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the
wild and with occlusion removal), expression transfer, and estimation of surface normals from
images captured in the wild.
Finally, leveraging the unsupervised multilinear method proposed as well as recent advances in
deep learning, we propose a weakly supervised deep learning method for disentangling multiple
latent factors of variation in face images captured in-the-wild. To this end, we propose a deep
latent variable model, where we model the multiplicative interactions of multiple latent factors
of variation explicitly as a multilinear structure. We demonstrate that the proposed approach
indeed learns disentangled representations of facial expressions and pose, which can be used in
various applications, including face editing, as well as 3D face reconstruction and classification
of facial expression, identity and pose.Open Acces
Long-range concealed object detection through active covert illumination
© 2015 SPIE. When capturing a scene for surveillance, the addition of rich 3D data can dramatically improve the accuracy of object detection or face recognition. Traditional 3D techniques, such as geometric stereo, only provide a coarse grained reconstruction of the scene and are ill-suited to fine analysis. Photometric stereo is a well established technique providing dense, high-resolution, reconstructions, using active artificial illumination of an object from multiple directions to gather surface information. It is typically used indoors, at short range
- …