FML: Face Model Learning from Videos
Monocular image-based 3D reconstruction of faces is a long-standing problem
in computer vision. Since image data is a 2D projection of a 3D face, the
resulting depth ambiguity makes the problem ill-posed. Most existing methods
rely on data-driven priors that are built from limited 3D face scans. In
contrast, we propose multi-frame video-based self-supervised training of a deep
network that (i) learns a face identity model both in shape and appearance
while (ii) jointly learning to reconstruct 3D faces. Our face model is learned
using only corpora of in-the-wild video clips collected from the Internet. This
virtually endless source of training data enables learning of a highly general
3D face model. In order to achieve this, we propose a novel multi-frame
consistency loss that ensures consistent shape and appearance across multiple
frames of a subject's face, thus minimizing depth ambiguity. At test time we
can use an arbitrary number of frames, so that we can perform both monocular as
well as multi-frame reconstruction.

Comment: CVPR 2019 (Oral). Video: https://www.youtube.com/watch?v=SG2BwxCw0lQ, Project Page: https://gvv.mpi-inf.mpg.de/projects/FML19
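As a rough illustration of the core idea, the sketch below shows one way such a multi-frame consistency loss could be written: a photometric term per frame plus a penalty on disagreement between per-frame identity estimates of the same subject. All names and shapes are hypothetical placeholders, not the authors' code.

```python
# A minimal sketch of a multi-frame consistency loss, assuming per-frame
# network outputs for one subject. Names and shapes are hypothetical.
import numpy as np

def multi_frame_consistency_loss(shapes, albedos, renders, frames):
    """shapes:  (F, V, 3) per-frame shape estimates of one subject
       albedos: (F, T, 3) per-frame albedo estimates
       renders: (F, H, W, 3) renderings of the estimates
       frames:  (F, H, W, 3) the input video frames"""
    # Photometric term: each rendering should reproduce its frame.
    photo = np.mean((renders - frames) ** 2)
    # Consistency term: identity quantities (shape, albedo) must agree
    # across frames, which is what counteracts the depth ambiguity of a
    # single 2D projection.
    consist = np.mean((shapes - shapes.mean(axis=0)) ** 2) \
            + np.mean((albedos - albedos.mean(axis=0)) ** 2)
    return photo + consist
```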
Face Transfer with Multilinear Models
Face Transfer is a method for mapping video-recorded performances of one individual to facial animations of another. It extracts visemes (speech-related mouth articulations), expressions, and three-dimensional (3D) pose from monocular video or film footage. These parameters are then used to generate and drive a detailed 3D textured face mesh for a target identity, which can be seamlessly rendered back into target footage. The underlying face model automatically adjusts for how the target performs facial expressions and visemes. The performance data can be easily edited to change the visemes, expressions, pose, or even the identity of the target: the attributes are separably controllable. This supports a wide variety of video rewrite and puppetry applications.

Face Transfer is based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes (e.g., identity, expression, and viseme). Separability means that each of these attributes can be independently varied. A multilinear model can be estimated from a Cartesian product of examples (identities x expressions x visemes) with techniques from statistical analysis, but only after careful preprocessing of the geometric data set to secure one-to-one correspondence, to minimize cross-coupling artifacts, and to fill in any missing examples. Face Transfer offers new solutions to these problems and links the estimated model with a face-tracking algorithm to extract pose, expression, and viseme parameters.
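To make the separability concrete, here is a minimal sketch of evaluating such a multilinear model with NumPy. The core tensor and mode dimensions are invented for illustration; the actual model is estimated from the Cartesian product of example meshes described above.

```python
# A minimal sketch of evaluating a multilinear face model; the core tensor
# and mode dimensions are invented for illustration only.
import numpy as np

V = 3 * 1000                            # stacked xyz of a 1000-vertex mesh
core = np.random.randn(V, 20, 10, 8)    # vertices x identity x expression x viseme

def synthesize(core, w_id, w_expr, w_vis):
    # Contract each attribute mode with its parameter vector (mode products).
    # Separability: varying one weight vector leaves the others untouched.
    return np.einsum('vijk,i,j,k->v', core, w_id, w_expr, w_vis)

mesh = synthesize(core, np.random.randn(20),
                  np.random.randn(10), np.random.randn(8))
print(mesh.shape)                       # (3000,), reshape to (1000, 3)
```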
Estimation of 3D Faces and Illumination from Single Photographs Using a Bilinear Illumination Model
3D face modeling is still one of the biggest challenges in computer graphics. In this paper we present a novel framework that acquires the 3D shape, texture, pose and illumination of a face from a single photograph. Additionally, we show how we can recreate a face under varying illumination conditions, essentially relighting it. Using a custom-built face scanning system, we have collected 3D face scans and light reflection images of a large and diverse group of human subjects. We derive a morphable face model for 3D face shapes and accompanying textures by transforming the data into a linear vector sub-space. The acquired images of faces under variable illumination are then used to derive a bilinear illumination model that spans 3D face shape and illumination variations. Using both models, we in turn propose a novel fitting framework that estimates the parameters of the morphable model given a single photograph. Our framework can deal with complex face reflectance and lighting environments in an efficient and robust manner. In the results section of our paper, we compare our method to existing ones and demonstrate its efficacy in reconstructing 3D face models from a single photograph. We also provide several examples of facial relighting (on 2D images) by performing adequate decomposition of the estimated illumination using our framework.
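The following sketch illustrates the two ingredients the fitting framework combines: a linear morphable model for shape and a bilinear model coupling shape and illumination coefficients. All bases and dimensions below are hypothetical, chosen only to show the structure.

```python
# A minimal sketch of a linear morphable model plus a bilinear illumination
# model. Bases and dimensions are hypothetical.
import numpy as np

n_verts, k_shape, k_light = 1000, 40, 9           # e.g. 9 spherical-harmonic lights
mean_shape = np.random.randn(3 * n_verts)
U_shape = np.random.randn(3 * n_verts, k_shape)   # PCA shape basis
B = np.random.randn(n_verts, k_shape, k_light)    # bilinear illumination tensor

def reconstruct(alpha, ell):
    """alpha: (k_shape,) shape coefficients; ell: (k_light,) lighting coefficients."""
    shape = mean_shape + U_shape @ alpha          # linear morphable model
    # Bilinear: per-vertex appearance is linear in alpha for fixed ell,
    # and linear in ell for fixed alpha.
    appearance = np.einsum('vij,i,j->v', B, alpha, ell)
    return shape, appearance
```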
A 3D morphable model learnt from 10,000 faces
We present the Large Scale Facial Model (LSFM), a 3D Morphable Model (3DMM) automatically constructed from 9,663 distinct facial identities. To the best of our knowledge, LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel, fully automated and robust Morphable Model construction pipeline. The dataset that LSFM is trained on includes rich demographic information about each subject, allowing for the construction of not only a global 3DMM but also models tailored to specific age, gender or ethnicity groups. As an application example, we utilise the proposed model to perform age classification from 3D shape alone. Furthermore, we perform a systematic analysis of the constructed 3DMMs that showcases their quality and descriptive power. The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin. Finally, for the benefit of the research community, we make publicly available the source code of the proposed automatic 3DMM construction pipeline. In addition, the constructed global 3DMM and a variety of bespoke models tailored by age, gender and ethnicity are available on application to researchers involved in medically oriented research.

J. Booth is funded by an EPSRC DTA from Imperial College London, and holds a Qualcomm Innovation Fellowship. A. Roussos is funded by the Great Ormond Street Hospital Children's Charity (Face Value: W1037). The work of S. Zafeiriou was partially funded by the EPSRC project EP/J017787/1 (4D-FAB).
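At the statistical core of such a pipeline is a PCA over densely registered meshes. The sketch below shows only that final step, assuming all meshes are already in dense correspondence; the automatic landmarking, registration and pruning stages of the LSFM pipeline are not reproduced here.

```python
# A minimal sketch of the PCA step of 3DMM construction, assuming meshes
# are already densely registered (one vectorised mesh per row).
import numpy as np

def build_3dmm(meshes, n_components):
    """meshes: (N, 3V) array of registered meshes."""
    mean = meshes.mean(axis=0)
    X = meshes - mean
    # SVD of the centered data matrix yields the principal shape components.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:n_components].T                   # (3V, n_components)
    stdev = S[:n_components] / np.sqrt(len(meshes) - 1)
    return mean, basis, stdev
```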
Multilinear methods for disentangling variations with applications to facial analysis
Several factors contribute to the appearance of an object in a visual scene, including pose,
illumination, and deformation, among others. Each factor accounts for a source of variability
in the data. It is assumed that the multiplicative interactions of these factors emulate the
entangled variability, giving rise to the rich structure of visual object appearance. Disentangling
such unobserved factors from visual data is a challenging task, especially when the data have
been captured in uncontrolled recording conditions (also referred to as “in-the-wild”) and label
information is not available. The work presented in this thesis focuses on disentangling the
variations contained in visual data, in particular applied to 2D and 3D faces. The motivation
behind this work lies in recent developments in the field, such as (i) the creation of large visual
databases for face analysis, (ii) the need to extract information without the use of labels,
and (iii) the need to deploy systems under demanding, real-world conditions.
In the first part of this thesis, we present a method to synthesise plausible 3D expressions
that preserve the identity of a target subject. This method is supervised: the model learns from
labels, in this case 3D facial meshes of people performing a defined set of facial expressions.
The ability to synthesise an entire facial rig from a single neutral expression has a wide
range of applications in both computer graphics and computer vision, from the efficient
and cost-effective creation of CG characters to scalable data generation for machine learning
purposes. Unlike previous methods based on multilinear models, the proposed approach is
capable of extrapolating well outside the sample pool, which allows it to accurately reproduce
the identity of the target subject and create artefact-free expression shapes while requiring
only a small input dataset. We introduce global-local multilinear models that leverage the
strengths of expression-specific and identity-specific local models combined with coarse motion
estimations from a global model. The expression-specific and identity-specific local models
are built from different slices of the patch-wise local multilinear model. Experimental results
show that we achieve high-quality, identity-preserving facial expression synthesis results that
outperform existing methods both quantitatively and qualitatively.
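The slicing idea can be illustrated in a few lines: fixing one mode of a patch-wise multilinear model yields a linear model in the remaining mode. Dimensions and names below are illustrative only, not the thesis implementation.

```python
# A minimal sketch of deriving local linear models by slicing a patch-wise
# multilinear model. Dimensions are hypothetical.
import numpy as np

patch_core = np.random.randn(300, 50, 25)   # patch verts x identity x expression

def expression_specific_model(patch_core, w_expr):
    # Fixing the expression mode yields a linear identity model for this patch.
    return np.einsum('vie,e->vi', patch_core, w_expr)   # (verts, identity basis)

def identity_specific_model(patch_core, w_id):
    # Fixing the identity mode yields a linear expression model for this subject.
    return np.einsum('vie,i->ve', patch_core, w_id)     # (verts, expression basis)
```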
In the second part of this thesis, we investigate how the modes of variations from visual data
can be extracted. Our assumption is that visual data has an underlying structure consisting of
factors of variation and their interactions. Finding this structure and its factors is important:
not only would it help us better understand visual data, but once obtained, the factors can be
edited for use in various applications. Shape from Shading and expression transfer are just two
of the potential applications. To extract the factors of variation, several supervised methods
have been proposed, but they require both labels for the modes of variation and the same
number of samples under all modes of variation. Therefore, their applicability is limited to
well-organised data, usually captured in well-controlled conditions. We propose a novel general
multilinear matrix decomposition method that discovers the multilinear structure of possibly
incomplete sets of visual data in an unsupervised setting. We demonstrate the applicability of the
proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the
wild and with occlusion removal), expression transfer, and estimation of surface normals from
images captured in the wild.
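As a point of reference for this setting, the sketch below shows a plain HOSVD: a truncated SVD of each mode unfolding, followed by projection of the core. The thesis method goes further by handling incomplete data and working without labels, which this sketch does not attempt.

```python
# A minimal HOSVD sketch: one truncated SVD per mode unfolding, then the
# core tensor by multilinear projection. No missing-data handling.
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        # Contract mode `mode` of the core with U^T.
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors
```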
Finally, leveraging the proposed unsupervised multilinear method as well as recent advances in
deep learning, we propose a weakly supervised deep learning method for disentangling multiple
latent factors of variation in face images captured in-the-wild. To this end, we propose a deep
latent variable model, where we model the multiplicative interactions of multiple latent factors
of variation explicitly as a multilinear structure. We demonstrate that the proposed approach
indeed learns disentangled representations of facial expressions and pose, which can be used in
various applications, including face editing, as well as 3D face reconstruction and classification
of facial expression, identity and pose.
Geometric Expression Invariant 3D Face Recognition using Statistical Discriminant Models
Currently there is no complete face recognition system that is invariant to all facial expressions.
Although humans find it easy to identify and recognise faces regardless of changes in illumination,
pose and expression, producing a computer system with a similar capability has proved to
be particularly difficult. Three-dimensional face models are geometric in nature and therefore
have the advantage of being invariant to head pose and lighting. However, they are still susceptible
to facial expressions. This can be seen in the decline of recognition rates for principal
component analysis when expressions are added to a data set.
In order to achieve expression-invariant face recognition, we have employed a tensor
algebra framework to represent 3D face data with facial expressions in a parsimonious
space. Face variation factors are organised in particular subject and facial expression modes.
We manipulate this using singular value decomposition on sub-tensors representing one variation
mode. This framework possesses the ability to deal with the shortcomings of PCA in less constrained
environments and still preserves the integrity of the 3D data. The results show improved
recognition rates for faces and facial expressions, even recognising high-intensity expressions
that are not in the training datasets.
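A heavily simplified stand-in for this kind of recognition pipeline is sketched below: build a basis spanning faces of all subjects under all expressions, project a probe into it, and match by nearest neighbour in coefficient space. The names and tensor layout are illustrative only, not the thesis implementation.

```python
# A minimal sketch of subspace-based face recognition over a tensor of
# faces organised by subject and expression. Layout is hypothetical.
import numpy as np

def feature_basis(data, rank):
    """data: (n_subjects, n_expressions, d) vectorised 3D faces."""
    # The SVD of the feature-mode unfolding spans faces of all subjects
    # under all expressions, so coefficients in this basis are less
    # expression-sensitive than a per-expression PCA.
    X = data.reshape(-1, data.shape[-1])      # (subjects*expressions, d)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:rank].T                        # (d, rank)

def identify(probe, B, gallery, labels):
    """probe: (d,); gallery: (n, d), one enrolled face per subject."""
    c = probe @ B
    dists = np.linalg.norm(gallery @ B - c, axis=1)
    return labels[int(np.argmin(dists))]
```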
We have determined, experimentally, a set of anatomical landmarks that describe facial
expressions most effectively. We found that the best placement of landmarks for distinguishing
different facial expressions is in areas around the prominent features, such as the cheeks and
eyebrows. Recognition results using landmark-based face recognition could be improved with
better placement.
We looked into the possibility of achieving expression-invariant face recognition by reconstructing
and manipulating realistic facial expressions. We proposed a tensor-based statistical
discriminant analysis method to reconstruct facial expressions and in particular to neutralise
facial expressions. The synthesised facial expressions are visually more realistic
than those generated using conventional active shape modelling (ASM). We
then used reconstructed neutral faces in the sub-tensor framework for recognition purposes.
The recognition results showed a slight improvement. Besides biometric recognition, this novel
tensor-based synthesis approach could be used in computer games and real-time animation
applications.
Bayesian Robust Tensor Factorization for Incomplete Multiway Data
We propose a generative model for robust tensor factorization in the presence
of both missing data and outliers. The objective is to explicitly infer the
underlying low-CP-rank tensor capturing the global information and a sparse
tensor capturing the local information (also considered as outliers), thus
providing the robust predictive distribution over missing entries. The
low-CP-rank tensor is modeled by multilinear interactions between multiple
latent factors on which the column sparsity is enforced by a hierarchical
prior, while the sparse tensor is modeled by a hierarchical view of the Student-t
distribution that associates an individual hyperparameter with each element
independently. For model learning, we develop an efficient closed-form
variational inference under a fully Bayesian treatment, which can effectively
prevent the overfitting problem and scales linearly with data size. In contrast
to existing related works, our method can perform model selection automatically
and implicitly without the need to tune parameters. More specifically, it can
discover the ground-truth CP rank and automatically adapt the sparsity-inducing
priors to various types of outliers. In addition, the tradeoff between
the low-rank approximation and the sparse representation can be optimized in
the sense of maximum model evidence. The extensive experiments and comparisons
with many state-of-the-art algorithms on both synthetic and real-world datasets
demonstrate the superiority of our method from several perspectives.

Comment: in IEEE Transactions on Neural Networks and Learning Systems, 201
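For orientation, the sketch below shows a basic CP factorization with missing entries via masked alternating least squares. The paper's method is fully Bayesian, infers the CP rank automatically, and models outliers with a sparse Student-t term; none of that is attempted here, and all names are illustrative.

```python
# A minimal sketch of CP factorization with missing entries via masked
# alternating least squares (not the paper's Bayesian inference).
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als_masked(T, mask, rank, n_iter=50, seed=0):
    """T: (I, J, K) data tensor; mask: same shape, 1 where observed."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((d, rank)) for d in T.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [f for m, f in enumerate(factors) if m != mode]
            # Khatri-Rao product of the other two factors, ordered to match
            # the columns of the mode-n unfolding below.
            KR = (others[0][:, None, :] * others[1][None, :, :]).reshape(-1, rank)
            Tn, Mn = unfold(T, mode), unfold(mask, mode).astype(bool)
            for i in range(T.shape[mode]):
                if Mn[i].any():
                    # Masked least squares for one row of this factor.
                    factors[mode][i], *_ = np.linalg.lstsq(
                        KR[Mn[i]], Tn[i, Mn[i]], rcond=None)
    return factors
```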