Subspace Representations for Robust Face and Facial Expression Recognition
Analyzing human faces and modeling their variations have always been of interest to the computer vision community. Face analysis based on 2D intensity images is a challenging problem, complicated by variations in pose, lighting, blur, and non-rigid facial deformations due to facial expressions. Among the different sources of variation, facial expressions are of particular interest as important channels of non-verbal communication. Facial expression analysis is also affected by changes in viewpoint and inter-subject variations in performing different expressions. This dissertation attempts to address some of the challenges involved in developing robust algorithms for face and facial expression recognition by exploiting the idea of proper subspace representations for data.
Variations in the visual appearance of an object arise mostly from changes in illumination and pose. We therefore first present a video-based sequential algorithm for estimating the face albedo as an illumination-insensitive signature for face recognition. We show that, by knowing or estimating the pose of the face at each frame of a sequence, the albedo can be efficiently estimated using a Kalman filter. We then extend this to the case of unknown pose by simultaneously tracking the pose and updating the albedo through an efficient Bayesian inference method performed using a Rao-Blackwellized particle filter.
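As a rough illustration of the sequential estimation idea, the sketch below runs a scalar Kalman filter that refines a single pixel's albedo from noisy intensities under a simple Lambertian model, I_t = rho * s_t + noise, with known shading s_t. The shading range, noise variances, and initial values are illustrative assumptions, not the thesis's actual formulation.

```python
import numpy as np

def kalman_albedo_update(rho, P, I_t, s_t, R=0.01):
    """One scalar Kalman update of a per-pixel albedo estimate rho.

    Lambertian model: I_t = rho * s_t + noise, where s_t is the shading
    implied by the (known or estimated) pose and lighting at frame t.
    P is the current albedo variance, R the image-noise variance.
    """
    y = I_t - rho * s_t          # innovation: observed minus predicted intensity
    S = s_t * P * s_t + R        # innovation variance for observation model h = s_t
    K = P * s_t / S              # Kalman gain
    rho_new = rho + K * y
    P_new = (1.0 - K * s_t) * P
    return rho_new, P_new

# Toy sequence: a true albedo of 0.7 observed under varying shading.
rng = np.random.default_rng(0)
rho, P = 0.5, 1.0                # initial guess and its uncertainty
true_rho = 0.7
for _ in range(200):
    s = rng.uniform(0.3, 1.0)                 # shading from pose/lighting
    I = true_rho * s + rng.normal(0, 0.1)     # noisy intensity observation
    rho, P = kalman_albedo_update(rho, P, I, s, R=0.01)
print(round(rho, 2))             # estimate approaches true_rho
```

Each frame tightens the estimate, which is why a sequential filter suits video input: the albedo variance P shrinks as evidence accumulates.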
Since understanding the effects of blur, especially motion blur, is an important problem in unconstrained visual analysis, we then propose a blur-robust recognition algorithm for faces with spatially varying blur. We model a blurred face as a weighted average of geometrically transformed instances of its clean face. For each gallery face we then build a matrix whose columns span the space of all motion-blurred images obtainable from the clean face. This matrix representation is then used to define a proper objective function and perform blur-robust face recognition.
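The subspace idea can be sketched on toy 1-D "faces": each gallery entry is expanded into a basis of shifted copies (a crude stand-in for the geometrically transformed instances), and a blurred probe is assigned to the gallery whose basis explains it with the smallest least-squares residual. This is a minimal illustration, not the thesis's actual objective function.

```python
import numpy as np

def blur_residual(blurred_probe, basis):
    """Distance of a blurred probe to the span of one gallery face's
    blur basis (columns = differently transformed copies of the clean face)."""
    x, *_ = np.linalg.lstsq(basis, blurred_probe, rcond=None)
    return np.linalg.norm(basis @ x - blurred_probe)

def recognise(blurred_probe, galleries):
    """Return the index of the gallery whose blur subspace best explains the probe."""
    residuals = [blur_residual(blurred_probe, B) for B in galleries]
    return int(np.argmin(residuals))

# Toy example: 1-D "faces"; blur basis = shifted copies (a crude motion blur).
rng = np.random.default_rng(1)

def blur_basis(face, shifts=(0, 1, 2)):
    return np.stack([np.roll(face, s) for s in shifts], axis=1)

faces = [rng.normal(size=64) for _ in range(3)]
galleries = [blur_basis(f) for f in faces]
# Probe: a motion-blurred version of face 2 (average of shifted copies).
probe = (np.roll(faces[2], 0) + np.roll(faces[2], 1) + np.roll(faces[2], 2)) / 3
print(recognise(probe, galleries))
```

Because the probe lies exactly in the span of gallery 2's basis, its residual there is (numerically) zero, while the other galleries leave a large residual.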
To develop robust and generalizable models for expression analysis one needs to break the dependence of the models on the choice of the coordinate frame of the camera. To this end, we build models for expressions on the affine shape-space (Grassmann manifold), as an approximation to the projective shape-space, by using a Riemannian interpretation of deformations that facial expressions cause on different parts of the face. This representation enables us to perform various expression analysis and recognition algorithms without the need for pose normalization as a preprocessing step.
There is a large degree of inter-subject variation in performing various expressions, which poses an important challenge to developing robust facial expression recognition algorithms. To address this challenge, we propose a dictionary-based approach for facial expression analysis by decomposing expressions in terms of action units (AUs). First, we construct an AU dictionary using domain experts' knowledge of AUs. To incorporate high-level knowledge regarding expression decomposition and AUs, we then perform structure-preserving sparse coding by imposing two layers of grouping over the AU-dictionary atoms as well as over the test image matrix columns. We use the computed sparse code matrix for each expressive face to perform expression decomposition and recognition.
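A minimal sketch of the sparse-decomposition step, using plain l1 sparse coding (ISTA) over a random stand-in for the AU dictionary; the thesis's structure-preserving scheme additionally imposes two layers of grouping, which this toy omits.

```python
import numpy as np

def ista(D, y, lam=0.1, steps=500):
    """Iterative soft-thresholding for min_x 0.5*||Dx - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(steps):
        g = D.T @ (D @ x - y)            # gradient of the quadratic term
        x = x - g / L                    # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)  # soft-threshold
    return x

# Toy AU dictionary: each column stands in for one action-unit atom.
rng = np.random.default_rng(2)
D = rng.normal(size=(40, 10))
D /= np.linalg.norm(D, axis=0)
# An "expression" composed of AUs 1 and 4 only.
y = 1.0 * D[:, 1] + 0.8 * D[:, 4]
x = ista(D, y, lam=0.05)
print(np.nonzero(np.abs(x) > 0.1)[0])   # recovered active atoms
```

The sparse code concentrates its mass on the atoms that generated the signal, which is what makes the code usable for expression decomposition downstream.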
Most of the existing methods for the recognition of faces and expressions consider either the expression-invariant face recognition problem or the identity-independent facial expression recognition problem. We propose joint face and facial expression recognition using a dictionary-based component separation (DCS) algorithm. In this approach, the given expressive face is viewed as a superposition of a neutral face component with a facial expression component, which is sparse with respect to the whole image. This assumption leads to a dictionary-based component separation algorithm, which benefits from the ideas of sparsity and morphological diversity. The DCS algorithm uses data-driven dictionaries to decompose an expressive test face into its constituent components. The sparse codes we obtain as a result of this decomposition are then used for joint face and expression recognition.
Synthesization and reconstruction of 3D faces by deep neural networks
The past few decades have witnessed substantial progress in 3D facial modelling and reconstruction, as it is of high importance for many computer vision and graphics applications, including Augmented/Virtual Reality (AR/VR), computer games, movie post-production, image/video editing, medical applications, etc. In the traditional approaches, facial texture and shape are represented as a triangle mesh that can cover identity and expression variation with non-rigid deformation. A dataset of 3D face scans is then densely registered into a common topology in order to construct a linear statistical model. Such models are called 3D Morphable Models (3DMMs) and can be used for 3D face synthesization or reconstruction from a single 2D face image or a few images. The works presented in this thesis focus on the modernization of these traditional techniques in light of recent advances in deep learning and the availability of large-scale datasets.
Since the introduction of 3DMMs over two decades ago, there has been substantial progress, and they are still considered one of the best methodologies for modelling 3D faces. Nevertheless, several aspects of them still need to be upgraded to the "deep era". Firstly, conventional 3DMMs are built by linear statistical approaches such as Principal Component Analysis (PCA), which by its nature omits high-frequency information. While this does little harm to shape, which is often smooth in the original data, texture models are heavily afflicted, losing high-frequency details and photorealism. Secondly, the existing 3DMM fitting approaches rely on very primitive (e.g. RGB values, sparse landmarks) or hand-crafted (e.g. HOG, SIFT) features as supervision, which are sensitive to "in-the-wild" conditions (e.g. lighting, pose, occlusion) or fail to capture identity/expression resemblance with the target image. Finally, the shape, texture, and expression modalities are modelled separately, ignoring the correlations among them and placing a fundamental limit on the synthesization of semantically meaningful 3D faces. Moreover, photorealistic 3D face synthesis has not been studied thoroughly in the literature.
This thesis attempts to address the above-mentioned issues by harnessing the power of deep neural networks and generative adversarial networks, as explained below:
Due to their linear texture models, many state-of-the-art methods are still not capable of reconstructing facial textures with high-frequency details. We take a radically different approach and build a high-quality, detail-preserving texture model with Generative Adversarial Networks (GANs). That is, we utilize GANs to train a very powerful generator of facial texture in UV space, and then show that this generator network can be employed as a statistical texture prior in 3DMM fitting. The resulting texture reconstructions are plausible and photorealistic, as GANs are faithful to the real-data distribution in both the low- and high-frequency domains.
We then revisit the conventional 3DMM fitting approaches, which use non-linear optimization to find the optimal latent parameters that best reconstruct the test image, under a new perspective. We propose to optimize the parameters with the supervision of pretrained deep identity features through our end-to-end differentiable framework. In order to be robust to initialization and to expedite the fitting process, we also propose a novel self-supervised regression-based approach. We demonstrate excellent 3D face reconstructions that are photorealistic and identity-preserving, and achieve, for the first time to the best of our knowledge, facial texture reconstruction with high-frequency details.
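The fitting-by-optimization idea can be sketched with linear stand-ins: a toy "generator" G and a toy "identity feature" network Phi, with gradient descent on the latent parameters to match the target's features. Both networks here are random matrices for illustration only, not the actual deep models used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(6)
# Toy stand-ins: a linear 'generator' G and a linear 'identity feature' net Phi.
G = rng.normal(size=(64, 8))      # latent (8-dim) -> texture (64-dim)
Phi = rng.normal(size=(16, 64))   # texture -> identity features (16-dim)

z_true = rng.normal(size=8)
target_feats = Phi @ G @ z_true   # identity features of the target image

# Fit latent parameters by gradient descent on the identity-feature loss
# 0.5 * ||Phi(G(z)) - target_feats||^2, differentiable end to end.
A = Phi @ G
lr = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the spectral norm
z = np.zeros(8)
for _ in range(2000):
    z = z - lr * A.T @ (A @ z - target_feats)
print(round(float(np.linalg.norm(z - z_true)), 4))
```

With everything linear the loss is quadratic, so gradient descent recovers the true latent exactly; in the real setting the generator and feature network are deep and the same loop is run with automatic differentiation.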
To extend the non-linear texture model to photorealistic 3D face synthesis, we present a methodology that generates high-quality texture, shape, and normals jointly. To do so, we propose a novel GAN that can generate data from different modalities while exploiting their correlations. Furthermore, we demonstrate how the generation can be conditioned on expression, creating faces with various facial expressions. Additionally, we study another approach to photorealistic face synthesis by 3D guidance: we generate 3D faces with a linear 3DMM and then map their 2D renderings to the photorealistic face domain with an image-to-image translation network. Both works demonstrate excellent photorealistic face synthesis and show that the generated faces improve face recognition benchmarks when used as synthetic training data.
Finally, we study expression reconstruction for personalized 3D face models where we improve generalization and robustness of expression encoding. First, we propose a 3D augmentation approach on 2D head-mounted camera images to increase robustness to perspective changes. And, we also propose to train generic expression encoder network by populating the number of identities with a novel multi-id personalized model training architecture in a self-supervised manner. Both approaches show promising results in both qualitative and quantitative experiments.Open Acces
3D scene graph inference and refinement for vision-as-inverse-graphics
The goal of scene understanding is to interpret images, so as to infer the objects present in a scene, their poses, and fine-grained details. This thesis focuses on methods that can provide a much more detailed explanation of the scene than standard bounding boxes or pixel-level segmentation: we infer the underlying 3D scene given only its projection in the form of a single image. We employ the Vision-as-Inverse-Graphics (VIG) paradigm, which (a) infers the latent variables of a scene, such as the objects present and their properties as well as the lighting and the camera, and (b) renders these latent variables to reconstruct the input image. One highly attractive aspect of the VIG approach is that it produces a compact and interpretable representation of the 3D scene in terms of an arbitrary number of objects, called a 'scene graph'. This representation is of key importance, as it is useful if we wish to edit, refine, or interpret the scene, or interact with it.
First, we investigate how recognition models can be used to infer the scene graph given only a single RGB image. These models are trained using realistic synthetic images and corresponding ground-truth scene graphs, obtained from a rich stochastic scene generator. Once the objects have been detected, each object detection is further processed using neural networks to predict the object and global latent variables. This allows the computation of object poses and sizes in 3D scene coordinates, given the camera parameters. This inference of the latent variables in the form of a 3D scene graph acts like the encoder of an autoencoder, with graphics rendering as the decoder.
One of the major challenges is the problem of placing the detected objects in 3D at a reasonable size and distance with respect to the single camera, the parameters of which are unknown. Previous VIG approaches for multiple objects usually only considered a fixed camera, while we allow for variable camera pose. To infer the camera parameters given the votes cast by the detected objects, we introduce a Probabilistic HoughNets framework for combining probabilistic votes, robustified with an outlier model. Each detection provides one noisy low-dimensional manifold in the Hough space, and by intersecting them probabilistically we reduce the uncertainty on the camera parameters.
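In a heavily simplified 1-D form, probabilistic vote combination reduces to a product of Gaussians (a precision-weighted mean) with an outlier test; the actual framework intersects low-dimensional manifolds in Hough space, but the sketch below conveys the uncertainty-reduction idea. All numbers are illustrative.

```python
import numpy as np

def fuse_votes(means, variances, outlier_thresh=3.0):
    """Combine 1-D Gaussian votes on a camera parameter by a product of
    Gaussians (precision-weighted mean), after discarding outlier votes
    that disagree with the median by more than `outlier_thresh` sigmas."""
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    med = np.median(means)
    keep = np.abs(means - med) < outlier_thresh * np.sqrt(variances)
    prec = 1.0 / variances[keep]               # precision of each kept vote
    mu = np.sum(prec * means[keep]) / np.sum(prec)
    var = 1.0 / np.sum(prec)                   # fused variance shrinks
    return mu, var

# Four detected objects vote on the camera height; one vote is a gross outlier.
mu, var = fuse_votes([1.5, 1.6, 1.4, 9.0], [0.04, 0.04, 0.09, 0.04])
print(round(mu, 2), round(var, 3))
```

The fused variance is smaller than any individual vote's variance, which is the sense in which probabilistic intersection "reduces the uncertainty" on the camera parameters.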
Given an initialization of a scene graph, its refinement typically involves computationally expensive and inefficient search through the latent space. Since optimization of the 3D scene corresponding to an image is a challenging task even for a few latent variables (LVs), previous work for multi-object scenes considered only refinement of the geometry, but not the appearance or illumination. To overcome this issue, we develop a framework called 'Learning Direct Optimization' (LiDO) for optimization of the latent variables of a multi-object scene. Instead of minimizing an error metric that compares the observed image and the render, this optimization is driven by neural networks that use the auto-context, in the form of the current scene graph and its render, to predict the LV update. Our experiments show that the LiDO method converges rapidly, as it does not need to perform a search on the error landscape, produces better solutions than error-based competitors, and is able to handle the mismatch between the data and the fitted scene model. We apply LiDO to a realistic synthetic dataset and show that the method transfers to work well with real images. The advantages of LiDO mean that it could be a critical component in the development of future vision-as-inverse-graphics systems.
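A toy version of the learn-to-update idea: instead of searching an error landscape, a learned regressor (here a plain linear map, standing in for the neural networks) predicts the latent-variable update from the difference between the observed image and the current render, and is iterated to convergence. The one-dimensional "renderer" and scene are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-3, 3, 64)

def render(p):
    """Toy 'renderer': a 1-D Gaussian bump at latent position p."""
    return np.exp(-(xs - p) ** 2)

# Train a linear stand-in for the update network: it maps the difference
# between the observed image and the current render to a latent update.
P = rng.uniform(-2, 2, size=500)           # current latent guesses
T = P + rng.uniform(-1.5, 1.5, size=500)   # corresponding true positions
X = np.stack([render(t) - render(p) for p, t in zip(P, T)])
W, *_ = np.linalg.lstsq(X, T - P, rcond=None)

# Inference: iterate predicted updates instead of searching an error landscape.
p, target = 0.0, 1.2
obs = render(target)
for _ in range(15):
    p = p + (obs - render(p)) @ W
print(round(float(p), 2))
```

Once the render matches the observation, the predicted update is zero, so the true latent is a fixed point of the iteration; no error-landscape search is ever performed.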
Automatic age progression and estimation from faces
Recently, automatic age progression has gained popularity due to its numerous applications. Among these is the search for missing people: in the UK alone, up to 300,000 people are reported missing every year. Although many algorithms have been proposed, most of the methods are affected by image noise, illumination variations, and facial expressions. Furthermore, most of the algorithms use a pattern-caricaturing approach which infers ages by manipulating the target image and a template face formed by averaging faces at the intended age. To this end, this thesis investigates the problem with a view to tackling the most prominent issues associated with the existing algorithms. Initially, using active appearance models (AAM), facial features are extracted and mapped to people's ages; afterwards, a formula is derived which allows the convenient generation of age-progressed images irrespective of whether the intended age exists in the training database or not. In order to handle image noise as well as varying facial expressions, a nonlinear appearance model called the kernel appearance model (KAM) is derived. To illustrate a real application of automatic age progression, both the AAM- and KAM-based algorithms are then used to synthesise faces of two long-missing British and Irish children, Ben Needham and Mary Boyle. However, both statistical techniques exhibit image-rendering artefacts such as low-resolution output and the generation of inconsistent skin tone. To circumvent this problem, a hybrid texture enhancement pipeline is developed. To further ensure that the progressed images preserve people's identities while at the same time attaining the intended age, rigorous human- and machine-based tests are conducted; part of these tests resulted in the development of a robust age estimation algorithm.
Eventually, the results of the rigorous assessment reveal that the hybrid technique is able to handle all existing problems of age progression with minimal error.
National Information Technology Development Agency of Nigeria (NITDA)
Inferring Human Pose and Motion from Images
As optical gesture recognition technology advances, touchless human-computer interfaces of the future will soon become a reality. One particular technology, markerless motion capture, has gained a large amount of attention, with widespread application in diverse disciplines, including medical science, sports analysis, advanced user interfaces, and virtual arts. However, the complexity of human anatomy makes markerless motion capture a non-trivial problem: I) the parameterised pose configuration exhibits high dimensionality, and II) there is considerable ambiguity in the surjective inverse mapping from observation to pose configuration spaces with a limited number of camera views. These factors together lead to multimodality in a high-dimensional space, making markerless motion capture an ill-posed problem. This study addresses these difficulties by introducing a new framework. It begins with automatically building subject-specific template models and calibrating posture at the initial stage. Subsequent tracking is accomplished by embedding nature-inspired global optimisation into the sequential Bayesian filtering framework. Tracking is enhanced by several robust evaluation improvements. Image sparsity is managed by compressive evaluation, further accelerating computation in the high-dimensional space.
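The sequential Bayesian filtering backbone can be sketched as a bootstrap particle filter on a single pose parameter; the thesis embeds global optimisation into this loop and works in a much higher-dimensional pose space, so the following is only a minimal, assumed setup with illustrative noise levels.

```python
import numpy as np

rng = np.random.default_rng(5)

def pf_step(particles, obs, sigma_dyn=0.05, sigma_obs=0.1):
    """One step of a bootstrap particle filter for a scalar pose parameter:
    random-walk prediction, Gaussian observation likelihood, resampling."""
    particles = particles + rng.normal(0, sigma_dyn, particles.shape)  # predict
    w = np.exp(-0.5 * ((obs - particles) / sigma_obs) ** 2)            # weight
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)         # resample
    return particles[idx]

# Track a slowly drifting joint angle from noisy observations.
true_pose = 0.0
particles = rng.uniform(-1.0, 1.0, size=500)
for _ in range(50):
    true_pose += 0.02
    obs = true_pose + rng.normal(0.0, 0.1)
    particles = pf_step(particles, obs)
print(round(float(particles.mean()), 1))
```

Because the posterior over pose is multimodal in the real problem, the particle set (rather than a single estimate) is what makes this family of methods attractive; replacing the blind random-walk proposal with a global optimiser is the kind of enhancement the thesis pursues.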
Patch-based models for visual object classes
This thesis concerns models for visual object classes that exhibit a reasonable amount of regularity, such as faces, pedestrians, cells, and human brains. Such models are useful for making 'within-object' inferences such as determining their individual characteristics and establishing their identity. For example, the model could be used to predict the identity of a face, the pose of a pedestrian, or the phenotype of a cell, or to segment parts of a human brain.
Existing object modelling techniques have several limitations. First, most current methods have targeted the above tasks individually using object-specific representations; therefore, they cannot be applied to other problems without major alterations. Second, most methods have been designed to work with small databases which do not contain the variations in pose, illumination, occlusion, and background clutter seen in 'real world' images. Consequently, many existing algorithms fail when tested on unconstrained databases. Finally, the complexity of the training procedure in these methods makes it impractical to use large datasets.
In this thesis, we investigate patch-based models for object classes. Our models are capable of exploiting very large databases of objects captured in uncontrolled environments. We represent the test image with a regular grid of patches from a library of images of the same object. All the domain-specific information is held in this library: we use one set of images of the object to help draw inferences about others. In each experimental chapter we investigate a different within-object inference task; in particular, we develop models for classification, regression, semantic segmentation, and identity recognition. In each task, we achieve results that are comparable to or better than the state of the art. We conclude that the patch-based representation can be successfully used for the above tasks and shows promise for other applications such as generation and localization.
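The grid-of-patches representation can be sketched as follows: each patch of the test image is explained by the most similar patch at the same grid position in a library of images, and the chosen library indices form the representation used for downstream inference. The image sizes and data here are toy assumptions, not the thesis's actual setup.

```python
import numpy as np

def match_patches(test_image, library, patch=4):
    """For each non-overlapping patch of the test image, find the most
    similar patch at the same grid position across a library of images."""
    h, w = test_image.shape
    choices = np.zeros((h // patch, w // patch), dtype=int)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            block = test_image[i:i + patch, j:j + patch]
            dists = [np.sum((img[i:i + patch, j:j + patch] - block) ** 2)
                     for img in library]
            choices[i // patch, j // patch] = int(np.argmin(dists))
    return choices

rng = np.random.default_rng(4)
library = [rng.normal(size=(8, 8)) for _ in range(5)]
# A test image stitched from patches of library images 0 and 3.
test = library[0].copy()
test[0:4, 0:4] = library[3][0:4, 0:4]
print(match_patches(test, library).tolist())
```

The grid of chosen indices is exactly the kind of compact, library-relative code on which per-patch classification, regression, or segmentation labels can then be transferred.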
Exploiting Novel Deep Learning Architecture in Character Animation Pipelines
This doctoral dissertation presents a body of work on improving different blocks of the character animation pipeline, resulting in less manual work and more realistic character animation. To that purpose, we describe a variety of cutting-edge deep learning approaches that have been applied to the field of human motion modelling and character animation.
Recent advances in motion capture systems and processing hardware have shifted the field from physics-based approaches to the data-driven approaches that are heavily used in current game production frameworks. However, despite these significant successes, there are still shortcomings to address. For example, existing production pipelines contain processing steps, such as marker labelling in the motion capture pipeline or annotating motion primitives, that must be done manually. In addition, most of the current approaches for character animation used in game production are limited by the amount of stored animation data, resulting in many duplicates and repeated patterns.
We present our work in four main chapters. First, we present a large dataset of human motion called MoVi. Secondly, we show how machine learning approaches can be used to automate the preprocessing blocks of optical motion capture pipelines. Thirdly, we show how generative models can be used to generate batches of synthetic motion sequences given only weak control signals. Finally, we show how novel generative models can be applied to real-time character control in game production.