11,569 research outputs found
Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers
We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than for sequences generated by our synthesizers. This observation motivates further consideration of an often ignored question: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
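The dynamic-time-warp cost proposed above can be made concrete with a short sketch. The following is a minimal, illustrative implementation rather than the paper's exact measure: it assumes the synthesized and ground-truth AAM parameter trajectories are given as (frames × parameters) arrays, uses per-frame Euclidean distance, and normalises the accumulated warp cost by the combined sequence length; the function name and the normalisation are our own choices.

```python
import numpy as np

def dtw_cost(synth, truth):
    """Cost of a dynamic time warp between two AAM parameter
    trajectories, each of shape (frames, parameters).
    Illustrative sketch; the paper's exact distance and
    normalisation are not specified here."""
    T, U = len(synth), len(truth)
    # Pairwise Euclidean distances between all frame pairs.
    dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
    acc = np.full((T + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[T, U] / (T + U)  # length-normalised warp cost

# Example: compare a time-distorted trajectory against ground truth.
rng = np.random.default_rng(0)
truth = rng.normal(size=(120, 30))    # 120 frames, 30 AAM parameters
synth = truth[::2].repeat(2, axis=0)  # a temporally distorted copy
print(dtw_cost(synth, truth))
```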
Photo-realistic face synthesis and reenactment with deep generative models
The advent of Deep Learning has led to numerous breakthroughs in the field of Computer Vision. Over the last decade, a significant amount of research has been undertaken towards designing neural networks for visual data analysis. At the same time, rapid advancements have been made in deep generative modelling, especially after the introduction of Generative Adversarial Networks (GANs), which have shown particularly promising results when it comes to synthesising visual data. Since then, considerable attention has been devoted to the problem of photo-realistic human face animation due to its wide range of applications, including image and video editing, virtual assistance, social media, teleconferencing, and augmented reality. The objective of this thesis is to make progress towards generating photo-realistic videos of human faces. To that end, we propose novel generative algorithms that provide explicit control over the facial expression and head pose of synthesised subjects. Despite the major advances in face reenactment and motion transfer, current methods struggle to generate video portraits that are indistinguishable from real data. In this work, we aim to overcome the limitations of existing approaches by combining concepts from deep generative networks and video-to-video translation with 3D face modelling, and more specifically by capitalising on prior knowledge of faces that is enclosed within statistical models such as 3D Morphable Models (3DMMs). In the first part of this thesis, we introduce a person-specific system that performs full head reenactment using ideas from video-to-video translation. Subsequently, we propose a novel approach to controllable video portrait synthesis, inspired by Implicit Neural Representations (INR). In the second part of the thesis, we focus on person-agnostic methods and present a GAN-based framework that performs video portrait reconstruction, full head reenactment, expression editing, novel pose synthesis and face frontalisation.
Statistical modelling for facial expression dynamics
Facial expressions are one of the most powerful and fastest means of relaying emotions between humans. The ability to capture, understand and mimic those emotions and their underlying dynamics in a synthetic counterpart is a challenging task because of the complexity of human emotions, the different ways of conveying them, non-linearities caused by facial feature and head motion, and the ever-critical eye of the viewer. This thesis sets out to address some of the limitations of existing techniques by investigating three components of an expression modelling and parameterisation framework: (1) feature and expression manifold representation, (2) pose estimation, and (3) expression dynamics modelling and parameterisation for the purpose of driving a synthetic head avatar.
First, we introduce a hierarchical representation based on the Point Distribution Model (PDM). Holistic representations imply that non-linearities caused by the motion of facial features, and intra-feature correlations, are implicitly embedded and hence have to be accounted for in the resulting expression space. Such representations also require large training datasets to account for all possible variations. To address those shortcomings, and to provide a basis for learning more subtle, localised variations, our representation consists of a tree-like structure in which a holistic root component is decomposed into leaves containing the jaw outline, each of the eyes and eyebrows, and the mouth. Each of the hierarchical components is modelled according to its intrinsic functionality, rather than the final, holistic expression label.
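As a rough illustration of such a hierarchical representation, the sketch below fits one linear shape model per facial component alongside a holistic root model, using plain PCA over landmark coordinates as a stand-in for the thesis' PDM machinery. The 68-point layout, the component slices and the function name are hypothetical, and the shapes are assumed to be pre-aligned (e.g. by Procrustes analysis).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical component layout over a standard 68-point landmark
# annotation; the thesis' actual decomposition may differ in detail.
COMPONENTS = {
    "jaw":   slice(0, 17),
    "brows": slice(17, 27),
    "eyes":  slice(36, 48),
    "mouth": slice(48, 68),
}

def fit_hierarchical_pdm(shapes, var_kept=0.98):
    """Fit a holistic root model plus one linear shape model per
    facial component. `shapes` has shape (n_samples, 68, 2) and is
    assumed to be pre-aligned (e.g. by Procrustes analysis)."""
    flat = shapes.reshape(len(shapes), -1)
    models = {"root": PCA(n_components=var_kept, svd_solver="full").fit(flat)}
    for name, idx in COMPONENTS.items():
        part = shapes[:, idx, :].reshape(len(shapes), -1)
        models[name] = PCA(n_components=var_kept, svd_solver="full").fit(part)
    return models

# Example with synthetic stand-in data.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(200, 68, 2))
models = fit_hierarchical_pdm(shapes)
print({name: m.n_components_ for name, m in models.items()})
```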
Second, we introduce a statistical approach for capturing an underlying low-dimensional expression manifold by utilising components of the previously defined hierarchical representation. As Principal Component Analysis (PCA) based approaches cannot reliably capture variations caused by large facial feature changes because of their linear nature, the underlying dynamics manifold for each of the hierarchical components is modelled using a Hierarchical Latent Variable Model (HLVM) approach. Whilst retaining PCA properties, such a model introduces a probability density model which can deal with missing or incomplete data and allows discovery of internal within-cluster structures. All of the model parameters and the underlying density model are automatically estimated during the training stage. We investigate the usefulness of such a model on larger and unseen datasets.
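The density-model aspect can be illustrated with probabilistic PCA (Tipping and Bishop), whose per-sample log-likelihood sklearn's PCA exposes through score_samples. This is only a stand-in for the HLVM; the hierarchy and the missing-data handling described above are beyond this sketch, and the data here are synthetic placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# Probabilistic PCA as a stand-in for the HLVM's density side.
rng = np.random.default_rng(1)
train = rng.normal(size=(500, 40))   # stand-in component features
unseen = rng.normal(size=(10, 40))

ppca = PCA(n_components=10).fit(train)
# Log-likelihood of each unseen sample under the fitted density;
# low values flag implausible (or badly tracked) configurations.
print(ppca.score_samples(unseen))
```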
Third, we extend the HLVM concept to pose estimation, addressing the non-linear shape deformations caused by large head motion and defining the plausible pose space. Since the head rarely stays still, and its movements are intrinsically connected with the way we perceive and understand expressions, pose information is an integral part of their dynamics. The proposed approach integrates into our existing hierarchical representation model. It is learned from a sparse and discretely sampled training dataset, and generalises to a larger and continuous view-sphere.
Finally, we introduce a framework that models and extracts expression dynamics. In existing frameworks, explicit definition of expression intensity and pose information is often overlooked, although it is usually implicitly embedded in the underlying representation. We investigate modelling of the expression dynamics based on static information only, and focus on its sufficiency for the task at hand. We compare a rule-based method, which utilises the existing latent structure and provides a fusion of the different components, with holistic and Bayesian Network (BN) approaches. An Active Appearance Model (AAM) based tracker is used to extract relevant information from input sequences. This information is subsequently used to define the parametric structure of the underlying expression dynamics. We demonstrate that such information can be utilised to animate a synthetic head avatar.
Analysis of 3D Face Reconstruction
This thesis investigates the long-standing problem of 3D reconstruction from a single 2D face image. Face reconstruction from a single 2D face image is an ill-posed problem involving estimation of the intrinsic and extrinsic camera parameters, light parameters, shape parameters and texture parameters. The proposed approach has many potential applications in law enforcement, surveillance, medicine, computer games and the entertainment industry. The problem is addressed using an analysis-by-synthesis framework, reconstructing a 3D face model from identity photographs. Identity photographs are a widely used medium for face identification and can be found on identity cards and passports.
The novel contribution of this thesis is a new technique for creating 3D face models from a single 2D face image. The proposed method uses an improved dense 3D correspondence obtained using rigid and non-rigid registration techniques, whereas existing reconstruction methods use the optical flow method for establishing 3D correspondence. The resulting 3D face database is used to create a statistical shape model.
Existing reconstruction algorithms recover shape by optimizing over all the parameters simultaneously. The proposed algorithm simplifies the reconstruction problem by using a step-wise approach, reducing the dimension of the parameter space and simplifying the optimization problem. In the alignment step, a generic 3D face is aligned with the given 2D face image using anatomical landmarks. The texture is then warped onto the 3D model using the spatial alignment obtained previously. The 3D shape is then recovered by optimizing over the shape parameters while matching the texture-mapped model to the target image.
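As a rough illustration of the alignment step, the sketch below estimates a simple affine camera from 3D-to-2D anatomical landmark correspondences by least squares. The camera model, the function name and the landmark set are simplifying assumptions for illustration, not the thesis' actual formulation.

```python
import numpy as np

def fit_affine_camera(X3d, x2d):
    """Estimate a 2x4 affine projection P mapping homogeneous 3D
    landmarks onto their 2D image positions, by least squares.
    Simplified stand-in for the landmark-based alignment step."""
    n = len(X3d)
    A = np.hstack([X3d, np.ones((n, 1))])  # homogeneous coords, (n, 4)
    # Solve A @ P.T ~= x2d for the two rows of P.
    P, *_ = np.linalg.lstsq(A, x2d, rcond=None)
    return P.T                             # (2, 4)

# Example with five hypothetical landmark correspondences generated
# from a known camera, which the fit should recover exactly.
X3d = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
P_true = np.array([[2., 0, 0, 10], [0, 2, 0, 20]])
x2d = (P_true @ np.hstack([X3d, np.ones((5, 1))]).T).T
print(np.allclose(fit_affine_camera(X3d, x2d), P_true))  # True
```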
There are a number of advantages to this approach. First, it simplifies the optimization requirements and makes the optimization more robust. Second, there is no need to accurately recover the illumination parameters. Third, there is no need to recover the texture parameters using a texture synthesis approach. Fourth, quantitative analysis is used to improve the quality of reconstruction by improving the cost function, whereas previous methods use qualitative measures, such as visual analysis and face recognition rates, to evaluate reconstruction accuracy. The improvement in the performance of the cost function results from an improvement in the feature space comprising the landmark and intensity features. Previously, the feature space had not been evaluated with respect to reconstruction accuracy, leading to inaccurate assumptions about its behaviour.
The proposed approach simplifies the reconstruction problem by using only identity images, rather than placing effort on overcoming pose, illumination and expression (PIE) variations. This makes sense, as frontal face images under standard illumination conditions are widely available and can be utilized for accurate reconstruction. The reconstructed, textured 3D models can then be used to overcome the PIE variations.
MaskRenderer: 3D-Infused Multi-Mask Realistic Face Reenactment
We present a novel end-to-end identity-agnostic face reenactment system, MaskRenderer, that can generate realistic, high-fidelity frames in real time. Although recent face reenactment works have shown promising results, there are still significant challenges, such as identity leakage and imitating mouth movements, especially for large pose changes and occluded faces. MaskRenderer tackles these problems by using (i) a 3DMM to model 3D face structure, which better handles pose changes, occlusion and mouth movements compared to 2D representations; (ii) a triplet loss function to embed the cross-reenactment during training for better identity preservation; and (iii) multi-scale occlusion, improving inpainting and restoring missing areas. Comprehensive quantitative and qualitative experiments conducted on the VoxCeleb1 test set demonstrate that MaskRenderer outperforms state-of-the-art models on unseen faces, especially when the source and driving identities are very different.
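To illustrate point (ii), the sketch below shows a standard triplet margin loss over identity embeddings. MaskRenderer's exact loss, embedding network and margin are not reproduced here; the roles assigned to anchor, positive and negative are our assumption about how cross-reenactment could be embedded during training.

```python
import torch
import torch.nn.functional as F

def identity_triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on identity embeddings.
    Illustrative assumption: anchor = source-identity embedding,
    positive = embedding of the cross-reenacted output,
    negative = driving-identity embedding."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    # Penalise the output being closer to the driver than the source.
    return F.relu(d_pos - d_neg + margin).mean()

# Example with random 512-d embeddings for a batch of 4 faces.
a, p, n = (torch.randn(4, 512) for _ in range(3))
print(identity_triplet_loss(a, p, n))
```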
…