19 research outputs found
Synthesising visual speech using dynamic visemes and deep learning architectures
This paper proposes and compares a range of methods to improve the naturalness of visual speech synthesis. A feedforward deep neural network (DNN) and many-to-one and many-to-many recurrent neural networks (RNNs) using long short-term memory (LSTM) are considered. Rather than using acoustically derived units of speech, such as phonemes, viseme representations are considered and we propose using dynamic visemes together with a deep learning framework. The input feature representation to the models is also investigated and we determine that including wide phoneme and viseme contexts is crucial for predicting realistic lip motions that are sufficiently smooth but not under-articulated. A detailed objective evaluation across a range of system configurations shows that a combined dynamic viseme-phoneme speech unit paired with a many-to-many encoder-decoder architecture models visual co-articulation effectively. Subjective preference tests reveal no significant difference between animations produced using this system and those using ground truth facial motion taken from the original video. Furthermore, the dynamic viseme system also significantly outperforms conventional phoneme-driven speech animation systems.
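As a rough illustration of what a "wide context" input might look like, the sketch below stacks the labels of neighbouring frames around each frame. The function name `context_windows`, the window width, and the `"sil"` padding label are assumptions made for this example; the abstract does not specify how the contexts are encoded.

```python
def context_windows(labels, width, pad="sil"):
    """For each frame, return the centre label plus `width` labels on either side.

    `pad` fills positions that fall outside the utterance (hypothetical choice).
    """
    padded = [pad] * width + list(labels) + [pad] * width
    return [padded[i:i + 2 * width + 1] for i in range(len(labels))]


# Example: one label per frame, one frame of context on each side.
windows = context_windows(["a", "b", "c"], width=1)
```

A wider `width` gives the model more of the surrounding sequence per frame, at the cost of a larger input vector once the labels are encoded numerically.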
Discovering Dynamic Visemes
This thesis introduces a set of new, dynamic units of visual speech which are learnt
using computer vision and machine learning techniques. Rather than clustering
phoneme labels as is done traditionally, the visible articulators of a speaker are
tracked and automatically segmented into short, visually intuitive speech gestures
based on the dynamics of the articulators. The segmented gestures are clustered
into dynamic visemes, such that movements relating to the same visual function
appear within the same cluster. Speech animation can then be generated on any
facial model by mapping a phoneme sequence to a sequence of dynamic visemes,
and stitching together an example of each viseme in the sequence. Dynamic visemes
model coarticulation and maintain the dynamics of the original speech, so simple
blending at the concatenation boundaries ensures a smooth transition. The efficacy
of dynamic visemes for computer animation is formally evaluated both objectively
and subjectively, and compared with traditional phoneme to static lip-pose interpolation.
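The stitching step can be sketched as follows. This is a minimal illustration, assuming each viseme example is a one-dimensional trajectory (for instance a single animation parameter per frame) and that "simple blending" means a linear cross-fade over a few overlapping frames. Both are assumptions; the abstract does not state the blending function, and `blend_concatenate` is a hypothetical name.

```python
def blend_concatenate(segments, overlap):
    """Concatenate 1-D trajectories, cross-fading `overlap` frames at each join."""
    out = list(segments[0])
    for seg in segments[1:]:
        seg = list(seg)
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight given to the incoming segment
            out[start + i] = (1 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])  # frames after the blended region are copied as-is
    return out


# Example: two short gestures joined with a two-frame cross-fade.
track = blend_concatenate([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], overlap=2)
```

Because each segment already carries its own dynamics, the blend only has to smooth the boundary rather than synthesise the whole transition.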
Expressive Modulation of Neutral Visual Speech
The need for animated graphical models of the human face is commonplace in
the movies, video games and television industries, appearing in everything from
low budget advertisements and free mobile apps, to Hollywood blockbusters
costing hundreds of millions of dollars. Generative statistical models of
animation attempt to address some of the drawbacks of industry standard
practices such as labour intensity and creative inflexibility.
This work describes one such method for transforming speech animation curves
between different expressive styles. Beginning with the assumption that
expressive speech animation is a mix of two components, a high-frequency
speech component (the content) and a much lower-frequency expressive
component (the style), we use Independent Component Analysis (ICA) to
identify and manipulate these components independently of one another. Next
we learn how the energy for different speaking styles is distributed in terms of
the low-dimensional independent components model. Transforming the
speaking style involves projecting new animation curves into the low-dimensional
ICA space, redistributing the energy in the independent
components, and finally reconstructing the animation curves by inverting the
projection.
We show that a single ICA model can be used for separating multiple expressive
styles into their component parts. Subjective evaluations show that viewers can
reliably identify the expressive style generated using our approach, and that they
have difficulty in distinguishing transformed animated expressive speech from the
equivalent ground-truth.
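The project, redistribute, reconstruct pipeline described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the unmixing and mixing matrices are taken as already learned (for example by an ICA fit performed elsewhere), "energy" is treated as per-component variance, and the redistribution is a simple per-component gain. The names `matvec` and `restyle` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def restyle(frames, unmix, mix, src_energy, tgt_energy):
    """Project frames into component space, rescale each component, reconstruct.

    `unmix` and `mix` are assumed to be inverses of each other.
    `src_energy` and `tgt_energy` hold per-component variances (assumption).
    """
    gains = [(t / s) ** 0.5 for s, t in zip(src_energy, tgt_energy)]
    out = []
    for x in frames:
        comps = matvec(unmix, x)                        # project into component space
        comps = [g * c for g, c in zip(gains, comps)]   # redistribute the energy
        out.append(matvec(mix, comps))                  # invert the projection
    return out


# Example with a 2-D toy model: double the amplitude of the second component.
unmix = [[1.0, 1.0], [1.0, -1.0]]
mix = [[0.5, 0.5], [0.5, -0.5]]
styled = restyle([[3.0, 1.0]], unmix, mix, src_energy=[1.0, 1.0], tgt_energy=[1.0, 4.0])
```

When the source and target energies match, the gains are all one and the frames pass through unchanged, which is a quick sanity check on the matrices.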
Cross Modal Evaluation of High Quality Emotional Speech Synthesis with the Virtual Human Toolkit
Emotional expression is a key requirement for intelligent virtual agents. In order for an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range available by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. Results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion both in terms of valence and activation. The naturalness reported is good, 3.54 on a 5 point MOS scale.
Modelling talking human faces
This thesis investigates a number of new approaches for visual speech
synthesis using data-driven methods to implement a talking face.
The main contributions in this thesis are the following. The accuracy
of the shared Gaussian process latent variable model (SGPLVM)
built using the active appearance model (AAM) and relative spectral
transform-perceptual linear prediction (RASTA-PLP) features is improved
by employing a more accurate AAM. This is the first study
to report that using a more accurate AAM improves the accuracy of
SGPLVM. Objective evaluation via reconstruction error is performed
to compare the proposed approach against previously existing methods.
In addition, it is shown experimentally that the accuracy of AAM
can be improved by using a larger number of landmarks and/or larger
number of samples in the training data.
The second research contribution is a new method for visual speech
synthesis utilising a fully Bayesian method, namely manifold relevance
determination (MRD), for modelling dynamical systems through
probabilistic non-linear dimensionality reduction. This is the first time
MRD was used in the context of generating talking faces from the
input speech signal. The expressive power of this model is in the ability
to consider non-linear mappings between audio and visual features
within a Bayesian approach. An efficient latent space has been learnt
using a fully Bayesian latent representation relying on conditional nonlinear
independence framework. In the SGPLVM the structure of the
latent space cannot be automatically estimated because the model uses a maximum
likelihood formulation. In contrast to the SGPLVM, the Bayesian approaches
allow the automatic determination of the dimensionality of the
latent spaces. The proposed method compares favourably against several
other state-of-the-art methods for visual speech generation, which
is shown in quantitative and qualitative evaluation on two different
datasets.
Finally, the possibility of incremental learning of AAM for inclusion
in the proposed MRD approach for visual speech generation is
investigated. The quantitative results demonstrate that using MRD in
conjunction with incremental AAMs produces only slightly less accurate
results than using batch methods. These results support a way of
training this kind of model on computers with limited resources, for
example in mobile computing.
Overall, this thesis proposes several improvements to the current
state-of-the-art in generating talking faces from speech signals, leading
to perceptually more convincing results.
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition
A key requirement for developing any innovative system in a
computing environment is to integrate a sufficiently friendly
interface with the average end user. Accurate design of such a
user-centered interface, however, means more than just the
ergonomics of the panels and displays. It also requires that
designers precisely define what information to use and how, where,
and when to use it. Recent advances in user-centered design of
computing systems have suggested that multimodal integration can
provide different types and levels of intelligence to the user
interface. The work of this thesis aims at improving speech
recognition-based interfaces by making use of the visual modality
conveyed by the movements of the lips.
Designing a good visual front end is a major part of this framework.
For this purpose, this work derives the optical flow fields for
consecutive frames of people speaking. Independent Component
Analysis (ICA) is then used to derive basis flow fields. The
coefficients of these basis fields comprise the visual features of
interest. It is shown that using ICA on optical flow fields yields
better classification results than the traditional approaches based
on Principal Component Analysis (PCA). In fact, ICA can capture
higher order statistics that are needed to understand the motion of
the mouth. This is due to the fact that lip movement is complex in
its nature, as it involves large image velocities, self occlusion
(due to the appearance and disappearance of the teeth) and a lot of
non-rigidity.
Another issue that is of great interest to audio-visual speech
recognition systems designers is the integration (fusion) of the
audio and visual information into an automatic speech recognizer.
For this purpose, a reliability-driven sensor fusion scheme is
developed. A statistical approach is developed to account for the
dynamic changes in reliability. This is done in two steps. The first
step derives suitable statistical reliability measures for the
individual information streams. These measures are based on the
dispersion of the N-best hypotheses of the individual stream
classifiers. The second step finds an optimal mapping between the
reliability measures and the stream weights that maximizes the
conditional likelihood. For this purpose, genetic algorithms are
used.
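The two-step scheme above can be sketched as follows. This is only an illustration under stated assumptions: dispersion is taken as the mean gap between the best and the remaining N-best log-likelihoods, and the mapping from reliability to stream weight is a fixed sigmoid with a hypothetical `slope` parameter, standing in for the mapping that the thesis optimises with genetic algorithms. All function names are invented for this example.

```python
import math


def nbest_dispersion(log_likelihoods):
    """Mean gap between the best score and the other N-best scores (needs N >= 2)."""
    scores = sorted(log_likelihoods, reverse=True)
    best, rest = scores[0], scores[1:]
    return sum(best - s for s in rest) / len(rest)


def audio_weight(disp_audio, disp_video, slope=1.0):
    """Map the dispersion difference to an audio stream weight in (0, 1).

    The sigmoid and `slope` are assumptions standing in for a learned mapping.
    """
    return 1.0 / (1.0 + math.exp(-slope * (disp_audio - disp_video)))


def fuse(audio_ll, video_ll, weight):
    """Weighted sum of per-class log-likelihoods; returns the winning class."""
    fused = {c: weight * audio_ll[c] + (1.0 - weight) * video_ll[c] for c in audio_ll}
    return max(fused, key=fused.get)


# Example: a noisy audio stream (scores close together) and a confident video stream.
audio_ll = {"yes": -1.0, "no": -1.2}
video_ll = {"yes": -5.0, "no": -1.0}
w = audio_weight(nbest_dispersion(audio_ll.values()), nbest_dispersion(video_ll.values()))
decision = fuse(audio_ll, video_ll, w)
```

A stream whose N-best scores are tightly bunched is treated as less reliable, so its weight shrinks and the other stream dominates the fused decision.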
The addressed issues are challenging problems and are substantial
for developing an audio-visual speech recognition framework that can
maximize the information gathered about the words uttered and minimize
the impact of noise.