98 research outputs found

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters onto the respective ground-truth parameters is a better indicator of subjective quality.
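
    The paper's central result is that the cost of a dynamic time warp (DTW) between synthesized and ground-truth AAM parameter trajectories tracks perceived quality better than the usual frame-wise error measures. As a rough illustration of how such a cost can be computed, here is a minimal NumPy sketch; the array shapes, the Euclidean local cost and the toy data are assumptions for illustration, not the paper's exact setup.

    ```python
    import numpy as np

    def dtw_cost(synth, truth):
        """Accumulated cost of a dynamic time warp between two AAM parameter tracks.

        synth, truth: arrays of shape (frames, aam_dims); the two tracks may differ in length.
        Local cost is the Euclidean distance between frames; returns the cost of the optimal
        alignment path, which the paper reports correlates well with subjective quality.
        """
        n, m = len(synth), len(truth)
        dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=2)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # skip a synthesized frame
                                                     acc[i, j - 1],      # skip a ground-truth frame
                                                     acc[i - 1, j - 1])  # match the two frames
        return float(acc[n, m])

    # Toy usage: a time-compressed, noisy copy of a 10-dimensional ground-truth track.
    rng = np.random.default_rng(0)
    truth = rng.normal(size=(120, 10))
    synth = truth[::2] + 0.05 * rng.normal(size=(60, 10))
    print(dtw_cost(synth, truth))
    ```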

    Modelling talking human faces

    This thesis investigates a number of new approaches for visual speech synthesis using data-driven methods to implement a talking face. The main contributions of this thesis are the following. The accuracy of a shared Gaussian process latent variable model (SGPLVM) built using the active appearance model (AAM) and relative spectral transform-perceptual linear prediction (RASTA-PLP) features is improved by employing a more accurate AAM. This is the first study to report that using a more accurate AAM improves the accuracy of the SGPLVM. Objective evaluation via reconstruction error is performed to compare the proposed approach against previously existing methods. In addition, it is shown experimentally that the accuracy of the AAM can be improved by using a larger number of landmarks and/or a larger number of samples in the training data. The second research contribution is a new method for visual speech synthesis utilising a fully Bayesian method, namely manifold relevance determination (MRD), for modelling dynamical systems through probabilistic non-linear dimensionality reduction. This is the first time MRD has been used in the context of generating talking faces from an input speech signal. The expressive power of this model lies in its ability to consider non-linear mappings between audio and visual features within a Bayesian approach. An efficient latent space has been learnt using a fully Bayesian latent representation relying on a conditional non-linear independence framework. In the SGPLVM, the structure of the latent space cannot be estimated automatically because a maximum likelihood formulation is used. In contrast to the SGPLVM, the Bayesian approaches allow the automatic determination of the dimensionality of the latent spaces. The proposed method compares favourably against several other state-of-the-art methods for visual speech generation, which is shown in quantitative and qualitative evaluation on two different datasets. Finally, the possibility of incremental learning of the AAM for inclusion in the proposed MRD approach for visual speech generation is investigated. The quantitative results demonstrate that using MRD in conjunction with incremental AAMs produces only slightly less accurate results than using batch methods. These results support a way of training this kind of model on computers with limited resources, for example in mobile computing. Overall, this thesis proposes several improvements to the current state of the art in generating talking faces from a speech signal, leading to perceptually more convincing results.
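
    The thesis's core pipeline maps acoustic features (RASTA-PLP) to AAM parameters through a shared latent space (SGPLVM, and later MRD), evaluated objectively by reconstruction error. Those models require a dedicated Gaussian-process library; purely as a simplified linear stand-in for the idea of a shared audio-visual latent space, the sketch below fits canonical correlation analysis from scikit-learn on synthetic data and reports held-out reconstruction error. It is not the thesis's method, and all dimensions and data are made up for illustration.

    ```python
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(1)

    # Illustrative stand-ins: 500 frames of 13-dim audio features and 20-dim AAM parameters.
    audio = rng.normal(size=(500, 13))
    visual = audio @ rng.normal(size=(13, 20)) + 0.1 * rng.normal(size=(500, 20))

    # Fit a shared (here: linear) latent space of dimension 8 on the training frames.
    cca = CCA(n_components=8)
    cca.fit(audio[:400], visual[:400])

    # Predict AAM parameters for unseen audio and report reconstruction error,
    # mirroring the objective evaluation described in the abstract.
    pred = cca.predict(audio[400:])
    rmse = np.sqrt(np.mean((pred - visual[400:]) ** 2))
    print(f"held-out RMSE: {rmse:.3f}")
    ```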

    TEXT-DRIVEN MOUTH ANIMATION FOR HUMAN COMPUTER INTERACTION WITH PERSONAL ASSISTANT

    Personal assistants are becoming more pervasive in our environments but still do not provide natural interactions. Their lack of realism in terms of expressiveness and their lack of visual feedback can create frustrating experiences and make users lose patience. In this sense, we propose an end-to-end trainable neural architecture for text-driven 3D mouth animations. Previous work showed that such architectures provide better realism and could open the door to integrated affective Human Computer Interfaces (HCI). Our study shows that such visual feedback significantly improves comfort for 78% of the candidates while slightly improving their time perception.
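
    The abstract describes an end-to-end trainable architecture mapping text to 3D mouth animation but gives no architectural detail. The PyTorch fragment below is only a generic, hypothetical baseline for that kind of mapping (phoneme tokens to per-token mouth blendshape weights via a recurrent network); it is not the authors' model, and every size and name is an assumption.

    ```python
    import torch
    import torch.nn as nn

    class TextToMouth(nn.Module):
        """Toy text-to-mouth-animation model: phoneme tokens -> per-token blendshape weights."""
        def __init__(self, vocab_size=50, embed_dim=64, hidden_dim=128, n_blendshapes=20):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden_dim, n_blendshapes)

        def forward(self, tokens):            # tokens: (batch, seq_len) integer phoneme ids
            h, _ = self.rnn(self.embed(tokens))
            return self.head(h)               # (batch, seq_len, n_blendshapes)

    model = TextToMouth()
    tokens = torch.randint(0, 50, (2, 12))    # two dummy phoneme sequences
    frames = model(tokens)
    print(frames.shape)                       # torch.Size([2, 12, 20])
    ```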

    Realistic and expressive talking head : implementation and evaluation

    [No abstract available]

    Perceptually Valid Facial Expressions for Character-Based Applications

    This paper addresses the problem of creating facial expressions of mixed emotions in a perceptually valid way. The research has been carried out in the context of “game-like” health and education applications aimed at studying social competency and facial expression awareness in autistic children, as well as native language learning, but the results can be applied to many other applications, such as games that need dynamic facial expressions or tools for automating the creation of facial animations. Most existing methods for creating facial expressions of mixed emotions use operations such as averaging to create the combined effect of two universal emotions. Such methods may be mathematically justifiable but are not necessarily valid from a perceptual point of view. The research reported here starts with user experiments aimed at understanding how people combine facial actions to express mixed emotions, and how viewers perceive a set of facial actions in terms of underlying emotions. Using the results of these experiments and a three-dimensional emotion model, we associate facial actions with dimensions and regions in the emotion space, and create a facial expression based on the location of the mixed emotion in the three-dimensional space. We call these regionalized facial actions “facial expression units”.
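
    The key data structure implied by the abstract is a set of “facial expression units”: groups of facial actions tied to regions of a three-dimensional emotion space, so that a mixed emotion activates the units whose regions it falls into rather than averaging two universal expressions. Below is a toy sketch of that lookup; the region centres, radius and action intensities are placeholders, since the real values come from the paper's user experiments.

    ```python
    import numpy as np

    # Hypothetical "facial expression units": named groups of facial-action intensities,
    # each tied to a region of a 3D emotion space. Placeholder numbers for illustration only.
    EXPRESSION_UNITS = {
        "smile":      {"centre": np.array([ 0.8, 0.3,  0.2]), "actions": {"AU12": 0.9, "AU6": 0.6}},
        "brow_raise": {"centre": np.array([ 0.1, 0.8,  0.1]), "actions": {"AU1": 0.8, "AU2": 0.7}},
        "frown":      {"centre": np.array([-0.7, 0.2, -0.3]), "actions": {"AU4": 0.9, "AU15": 0.5}},
    }

    def units_for(emotion_point, radius=0.9):
        """Return every expression unit whose emotion-space region contains the given point.

        Regions are modelled here as spheres around a centre; a mixed emotion that falls
        inside several regions activates several units at once instead of averaging
        two universal expressions.
        """
        active = {}
        for name, unit in EXPRESSION_UNITS.items():
            if np.linalg.norm(emotion_point - unit["centre"]) <= radius:
                active[name] = unit["actions"]
        return active

    # A mixed emotion between "happy" and "surprised" activates both smile and brow-raise units.
    print(units_for(np.array([0.45, 0.55, 0.15])))
    ```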

    DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

    In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies approach the facial animation task as a single regression problem, which often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlooks their inherent consistency. Moreover, due to the limited availability of 3D audio-visual datasets, approaches trained on small samples generalize poorly, which degrades performance. To address these issues, we propose a cross-modal dual-learning framework, termed DualTalker, aimed at improving data-usage efficiency as well as relating cross-modal dependencies. The framework is trained jointly on the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components. Our joint training framework facilitates more efficient data usage by leveraging information from both tasks and explicitly capitalizing on the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. We have made our code and video demonstrations available at https://github.com/sabrina-su/iadf.git
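
    As a rough sketch of the joint objective described above (a primary audio-to-motion task, a dual task, and an auxiliary cross-modal consistency term), the PyTorch fragment below trains two sequence encoders used jointly across tasks. It is not the DualTalker implementation (that is in the linked repository): the lip-reading dual task is reduced here to reconstructing audio features from motion, and all dimensions, layer choices and loss weights are assumptions.

    ```python
    import torch
    import torch.nn as nn

    # Illustrative module sizes; the real architecture is in the linked repository.
    AUDIO_DIM, MOTION_DIM, HIDDEN = 80, 70 * 3, 256   # e.g. mel bins, 70 vertices * xyz

    audio_enc  = nn.GRU(AUDIO_DIM,  HIDDEN, batch_first=True)
    motion_enc = nn.GRU(MOTION_DIM, HIDDEN, batch_first=True)
    motion_dec = nn.Linear(HIDDEN, MOTION_DIM)   # primary head: audio -> facial motion
    audio_dec  = nn.Linear(HIDDEN, AUDIO_DIM)    # dual head: motion -> audio (lip-reading proxy)

    params = (list(audio_enc.parameters()) + list(motion_enc.parameters())
              + list(motion_dec.parameters()) + list(audio_dec.parameters()))
    opt = torch.optim.Adam(params, lr=1e-4)

    def training_step(audio, motion, consistency_weight=0.1):
        """One joint update over the primary task, the dual task, and a consistency term."""
        h_audio, _  = audio_enc(audio)             # (batch, T, HIDDEN)
        h_motion, _ = motion_enc(motion)

        loss_primary = nn.functional.mse_loss(motion_dec(h_audio), motion)   # audio -> motion
        loss_dual    = nn.functional.mse_loss(audio_dec(h_motion), audio)    # motion -> audio
        loss_consist = nn.functional.mse_loss(h_audio, h_motion)             # cross-modal consistency

        loss = loss_primary + loss_dual + consistency_weight * loss_consist
        opt.zero_grad(); loss.backward(); opt.step()
        return float(loss)

    # Dummy batch: 4 paired sequences of 50 time-aligned frames.
    print(training_step(torch.randn(4, 50, AUDIO_DIM), torch.randn(4, 50, MOTION_DIM)))
    ```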

    Facial Modelling and animation trends in the new millennium : a survey

    M.Sc. (Computer Science). Facial modelling and animation is considered one of the most challenging areas in the animation world. Since Parke and Waters’s (1996) comprehensive book, no major work encompassing the entire field of facial animation has been published. This thesis covers Parke and Waters’s work, while also providing a survey of the developments in the field since 1996. The thesis describes, analyses, and compares (where applicable) the existing techniques and practices used to produce facial animation. Where applicable, related techniques are grouped in the same chapter and described in a chronological fashion, outlining their differences as well as their advantages and disadvantages. The thesis is concluded by exploratory work towards a talking head for Northern Sotho. Facial animation and lip synchronisation of a fragment of Northern Sotho is done using software tools primarily designed for English.

    A framework for automatic and perceptually valid facial expression generation

    Facial expressions are facial movements that reflect the internal emotional states of a character or respond to social communications. Realistic facial animation should consider at least two factors: believable visual effect and valid facial movements. However, most research tends to separate these two issues. In this paper, we present a framework for generating 3D facial expressions that considers both the visual effect and the movement dynamics. A facial expression mapping approach based on local geometry encoding is proposed, which encodes deformation in the 1-ring vector. This method is capable of mapping subtle facial movements without being constrained by shape and topology. Facial expression mapping is achieved through three steps: correspondence establishment, deviation transfer and movement mapping. Deviation is transferred to the conformal face space by minimizing an error function. This function is formed from the source neutral face and the deformed face model, related by the transformation matrices in the 1-ring neighborhood. The transformation matrix in the 1-ring neighborhood is independent of the face shape and the mesh topology. After facial expression mapping, dynamic parameters are integrated with the facial expressions to generate valid facial expressions. The dynamic parameters were generated based on psychophysical methods. The efficiency and effectiveness of the proposed methods have been tested using various face models with different shapes and topological representations.
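
    The quantity at the heart of the mapping step above is a per-vertex transformation estimated from 1-ring edge vectors, which is independent of face shape and mesh topology. The NumPy sketch below shows only that local estimation on a toy mesh; correspondence establishment, the global error-function minimization over the conformal face space, and the dynamic parameters are omitted, and all names and data are illustrative.

    ```python
    import numpy as np

    def one_ring_transforms(neutral, deformed, neighbors):
        """Estimate a 3x3 transformation per vertex from its 1-ring edge vectors.

        neutral, deformed: (V, 3) vertex positions of the source face before/after the expression.
        neighbors: list of index lists; neighbors[i] holds the 1-ring neighbour indices of vertex i.
        Solves, in the least-squares sense, E_neutral @ A_i ≈ E_deformed for each vertex,
        where the rows of E are the edge vectors from vertex i to its 1-ring neighbours.
        """
        transforms = []
        for i, ring in enumerate(neighbors):
            e_neutral  = neutral[ring]  - neutral[i]    # (k, 3) edge vectors in the neutral face
            e_deformed = deformed[ring] - deformed[i]   # (k, 3) edge vectors after the expression
            A, *_ = np.linalg.lstsq(e_neutral, e_deformed, rcond=None)
            transforms.append(A)                        # A has shape (3, 3)
        return transforms

    # Tiny toy mesh: a vertex with three 1-ring neighbours, uniformly scaled by 1.2.
    neutral   = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
    deformed  = 1.2 * neutral
    neighbors = [[1, 2, 3], [0], [0], [0]]
    print(one_ring_transforms(neutral, deformed, neighbors)[0])   # ≈ 1.2 * identity
    ```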