
    Model-based synthesis of visual speech movements from 3D video

    In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system and split into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and used to animate visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the stored phonetic units most similar to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection, we improve the quality of the synthesised speech.
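    A minimal sketch of the unit-selection step described above, assuming stored units are represented by fixed-length audio feature vectors; the feature dimensions, names and Euclidean distance here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def select_similar_units(target_feats, unit_feats, k=5):
    """Return indices of the k stored phonetic units whose audio
    features are closest (Euclidean) to the target utterance.

    target_feats: (d,) feature vector for the target phonetic unit
    unit_feats:   (n, d) features of the n stored units
    """
    dists = np.linalg.norm(unit_feats - target_feats, axis=1)
    return np.argsort(dists)[:k]

# Illustrative usage: 100 stored units with 13-dim (e.g. MFCC-like) features.
rng = np.random.default_rng(0)
units = rng.normal(size=(100, 13))
target = rng.normal(size=13)
print(select_similar_units(target, units))
```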

    Synthesising visual speech using dynamic visemes and deep learning architectures

    This paper proposes and compares a range of methods to improve the naturalness of visual speech synthesis. A feedforward deep neural network (DNN) and many-to-one and many-to-many recurrent neural networks (RNNs) using long short-term memory (LSTM) are considered. Rather than using acoustically derived units of speech, such as phonemes, viseme representations are considered, and we propose using dynamic visemes together with a deep learning framework. The input feature representation to the models is also investigated, and we determine that including wide phoneme and viseme contexts is crucial for predicting realistic lip motions that are sufficiently smooth but not under-articulated. A detailed objective evaluation across a range of system configurations shows that a combined dynamic viseme-phoneme speech unit, together with a many-to-many encoder-decoder architecture, models visual co-articulation effectively. Subjective preference tests reveal no significant difference between animations produced using this system and ground truth facial motion taken from the original video. Furthermore, the dynamic viseme system significantly outperforms conventional phoneme-driven speech animation systems.
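    A minimal PyTorch sketch of a many-to-many encoder-decoder of the kind compared above; the layer sizes and feature dimensions are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Seq2SeqLipModel(nn.Module):
    """Encoder-decoder LSTM mapping a linguistic feature sequence
    (e.g. phoneme/viseme context vectors) to lip motion parameters."""
    def __init__(self, in_dim=60, hid_dim=128, out_dim=30):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, out_dim)

    def forward(self, x):
        enc_out, state = self.encoder(x)            # summarise the input context
        dec_out, _ = self.decoder(enc_out, state)   # generate frame by frame
        return self.out(dec_out)                    # one lip-parameter vector per frame

# Illustrative usage: batch of 2 utterances, 50 frames, 60-dim input features.
model = Seq2SeqLipModel()
x = torch.randn(2, 50, 60)
print(model(x).shape)  # torch.Size([2, 50, 30])
```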

    Discovering Dynamic Visemes

    This thesis introduces a set of new, dynamic units of visual speech which are learnt using computer vision and machine learning techniques. Rather than clustering phoneme labels as is done traditionally, the visible articulators of a speaker are tracked and automatically segmented into short, visually intuitive speech gestures based on the dynamics of the articulators. The segmented gestures are clustered into dynamic visemes, such that movements relating to the same visual function appear within the same cluster. Speech animation can then be generated on any facial model by mapping a phoneme sequence to a sequence of dynamic visemes and stitching together an example of each viseme in the sequence. Dynamic visemes model coarticulation and maintain the dynamics of the original speech, so simple blending at the concatenation boundaries ensures a smooth transition. The efficacy of dynamic visemes for computer animation is formally evaluated both objectively and subjectively, and compared with traditional phoneme-to-static-lip-pose interpolation.
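    A toy sketch of the segment-then-cluster idea, assuming a 1-D articulator trajectory; segmentation at local speed minima and fixed-length resampling before k-means are illustrative simplifications of the dynamics-based method described above:

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_gestures(traj, min_len=5):
    """Split a 1-D articulator trajectory at local speed minima,
    a crude stand-in for dynamics-based gesture boundaries."""
    speed = np.abs(np.gradient(traj))
    bounds = [0]
    for t in range(1, len(speed) - 1):
        if speed[t] <= speed[t-1] and speed[t] <= speed[t+1] and t - bounds[-1] >= min_len:
            bounds.append(t)
    bounds.append(len(traj))
    return [traj[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

def cluster_gestures(gestures, n_visemes=8, resample_len=20):
    """Resample variable-length gestures to a fixed length, then
    k-means them into 'dynamic viseme' classes."""
    X = np.stack([np.interp(np.linspace(0, 1, resample_len),
                            np.linspace(0, 1, len(g)), g) for g in gestures])
    return KMeans(n_clusters=n_visemes, n_init=10).fit_predict(X)

# Illustrative usage on a synthetic lip-opening signal.
t = np.linspace(0, 10, 500)
traj = np.sin(2 * t) + 0.3 * np.sin(7 * t)
gestures = segment_gestures(traj)
print(cluster_gestures(gestures, n_visemes=min(8, len(gestures))))
```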

    Articulatory features for robust visual speech recognition

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 99-105). By Ekaterina Saenko.
    This thesis explores a novel approach to visual speech modeling. Visual speech, or a sequence of images of the speaker's face, is traditionally viewed as a single stream of contiguous units, each corresponding to a phonetic segment. These units are defined heuristically by mapping several visually similar phonemes to one visual phoneme, sometimes referred to as a viseme. However, experimental evidence shows that phonetic models trained from visual data are not synchronous in time with acoustic phonetic models, indicating that visemes may not be the most natural building blocks of visual speech. Instead, we propose to model the visual signal in terms of the underlying articulatory features. This approach is a natural extension of feature-based modeling of acoustic speech, which has been shown to increase the robustness of audio-based speech recognition systems. We start by exploring ways of defining visual articulatory features: first in a data-driven manner, using a large, multi-speaker visual speech corpus, and then in a knowledge-driven manner, using the rules of speech production. Based on these studies, we propose a set of articulatory features and describe a computational framework for feature-based visual speech recognition. Multiple feature streams are detected in the input image sequence using Support Vector Machines and then incorporated into a Dynamic Bayesian Network to obtain the final word hypothesis. Preliminary experiments show that our approach increases viseme classification rates in visually noisy conditions and improves visual word recognition through feature-based context modeling.
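    A schematic sketch of the per-feature SVM detection stage only, with invented feature names and synthetic data standing in for visual features; the downstream Dynamic Bayesian Network combination is beyond a short example:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical articulatory feature streams detected from mouth-region images.
FEATURES = ["lip_opening", "lip_rounding", "labiodental"]

def train_detectors(X, labels):
    """Train one binary SVM per articulatory feature.
    X: (n_frames, d) image features; labels: dict feature -> (n_frames,) 0/1."""
    return {f: SVC(probability=True).fit(X, labels[f]) for f in FEATURES}

def detect(detectors, X):
    """Per-frame posterior for each feature stream; in the thesis these
    streams feed a Dynamic Bayesian Network to produce word hypotheses."""
    return {f: det.predict_proba(X)[:, 1] for f, det in detectors.items()}

# Illustrative usage with random data in place of real visual features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
labels = {f: (rng.random(200) > 0.5).astype(int) for f in FEATURES}
dets = train_detectors(X, labels)
print({f: p[:3].round(2) for f, p in detect(dets, X).items()})
```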

    Visual Speech Synthesis using Dynamic Visemes and Deep Learning Architectures

    The aim of this work is to improve the naturalness of visual speech synthesis produced automatically from a linguistic input over existing methods. Firstly, and most importantly, we investigate the most suitable speech units for visual speech synthesis. We propose the use of dynamic visemes instead of phonemes or static visemes and find that dynamic visemes generate better visual speech than either phoneme or static viseme units; the best performance is obtained by a combined phoneme-dynamic viseme system. Secondly, we examine the most appropriate model among hidden Markov models (HMMs) and different deep learning models that include feedforward and recurrent structures with one-to-one, many-to-one and many-to-many architectures. Results suggest that frame-by-frame synthesis with a deep learning approach outperforms state-based synthesis with HMM approaches, and that an encoder-decoder many-to-many architecture is better than the one-to-one and many-to-one architectures. Thirdly, we explore the importance of contextual features that include information at varying linguistic levels, from the frame level up to the utterance level. We find that frame-level information is the most valuable feature, as it avoids discontinuities in the visual feature sequence and produces a smooth and realistic animation output. Fourthly, we find that the two most common objective measures, correlation and root mean square error, are not able to indicate the realism and naturalness of human-perceived quality. We introduce an alternative objective measure and show that global variance is a better indicator of human perception of quality. Finally, we propose a novel method to convert a given text input and phoneme transcription into a dynamic viseme transcription when a reference dynamic viseme sequence is not available. Subjective preference tests confirmed that our proposed method is able to produce animations that are statistically indistinguishable from animations produced using reference data.
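    A minimal sketch of the global-variance idea mentioned above: compare the per-dimension variance of synthesised trajectories with that of ground truth, since over-smoothed (under-articulated) output shows reduced variance. The ratio form is an assumption for illustration, not the thesis' exact measure:

```python
import numpy as np

def global_variance_ratio(pred, ref):
    """Ratio of per-dimension global variance of predicted visual
    features to that of the reference; values well below 1 indicate
    over-smoothed, under-articulated motion.

    pred, ref: (n_frames, d) visual feature trajectories.
    """
    return pred.var(axis=0) / ref.var(axis=0)

# Illustrative usage: an over-smoothed prediction has lower variance.
rng = np.random.default_rng(2)
ref = rng.normal(size=(300, 4))
pred = 0.5 * ref + 0.1 * rng.normal(size=(300, 4))  # damped copy
print(global_variance_ratio(pred, ref).round(2))    # ratios well below 1
```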

    A Mouth Full of Words: Visually Consistent Acoustic Redubbing

    This paper introduces a method for the automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, one-to-many, static visemes lack flexibility for this application, as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech and offers insight into automatic speech recognition and the importance of language modeling.
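    A toy sketch of expanding a dynamic viseme sequence into word candidates via a many-to-many viseme-to-phoneme mapping and a pronunciation dictionary; the mapping and dictionary entries here are invented for illustration, and a language model would further rank the candidates:

```python
from itertools import product

# Invented many-to-many mapping: each dynamic viseme admits several
# phoneme realisations.
VISEME_TO_PHONES = {
    "DV1": ["p", "b", "m"],
    "DV2": ["a", "o"],
    "DV3": ["t", "d", "n"],
}

# Tiny invented pronunciation dictionary: phoneme string -> word.
PRON_DICT = {"pat": "pat", "bat": "bat", "mat": "mat", "pod": "pod", "bad": "bad"}

def redub_candidates(viseme_seq):
    """Enumerate phoneme sequences consistent with the viseme track,
    keeping only those that form dictionary words."""
    options = [VISEME_TO_PHONES[v] for v in viseme_seq]
    words = []
    for phones in product(*options):
        word = PRON_DICT.get("".join(phones))
        if word:
            words.append(word)
    return words

# Every candidate stays in sync with the original lip movements.
print(redub_candidates(["DV1", "DV2", "DV3"]))  # ['pat', 'pod', 'bat', 'bad', 'mat']
```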

    The Effect of Speaking Rate on Audio and Visual Speech

    The speed at which an utterance is spoken affects both the duration of the speech and the positions of the articulators. Consequently, the sounds that are produced are modified, as are the position and appearance of the lips, teeth, tongue and other visible articulators. We describe an experiment designed to measure the effect of variable speaking rate on audio and visual speech by comparing sequences of phonemes and dynamic visemes appearing in the same sentences spoken at different speeds. We find that both audio and visual speech production are affected by varying the rate of speech; however, the effect is significantly more prominent in visual speech.
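    A small sketch of the kind of sequence comparison described above, summarising edit operations between unit sequences from slow and fast renditions of the same sentence; the example sequences are invented:

```python
from difflib import SequenceMatcher

def rate_change_ops(slow_units, fast_units):
    """Summarise how a unit sequence changes with speaking rate as
    counts of substituted, deleted and inserted units."""
    ops = {"replace": 0, "delete": 0, "insert": 0}
    sm = SequenceMatcher(a=slow_units, b=fast_units)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":
            ops["replace"] += max(i2 - i1, j2 - j1)
        elif tag == "delete":
            ops["delete"] += i2 - i1
        elif tag == "insert":
            ops["insert"] += j2 - j1
    return ops

# Invented phoneme sequences for the same sentence at two rates.
slow = ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]
fast = ["dh", "ax", "k", "ae", "s", "ae", "t"]   # one /t/ elided when fast
print(rate_change_ops(slow, fast))  # {'replace': 0, 'delete': 1, 'insert': 0}
```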