1,032 research outputs found

    Artimate: an articulatory animation framework for audiovisual speech synthesis

    Get PDF
    We present a modular framework for articulatory animation synthesis using speech motion capture data obtained with electromagnetic articulography (EMA). Adapting a skeletal animation approach, we apply the articulatory motion data to a three-dimensional (3D) model of the vocal tract, creating a portable resource that can be integrated into an audiovisual (AV) speech synthesis platform to provide realistic animation of the tongue and teeth for a virtual character. The framework also provides an interface to articulatory animation synthesis, as well as an example application illustrating its use with a 3D game engine. We rely on cross-platform, open-source software and open standards to provide a lightweight, accessible, and portable workflow. Comment: Workshop on Innovation and Applications in Speech Technology (2012).
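
    A minimal sketch of the retargeting step described above, assuming a simple offset-based mapping from EMA coil trajectories to bone keyframes; the coil and bone names are hypothetical, and the actual Artimate framework may proceed differently.

    ```python
    # Hypothetical sketch of retargeting EMA coil trajectories onto tongue bones,
    # in the spirit of the skeletal-animation approach described above (not the
    # actual Artimate code; bone names and the rest-pose mapping are assumptions).
    import numpy as np

    # EMA data: one (x, y, z) trajectory per coil, sampled at a fixed rate.
    ema = {
        "tongue_tip":  np.random.rand(100, 3),   # placeholder trajectories
        "tongue_body": np.random.rand(100, 3),
        "tongue_back": np.random.rand(100, 3),
    }

    # Rest-pose positions of the corresponding bones in the 3D vocal-tract model.
    rest_pose = {
        "tongue_tip":  np.array([0.0, 0.0, 0.0]),
        "tongue_body": np.array([-1.0, 0.0, 0.2]),
        "tongue_back": np.array([-2.0, 0.0, 0.4]),
    }

    def retarget(ema, rest_pose):
        """Convert coil positions into per-frame bone keyframes."""
        # Align the first EMA frame with the rest pose, then express every
        # subsequent frame as a position the animation engine can apply to
        # the matching bone.
        keyframes = {}
        for bone, traj in ema.items():
            offset = rest_pose[bone] - traj[0]
            keyframes[bone] = traj + offset
        return keyframes

    keyframes = retarget(ema, rest_pose)
    print(keyframes["tongue_tip"].shape)  # (100, 3) bone positions per frame
    ```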

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    Get PDF
    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that for sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp between synthesized visual speech parameters and the respective ground-truth parameters is a better indicator of subjective quality.
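
    The proposed objective measure lends itself to a short illustration. The sketch below computes a length-normalized dynamic-time-warp cost between synthesized and ground-truth AAM parameter trajectories; the frame-wise Euclidean local cost and the normalization are assumptions, not necessarily the exact formulation used in the paper.

    ```python
    # Sketch of a DTW-based objective measure between two parameter trajectories.
    import numpy as np

    def dtw_cost(synth, truth):
        """Return the accumulated DTW cost between two (frames x params) arrays."""
        n, m = len(synth), len(truth)
        dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                     acc[i, j - 1],      # deletion
                                                     acc[i - 1, j - 1])  # match
        return acc[n, m] / (n + m)   # length-normalized warp cost

    # Example: compare a synthesized trajectory against its ground truth.
    rng = np.random.default_rng(0)
    truth = rng.normal(size=(120, 10))                      # 120 frames, 10 AAM parameters
    synth = truth[::2] + 0.05 * rng.normal(size=(60, 10))   # slower, noisy copy
    print(dtw_cost(synth, truth))
    ```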

    Visual Speech Synthesis by Morphing Visemes

    Get PDF
    We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject, which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
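
    The morphing step can be illustrated with a small sketch that blends two viseme images along a precomputed flow field; bilinear backward warping and linear cross-dissolve are assumptions here, and the optical-flow estimation itself is omitted, so this is not the exact scheme used in MikeTalk.

    ```python
    # Generate an intermediate frame between two grayscale viseme images by
    # sampling each image partway along a given flow correspondence and blending.
    import numpy as np
    from scipy.ndimage import map_coordinates

    def morph(viseme_a, viseme_b, flow_ab, t):
        """Blend between two images at morph parameter t in [0, 1].

        flow_ab[..., 0] / flow_ab[..., 1] hold the row/column displacement that
        maps each pixel of viseme_a onto its corresponding pixel in viseme_b.
        """
        rows, cols = np.mgrid[0:viseme_a.shape[0], 0:viseme_a.shape[1]].astype(float)
        # Sample A partway along the correspondence, and B from the remainder.
        a_warp = map_coordinates(viseme_a, [rows + t * flow_ab[..., 0],
                                            cols + t * flow_ab[..., 1]], order=1)
        b_warp = map_coordinates(viseme_b, [rows - (1 - t) * flow_ab[..., 0],
                                            cols - (1 - t) * flow_ab[..., 1]], order=1)
        return (1 - t) * a_warp + t * b_warp

    # A viseme transition is a sequence of such frames for increasing t.
    a, b = np.zeros((64, 64)), np.ones((64, 64))
    flow = np.zeros((64, 64, 2))
    frames = [morph(a, b, flow, t) for t in np.linspace(0, 1, 10)]
    print(len(frames), frames[0].shape)
    ```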

    A FACIAL ANIMATION FRAMEWORK WITH EMOTIVE/EXPRESSIVE CAPABILITIES

    Get PDF
    LUCIA is an MPEG-4 facial animation system developed at ISTC-CNR. It works on standard Facial Animation Parameters and speaks with the Italian version of the FESTIVAL TTS. To achieve an emotive/expressive talking head, LUCIA was built from real human data physically extracted by the ELITE optotracking movement analyzer. LUCIA can copy a real human by reproducing the movements of passive markers positioned on the speaker's face and recorded by the ELITE device, or it can be driven by an emotional XML-tagged input text, thus realizing a true audio/visual emotive/expressive synthesis. Synchronization between visual and audio data is very important in order to create the correct WAV and FAP files needed for the animation. LUCIA's voice is based on the ISTC Italian version of the FESTIVAL-MBROLA packages, modified by means of an appropriate APML/VSML tagged language. LUCIA is available in two different versions: an open source framework and the "work in progress" WebGL.

    Beckett the Spiritist: Breath and its Media Drama

    Get PDF
    Most of the critical attention devoted to Breath has been focused on its adaptations and the affinities between its theatrical realization and the visual arts. Tracing Beckett’s ambivalent attitude towards the staging of the play, this article offers a closer analysis of Breath as a textual artefact. It discusses various published and unpublished versions of the script and their relation to the sketch’s infamous ‘appropriation’ in its first production as part of the revue Oh! Calcutta!, in an attempt to reconstruct three episodes of a media drama that unfolds in and around the play.

    LUCIA: An open source 3D expressive avatar for multimodal h.m.i.

    Get PDF
    LUCIA is an MPEG-4 facial animation system developed at ISTC-CNR. It works on standard Facial Animation Parameters and speaks with the Italian version of the FESTIVAL TTS. To achieve an emotive/expressive talking head, LUCIA was built from real human data physically extracted by the ELITE optotracking movement analyzer. LUCIA can copy a real human by reproducing the movements of passive markers positioned on the speaker's face and recorded by the ELITE device, or it can be driven by an emotional XML-tagged input text, thus realizing a true audio/visual emotive/expressive synthesis. Synchronization between visual and audio data is very important in order to create the correct WAV and FAP files needed for the animation. LUCIA's voice is based on the ISTC Italian version of the FESTIVAL-MBROLA packages, modified by means of an appropriate APML/VSML tagged language. LUCIA is available in two different versions: an open source framework and the "work in progress" WebGL.
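
    As a rough illustration of the WAV/FAP synchronization mentioned in both LUCIA entries, the sketch below derives the number of animation frames from the audio duration; the 25 fps frame rate and the tab-separated output layout are assumptions, not LUCIA's actual file format.

    ```python
    # Hedged sketch: keep facial-animation-parameter (FAP) frames aligned with
    # the synthesized waveform by computing the frame count from audio duration.
    import wave

    FAP_FPS = 25  # assumed animation frame rate

    def frames_for_audio(wav_path, fps=FAP_FPS):
        """Return how many FAP frames are needed to cover the WAV file."""
        with wave.open(wav_path, "rb") as w:
            duration = w.getnframes() / float(w.getframerate())
        return round(duration * fps)

    def write_fap(path, fap_frames):
        """Write one line of FAP values per animation frame (illustrative layout)."""
        with open(path, "w") as f:
            for i, values in enumerate(fap_frames):
                f.write("\t".join([str(i)] + [f"{v:.3f}" for v in values]) + "\n")

    # Usage (hypothetical files):
    # n = frames_for_audio("utterance.wav")
    # write_fap("utterance.fap", fap_frames)
    ```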

    Predicting Hungarian sound durations for continuous speech

    Get PDF
    Direct measurements show that a number of factors influence the final value of sound durations in continuous speech. On the segmental level, it is mainly the articulatory movements that determine the important influencing factors, while on the suprasegmental level, accent, syllabic stress, within-word position, the preceding and following syllables, and finally utterance position may influence the final sound durations. The problem of how to predict sound durations can thus be described by a multivariable function in which the effect of the individual variables cannot easily be defined with good accuracy. It is difficult to separate the effects of the individual factors, i.e., it is difficult to model this function by making direct measurements on the speech signal. A model has been constructed and realized in which three well-defined levels work separately. In the first one (the segmental level), the separation of the effect of articulation from the other factors is solved. The second and third levels relate to the suprasegmental level of speech.
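
    The multivariable duration function can be pictured as a base segmental duration modified by suprasegmental factors. The sketch below uses a simple multiplicative combination with placeholder values; it is not the Hungarian model or data reported in the paper.

    ```python
    # Schematic two-level duration predictor: an inherent (segmental) duration
    # adjusted by suprasegmental correction factors. All values are hypothetical.
    INHERENT_MS = {"a": 110.0, "t": 70.0, "s": 95.0}   # segmental level (assumed)

    SUPRASEGMENTAL = {            # multiplicative corrections (assumed)
        "stressed": 1.15,
        "word_final": 1.10,
        "utterance_final": 1.30,
    }

    def predict_duration(phone, stressed=False, word_final=False, utterance_final=False):
        """Combine the segmental base value with the active suprasegmental factors."""
        d = INHERENT_MS[phone]
        if stressed:
            d *= SUPRASEGMENTAL["stressed"]
        if word_final:
            d *= SUPRASEGMENTAL["word_final"]
        if utterance_final:
            d *= SUPRASEGMENTAL["utterance_final"]
        return d

    print(predict_duration("a", stressed=True, utterance_final=True))  # ~164 ms
    ```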

    Text-based Editing of Talking-head Video

    No full text
    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript; an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
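
    A toy sketch of the segment-selection idea: given the phoneme sequence of an edited word, search the annotated corpus for a matching span whose per-frame parameters can be reused. The paper describes an optimization strategy; this exact-match search is only illustrative, not the authors' algorithm.

    ```python
    # Find a stretch of the annotated input corpus whose phoneme labels match the
    # edited text, so its per-frame face parameters can serve as base material.
    from typing import List, Optional, Tuple

    def find_segment(corpus_phones: List[str], target: List[str]) -> Optional[Tuple[int, int]]:
        """Return (start, end) indices of the first corpus span matching the target."""
        n, m = len(corpus_phones), len(target)
        for start in range(n - m + 1):
            if corpus_phones[start:start + m] == target:
                return start, start + m
        return None  # in practice, fall back to shorter sub-sequences or blending

    corpus = ["sil", "h", "eh", "l", "ow", "w", "er", "l", "d", "sil"]
    print(find_segment(corpus, ["w", "er", "l", "d"]))  # (5, 9)
    ```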

    GMM Mapping Of Visual Features of Cued Speech From Speech Spectral Features

    No full text
    In this paper, we present a statistical method based on GMM modeling to map acoustic speech spectral features to the visual features of Cued Speech under the Minimum Mean-Square Error (MMSE) regression criterion at a low signal level, which is innovative and differs from the classic text-to-visual approach. Two training methods for the GMM, namely the Expectation-Maximization (EM) approach and a supervised training method, are discussed. For comparison with the GMM-based mapping model, we first present results obtained with a Multiple Linear Regression (MLR) model, also at the low signal level, and study the limitations of that approach. The experimental results demonstrate that the GMM-based mapping method can significantly improve the mapping performance compared with the MLR model, especially when the linear correlation between the target and the predictor is weak, as is the case for the hand positions of Cued Speech and the acoustic speech spectral features.
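
    The GMM-based MMSE mapping can be written as a mixture of per-component linear regressions, E[y|x] = sum_k p(k|x) (mu_y^k + Sigma_yx^k (Sigma_xx^k)^{-1} (x - mu_x^k)). The sketch below implements that conditional expectation with placeholder (untrained) parameters; the full covariances and toy dimensions are assumptions, not the Cued Speech setup of the paper.

    ```python
    # GMM-based MMSE regression from acoustic to visual features, given a joint
    # GMM over the stacked [acoustic; visual] vector. Parameters are placeholders.
    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_mmse_map(x, weights, means, covs, dx):
        """Map an acoustic vector x (dim dx) to visual features via MMSE regression."""
        resp = np.array([w * multivariate_normal.pdf(x, m[:dx], C[:dx, :dx])
                         for w, m, C in zip(weights, means, covs)])
        resp /= resp.sum()                      # posterior p(k | x)
        y = np.zeros(means[0].shape[0] - dx)
        for r, m, C in zip(resp, means, covs):
            cond = m[dx:] + C[dx:, :dx] @ np.linalg.solve(C[:dx, :dx], x - m[:dx])
            y += r * cond                       # mixture of conditional means
        return y

    # Toy example: 2 components, 3 acoustic dims, 2 visual dims.
    rng = np.random.default_rng(1)
    dx, dy, K = 3, 2, 2
    means = rng.normal(size=(K, dx + dy))
    covs = np.array([np.eye(dx + dy) for _ in range(K)])
    print(gmm_mmse_map(rng.normal(size=dx), np.array([0.5, 0.5]), means, covs, dx))
    ```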

    Estoñol, a computer-assisted pronunciation training tool for Spanish L1 speakers to improve the pronunciation and perception of Estonian vowels

    Get PDF
    Over the past few years the number of online language teaching materials for non-native speakers of Estonian has increased. However, they focus mainly on vocabulary and pay little attention to pronunciation. In this study we introduce a computer-assisted pronunciation training tool, Estoñol, developed to help native speakers of Spanish train their perception and production of Estonian vowels. The tool's training program involves seven vowel contrasts, /i-y/, /u-y/, /ɑ-o/, /ɑ-æ/, /e-æ/, /o-ø/, and /o-ɤ/, which have proven to be difficult for native speakers of Spanish. The training activities include theoretical videos and four training modes (exposure, discrimination, pronunciation, and mixed) in every lesson. The tool is integrated into a pre/post-test design experiment with native speakers of Spanish and Estonian to assess the language learners' perception and production improvement. It is expected that the tool will have a positive effect on the results, as has been shown in previous studies using similar methodology. Kokkuvõte (Summary). Katrin Leppik and Cristian Tejedor-García: Estoñol, a mobile application for Spanish-speaking learners of Estonian to train the pronunciation and perception of vowels. Several e-courses and mobile applications have been created for learning Estonian, but they concentrate mainly on teaching vocabulary and grammar and pay very little attention to pronunciation. To make acquiring Estonian pronunciation easier, the mobile application Estoñol was developed for learners of Estonian whose native language is Spanish. Previous studies have shown that the pronunciation of the vowels /ɑ, y, ø, æ, ɤ/ is difficult for Spanish-speaking learners of Estonian. The content of the application is divided into seven chapters, in which the perception and pronunciation of the vowel pairs /i-y/, /u-y/, /ɑ-o/, /ɑ-æ/, /e-æ/, /o-ø/, /o-ɤ/ can be practised. Each chapter begins with a theoretical video, followed by perception and pronunciation exercises. An experiment is planned to assess the application's effect on learners' pronunciation and perception. Keywords: CAPT, Estonian, Spanish, L2, pronunciation, perception, vowels, Estoñol.
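
    Purely as an illustration of the lesson layout described above (seven vowel contrasts, a theory video, and four training modes per lesson), here is a hypothetical data structure; the field names are assumptions, and only the contrasts and mode names come from the abstract.

    ```python
    # Illustrative lesson layout for a CAPT tool of this kind (not Estoñol's code).
    VOWEL_CONTRASTS = ["i-y", "u-y", "ɑ-o", "ɑ-æ", "e-æ", "o-ø", "o-ɤ"]
    TRAINING_MODES = ["exposure", "discrimination", "pronunciation", "mixed"]

    lessons = [
        {"contrast": c, "video": f"theory_{i + 1}", "modes": list(TRAINING_MODES)}
        for i, c in enumerate(VOWEL_CONTRASTS)
    ]
    print(len(lessons), lessons[0]["contrast"])  # 7 lessons, the first trains /i-y/
    ```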