1,032 research outputs found

    Artimate: an articulatory animation framework for audiovisual speech synthesis

    Get PDF
    We present a modular framework for articulatory animation synthesis using speech motion capture data obtained with electromagnetic articulography (EMA). Adapting a skeletal animation approach, we apply the articulatory motion data to a three-dimensional (3D) model of the vocal tract, creating a portable resource that can be integrated into an audiovisual (AV) speech synthesis platform to provide realistic animation of the tongue and teeth for a virtual character. The framework also provides an interface to articulatory animation synthesis, as well as an example application illustrating its use with a 3D game engine. We rely on cross-platform, open-source software and open standards to provide a lightweight, accessible, and portable workflow. Comment: Workshop on Innovation and Applications in Speech Technology (2012).
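
    A minimal sketch of the retargeting step described above, assuming a simple offset-based mapping from EMA coil trajectories to bone keyframes; the coil and bone names are hypothetical, and the actual Artimate framework may proceed differently.

    ```python
    # Hypothetical sketch of retargeting EMA coil trajectories onto tongue bones,
    # in the spirit of the skeletal-animation approach described above (not the
    # actual Artimate code; bone names and the rest-pose mapping are assumptions).
    import numpy as np

    # EMA data: one (x, y, z) trajectory per coil, sampled at a fixed rate.
    ema = {
        "tongue_tip":  np.random.rand(100, 3),   # placeholder trajectories
        "tongue_body": np.random.rand(100, 3),
        "tongue_back": np.random.rand(100, 3),
    }

    # Rest-pose positions of the corresponding bones in the 3D vocal-tract model.
    rest_pose = {
        "tongue_tip":  np.array([0.0, 0.0, 0.0]),
        "tongue_body": np.array([-1.0, 0.0, 0.2]),
        "tongue_back": np.array([-2.0, 0.0, 0.4]),
    }

    def retarget(ema, rest_pose):
        """Convert coil positions into per-frame bone keyframes."""
        # Align the first EMA frame with the rest pose, then express every
        # subsequent frame as a position the animation engine can apply to
        # the matching bone.
        keyframes = {}
        for bone, traj in ema.items():
            offset = rest_pose[bone] - traj[0]
            keyframes[bone] = traj + offset
        return keyframes

    keyframes = retarget(ema, rest_pose)
    print(keyframes["tongue_tip"].shape)  # (100, 3) bone positions per frame
    ```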

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    Get PDF
    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that for sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp between synthesized visual speech parameters and the respective ground-truth parameters is a better indicator of subjective quality.
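
    The proposed objective measure lends itself to a short illustration. The sketch below computes a length-normalized dynamic-time-warp cost between synthesized and ground-truth AAM parameter trajectories; the frame-wise Euclidean local cost and the normalization are assumptions, not necessarily the exact formulation used in the paper.

    ```python
    # Sketch of a DTW-based objective measure between two parameter trajectories.
    import numpy as np

    def dtw_cost(synth, truth):
        """Return the accumulated DTW cost between two (frames x params) arrays."""
        n, m = len(synth), len(truth)
        dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                     acc[i, j - 1],      # deletion
                                                     acc[i - 1, j - 1])  # match
        return acc[n, m] / (n + m)   # length-normalized warp cost

    # Example: compare a synthesized trajectory against its ground truth.
    rng = np.random.default_rng(0)
    truth = rng.normal(size=(120, 10))                      # 120 frames, 10 AAM parameters
    synth = truth[::2] + 0.05 * rng.normal(size=(60, 10))   # slower, noisy copy
    print(dtw_cost(synth, truth))
    ```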

    Visual Speech Synthesis by Morphing Visemes

    Get PDF
    We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject, which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
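
    The morphing step can be illustrated with a small sketch that blends two viseme images along a precomputed flow field; bilinear backward warping and linear cross-dissolve are assumptions here, and the optical-flow estimation itself is omitted, so this is not the exact scheme used in MikeTalk.

    ```python
    # Generate an intermediate frame between two grayscale viseme images by
    # sampling each image partway along a given flow correspondence and blending.
    import numpy as np
    from scipy.ndimage import map_coordinates

    def morph(viseme_a, viseme_b, flow_ab, t):
        """Blend between two images at morph parameter t in [0, 1].

        flow_ab[..., 0] / flow_ab[..., 1] hold the row/column displacement that
        maps each pixel of viseme_a onto its corresponding pixel in viseme_b.
        """
        rows, cols = np.mgrid[0:viseme_a.shape[0], 0:viseme_a.shape[1]].astype(float)
        # Sample A partway along the correspondence, and B from the remainder.
        a_warp = map_coordinates(viseme_a, [rows + t * flow_ab[..., 0],
                                            cols + t * flow_ab[..., 1]], order=1)
        b_warp = map_coordinates(viseme_b, [rows - (1 - t) * flow_ab[..., 0],
                                            cols - (1 - t) * flow_ab[..., 1]], order=1)
        return (1 - t) * a_warp + t * b_warp

    # A viseme transition is a sequence of such frames for increasing t.
    a, b = np.zeros((64, 64)), np.ones((64, 64))
    flow = np.zeros((64, 64, 2))
    frames = [morph(a, b, flow, t) for t in np.linspace(0, 1, 10)]
    print(len(frames), frames[0].shape)
    ```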

    A FACIAL ANIMATION FRAMEWORK WITH EMOTIVE/EXPRESSIVE CAPABILITIES

    Get PDF
    LUCIA is an MPEG-4 facial animation system developed at ISTC-CNR. It works on standard Facial Animation Parameters and speaks with the Italian version of the FESTIVAL TTS. To achieve an emotive/expressive talking head, LUCIA was built from real human data physically extracted by the ELITE optotracking movement analyzer. LUCIA can copy a real human by reproducing the movements of passive markers positioned on the speaker's face and recorded by the ELITE device, or it can be driven by an emotional XML-tagged input text, thus realizing a true audio/visual emotive/expressive synthesis. Synchronization between visual and audio data is very important in order to create the correct WAV and FAP files needed for the animation. LUCIA's voice is based on the ISTC Italian version of the FESTIVAL-MBROLA packages, modified by means of an appropriate APML/VSML tagged language. LUCIA is available in two different versions: an open source framework and the "work in progress" WebGL.

    Beckett the Spiritist: Breath and its Media Drama

    Get PDF
    Most of the critical attention devoted to Breath has been focused on its adaptations and the affinities between its theatrical realization and the visual arts. Tracing Beckett’s ambivalent attitude towards the staging of the play, this article offers a closer analysis of Breath as a textual artefact. It discusses various published and unpublished versions of the script and their relation to the sketch’s infamous ‘appropriation’ in its first production as part of the revue Oh! Calcutta!, in an attempt to reconstruct three episodes of a media drama that unfolds in and around the play.

    LUCIA: An open source 3D expressive avatar for multimodal h.m.i.

    Get PDF
    LUCIA is an MPEG-4 facial animation system developed at ISTC-CNR. It works on standard Facial Animation Parameters and speaks with the Italian version of the FESTIVAL TTS. To achieve an emotive/expressive talking head, LUCIA was built from real human data physically extracted by the ELITE optotracking movement analyzer. LUCIA can copy a real human by reproducing the movements of passive markers positioned on the speaker's face and recorded by the ELITE device, or it can be driven by an emotional XML-tagged input text, thus realizing a true audio/visual emotive/expressive synthesis. Synchronization between visual and audio data is very important in order to create the correct WAV and FAP files needed for the animation. LUCIA's voice is based on the ISTC Italian version of the FESTIVAL-MBROLA packages, modified by means of an appropriate APML/VSML tagged language. LUCIA is available in two different versions: an open source framework and the "work in progress" WebGL.
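
    As a rough illustration of the WAV/FAP synchronization mentioned in both LUCIA entries, the sketch below derives the number of animation frames from the audio duration; the 25 fps frame rate and the tab-separated output layout are assumptions, not LUCIA's actual file format.

    ```python
    # Hedged sketch: keep facial-animation-parameter (FAP) frames aligned with
    # the synthesized waveform by computing the frame count from audio duration.
    import wave

    FAP_FPS = 25  # assumed animation frame rate

    def frames_for_audio(wav_path, fps=FAP_FPS):
        """Return how many FAP frames are needed to cover the WAV file."""
        with wave.open(wav_path, "rb") as w:
            duration = w.getnframes() / float(w.getframerate())
        return round(duration * fps)

    def write_fap(path, fap_frames):
        """Write one line of FAP values per animation frame (illustrative layout)."""
        with open(path, "w") as f:
            for i, values in enumerate(fap_frames):
                f.write("\t".join([str(i)] + [f"{v:.3f}" for v in values]) + "\n")

    # Usage (hypothetical files):
    # n = frames_for_audio("utterance.wav")
    # write_fap("utterance.fap", fap_frames)
    ```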

    Predicting Hungarian sound durations for continuous speech

    Get PDF
    Direct measurements show that a number of factors influence the final value of sound durations in continuous speech. On the segmental level, it is mainly the articulatory movements that determine the important influencing factors, while on the suprasegmental level, accent, syllabic stress, within-word position, the preceding and following syllables, and finally utterance position may influence the final sound durations. The problem of how to predict sound durations can thus be described by a multivariable function in which the effect of the individual variables cannot easily be defined with good accuracy. It is difficult to separate the effects of the individual factors, i.e., it is difficult to model this function by making direct measurements on the speech signal. A model has been constructed and realized in which three well-defined levels work separately. In the first one (the segmental level), the separation of the effect of articulation from the other factors is solved. The second and third levels relate to the suprasegmental level of speech.
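
    The multivariable duration function can be pictured as a base segmental duration modified by suprasegmental factors. The sketch below uses a simple multiplicative combination with placeholder values; it is not the Hungarian model or data reported in the paper.

    ```python
    # Schematic two-level duration predictor: an inherent (segmental) duration
    # adjusted by suprasegmental correction factors. All values are hypothetical.
    INHERENT_MS = {"a": 110.0, "t": 70.0, "s": 95.0}   # segmental level (assumed)

    SUPRASEGMENTAL = {            # multiplicative corrections (assumed)
        "stressed": 1.15,
        "word_final": 1.10,
        "utterance_final": 1.30,
    }

    def predict_duration(phone, stressed=False, word_final=False, utterance_final=False):
        """Combine the segmental base value with the active suprasegmental factors."""
        d = INHERENT_MS[phone]
        if stressed:
            d *= SUPRASEGMENTAL["stressed"]
        if word_final:
            d *= SUPRASEGMENTAL["word_final"]
        if utterance_final:
            d *= SUPRASEGMENTAL["utterance_final"]
        return d

    print(predict_duration("a", stressed=True, utterance_final=True))  # ~164 ms
    ```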

    Text-based Editing of Talking-head Video

    No full text
    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript; an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
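
    A toy sketch of the segment-selection idea: given the phoneme sequence of an edited word, search the annotated corpus for a matching span whose per-frame parameters can be reused. The paper describes an optimization strategy; this exact-match search is only illustrative, not the authors' algorithm.

    ```python
    # Find a stretch of the annotated input corpus whose phoneme labels match the
    # edited text, so its per-frame face parameters can serve as base material.
    from typing import List, Optional, Tuple

    def find_segment(corpus_phones: List[str], target: List[str]) -> Optional[Tuple[int, int]]:
        """Return (start, end) indices of the first corpus span matching the target."""
        n, m = len(corpus_phones), len(target)
        for start in range(n - m + 1):
            if corpus_phones[start:start + m] == target:
                return start, start + m
        return None  # in practice, fall back to shorter sub-sequences or blending

    corpus = ["sil", "h", "eh", "l", "ow", "w", "er", "l", "d", "sil"]
    print(find_segment(corpus, ["w", "er", "l", "d"]))  # (5, 9)
    ```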

    GMM Mapping Of Visual Features of Cued Speech From Speech Spectral Features

    No full text
    In this paper, we present a statistical method based on GMM modeling to map acoustic speech spectral features to the visual features of Cued Speech under the Minimum Mean-Square Error (MMSE) regression criterion at a low signal level, which is innovative and differs from the classic text-to-visual approach. Two training methods for the GMM, namely the Expectation-Maximization (EM) approach and a supervised training method, are discussed. For comparison with the GMM-based mapping model, we first present results obtained with a Multiple Linear Regression (MLR) model, also at the low signal level, and study the limitations of that approach. The experimental results demonstrate that the GMM-based mapping method can significantly improve the mapping performance compared with the MLR model, especially when the linear correlation between the target and the predictor is weak, as is the case for the hand positions of Cued Speech and the acoustic speech spectral features.
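
    The GMM-based MMSE mapping can be written as a mixture of per-component linear regressions, E[y|x] = sum_k p(k|x) (mu_y^k + Sigma_yx^k (Sigma_xx^k)^{-1} (x - mu_x^k)). The sketch below implements that conditional expectation with placeholder (untrained) parameters; the full covariances and toy dimensions are assumptions, not the Cued Speech setup of the paper.

    ```python
    # GMM-based MMSE regression from acoustic to visual features, given a joint
    # GMM over the stacked [acoustic; visual] vector. Parameters are placeholders.
    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_mmse_map(x, weights, means, covs, dx):
        """Map an acoustic vector x (dim dx) to visual features via MMSE regression."""
        resp = np.array([w * multivariate_normal.pdf(x, m[:dx], C[:dx, :dx])
                         for w, m, C in zip(weights, means, covs)])
        resp /= resp.sum()                      # posterior p(k | x)
        y = np.zeros(means[0].shape[0] - dx)
        for r, m, C in zip(resp, means, covs):
            cond = m[dx:] + C[dx:, :dx] @ np.linalg.solve(C[:dx, :dx], x - m[:dx])
            y += r * cond                       # mixture of conditional means
        return y

    # Toy example: 2 components, 3 acoustic dims, 2 visual dims.
    rng = np.random.default_rng(1)
    dx, dy, K = 3, 2, 2
    means = rng.normal(size=(K, dx + dy))
    covs = np.array([np.eye(dx + dy) for _ in range(K)])
    print(gmm_mmse_map(rng.normal(size=dx), np.array([0.5, 0.5]), means, covs, dx))
    ```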

    Estoñol, a computer-assisted pronunciation training tool for Spanish L1 speakers to improve the pronunciation and perception of Estonian vowels

    Get PDF
    Over the past few years the number of online language teaching materials for non-native speakers of Estonian has increased. However, they focus mainly on vocabulary and pay little attention to pronunciation. In this study we introduce a computer-assisted pronunciation training tool, Estoñol, developed to help native speakers of Spanish train their perception and production of Estonian vowels. The tool's training program involves seven vowel contrasts, /i-y/, /u-y/, /ɑ-o/, /ɑ-æ/, /e-æ/, /o-ø/, and /o-ɤ/, which have proven to be difficult for native speakers of Spanish. The training activities include theoretical videos and four training modes (exposure, discrimination, pronunciation, and mixed) in every lesson. The tool is integrated into a pre/post-test design experiment with native speakers of Spanish and Estonian to assess the language learners' perception and production improvement. It is expected that the tool will have a positive effect on the results, as has been shown in previous studies using similar methodology. Kokkuvõte (Summary). Katrin Leppik and Cristian Tejedor-García: Estoñol, a mobile application for Spanish-speaking learners of Estonian to train the pronunciation and perception of vowels. Several e-courses and mobile applications have been created for learning Estonian, but they concentrate mainly on teaching vocabulary and grammar and pay very little attention to pronunciation. To make acquiring Estonian pronunciation easier, the mobile application Estoñol was developed for learners of Estonian whose native language is Spanish. Previous studies have shown that the pronunciation of the vowels /ɑ, y, ø, æ, ɤ/ is difficult for Spanish-speaking learners of Estonian. The content of the application is divided into seven chapters, in which the perception and pronunciation of the vowel pairs /i-y/, /u-y/, /ɑ-o/, /ɑ-æ/, /e-æ/, /o-ø/, /o-ɤ/ can be practised. Each chapter begins with a theoretical video, followed by perception and pronunciation exercises. An experiment is planned to assess the application's effect on learners' pronunciation and perception. Keywords: CAPT, Estonian, Spanish, L2, pronunciation, perception, vowels, Estoñol.
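
    Purely as an illustration of the lesson layout described above (seven vowel contrasts, a theory video, and four training modes per lesson), here is a hypothetical data structure; the field names are assumptions, and only the contrasts and mode names come from the abstract.

    ```python
    # Illustrative lesson layout for a CAPT tool of this kind (not Estoñol's code).
    VOWEL_CONTRASTS = ["i-y", "u-y", "ɑ-o", "ɑ-æ", "e-æ", "o-ø", "o-ɤ"]
    TRAINING_MODES = ["exposure", "discrimination", "pronunciation", "mixed"]

    lessons = [
        {"contrast": c, "video": f"theory_{i + 1}", "modes": list(TRAINING_MODES)}
        for i, c in enumerate(VOWEL_CONTRASTS)
    ]
    print(len(lessons), lessons[0]["contrast"])  # 7 lessons, the first trains /i-y/
    ```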