
    Expressive Viseme Group on Facial Animation Model For Indonesian Language

    ABSTRACT: Visemes are the visual counterparts of phonemes. In speech synthesis, the viseme is the smallest unit needed to generate an utterance for a talking-head model. Visemes depend on the pronunciation of the language being used and cannot be applied generically across languages. In addition, visemes depend on the expression the model shows while speaking, because expression changes which facial muscles are actively contracted (conflicting-muscle conditions), and this leads to a very large number of viseme combinations when building a talking-head model. This thesis addresses the classification of those combinations. The visemes used in this study are based on the Indonesian language and are limited to the consonant-vowel (CV) syllable pattern, combined with expression in the lower face area. Classification was performed using 19 key points as a parameterized representation of the facial muscles and 1 reference point as the standard for image normalization. The facial animation model is then generated from the resulting groups: free-form deformation (FFD) is used to deform the model, and Bezier curves are used to generate the motion between reference keyframes. The study groups 315 viseme combinations into 26 classes. The measured similarity of motion between the facial animation model and a real human video reaches 92.3%. In other words, the viseme groups produced in this thesis are effective for producing a realistic perception in the speech animation model.
    Keywords: realistic speech model, viseme grouping, co-articulation, expression
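
    The abstract does not give implementation details, so the sketch below only illustrates the keyframing step it describes: easing the 19 normalized feature points between two viseme keyframes along a cubic Bezier curve. The function names, the (19, 2) point layout, and the easing control values are assumptions for illustration, not the thesis's actual code.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return (u ** 3) * p0 + 3 * (u ** 2) * t * p1 + 3 * u * (t ** 2) * p2 + (t ** 3) * p3

def interpolate_viseme(start_points, end_points, num_frames, ease=0.33):
    """Generate in-between frames between two viseme keyframes.

    start_points, end_points: arrays of shape (19, 2), the feature points
    already normalized against the single reference point described above.
    The scalar Bezier curve is used purely as an ease-in/ease-out timing
    function between the two keyframes.
    """
    frames = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)
        s = cubic_bezier(0.0, ease, 1.0 - ease, 1.0, t)  # eased progress in [0, 1]
        frames.append((1.0 - s) * start_points + s * end_points)
    return np.stack(frames)
```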

    Text-based Editing of Talking-head Video

    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
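
    The abstract describes the segment-selection step only at a high level (an optimization strategy chooses segments of the annotated input corpus that cover the edited transcript). The sketch below is a toy greedy stand-in for that idea, matching runs of phonemes in the edited transcript against the corpus so that their per-frame parameters could be reused; the function names and the matching strategy are assumptions, not the paper's method.

```python
from typing import Dict, List

def find_base_segments(edited_phonemes: List[str],
                       corpus_phonemes: List[str]) -> List[Dict[str, int]]:
    """Greedily cover the edited phoneme sequence with the longest
    matching runs found in the annotated corpus."""
    segments = []
    i = 0
    while i < len(edited_phonemes):
        best_len, best_start = 0, -1
        for j in range(len(corpus_phonemes)):
            k = 0
            while (i + k < len(edited_phonemes)
                   and j + k < len(corpus_phonemes)
                   and edited_phonemes[i + k] == corpus_phonemes[j + k]):
                k += 1
            if k > best_len:
                best_len, best_start = k, j
        if best_len == 0:
            # Phoneme not present in the corpus; a real system would back
            # off to smaller units or synthesis. Here we simply skip it.
            i += 1
            continue
        segments.append({"corpus_start": best_start, "length": best_len})
        i += best_len
    return segments
```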

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and their performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp between synthesized visual speech parameters and the respective ground-truth parameters is a better indicator of subjective quality.
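
    As a concrete reading of the proposed measure, a minimal dynamic time warping (DTW) cost over per-frame AAM parameter trajectories might look like the sketch below. The Euclidean frame distance and the absence of path-length normalization are assumptions; the paper's exact formulation is not given in the abstract.

```python
import numpy as np

def dtw_cost(synth: np.ndarray, truth: np.ndarray) -> float:
    """DTW cost between two parameter trajectories of shape (T, D),
    e.g. per-frame AAM parameters of synthesized vs. ground-truth
    visual speech."""
    n, m = len(synth), len(truth)
    # Pairwise Euclidean distances between all frame pairs.
    dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return float(acc[n, m])
```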

    Capture, Learning, and Synthesis of 3D Speaking Styles

    Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation), takes any speech signal as input - even speech in languages other than English - and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de. To appear in CVPR 2019.
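
    The abstract sketches the key modeling choices (a speech signal as input, conditioning on training-subject labels, and identity factored from facial motion). The PyTorch sketch below illustrates only that conditioning idea, predicting per-vertex offsets that are added to a neutral template mesh; it is not the released VOCA architecture, and the layer sizes, feature dimension, and vertex count are placeholders (the actual model is available at the project page above).

```python
import torch
import torch.nn as nn

class SpeechToOffsets(nn.Module):
    """Map per-frame speech features plus a one-hot subject label to
    vertex offsets, added to a neutral face template so that identity
    (the template) stays factored from motion (the offsets)."""

    def __init__(self, audio_dim=29, num_subjects=12, num_vertices=5023):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + num_subjects, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.decoder = nn.Linear(256, num_vertices * 3)

    def forward(self, audio_feat, subject_onehot, template_vertices):
        # audio_feat: (B, audio_dim); subject_onehot: (B, num_subjects);
        # template_vertices: (B, num_vertices, 3), neutral face of the target.
        x = torch.cat([audio_feat, subject_onehot], dim=-1)
        offsets = self.decoder(self.encoder(x)).view(template_vertices.shape)
        return template_vertices + offsets
```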