
    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that takes acoustic features as input, and one that takes a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that for sequences generated by our synthesizers. This observation motivates further consideration of an often-ignored question: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp between synthesized visual speech parameters and the corresponding ground-truth parameters is a better indicator of subjective quality.
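    The proposed metric is the accumulated cost of a dynamic time warp (DTW) aligning synthesized parameter trajectories to ground truth. Below is a minimal sketch of that computation, assuming the AAM parameters are available as NumPy arrays of shape (frames, dims); the Euclidean frame distance and length normalisation are illustrative choices, not necessarily the authors' exact setup.

```python
import numpy as np

def dtw_cost(synth, truth):
    """Accumulated cost of a dynamic time warp aligning two parameter
    trajectories, each of shape (frames, dims)."""
    n, m = len(synth), len(truth)
    # Pairwise Euclidean distances between every synth/truth frame pair.
    dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # skip a synthesized frame
                acc[i, j - 1],      # skip a ground-truth frame
                acc[i - 1, j - 1],  # match the two frames
            )
    # Normalise by sequence length so costs are comparable across clips.
    return acc[n, m] / (n + m)
```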

    Capture, Learning, and Synthesis of 3D Speaking Styles

    Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance remains unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps, together with synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation), takes any speech signal as input (even speech in languages other than English) and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de. To appear in CVPR 2019.
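    The abstract describes conditioning on subject labels so one network learns per-speaker styles that can be swapped at animation time. Below is a minimal sketch of that conditioning idea in PyTorch, assuming speech features are concatenated with a one-hot subject label and decoded to per-vertex offsets; the layer sizes, names, and vertex count are illustrative, not the released VOCA architecture.

```python
import torch
import torch.nn as nn

class SpeechToVertexOffsets(nn.Module):
    """Maps a window of speech features plus a one-hot subject label
    to per-vertex displacements of a template face mesh."""
    def __init__(self, speech_dim=256, num_subjects=12, num_vertices=5023):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(speech_dim + num_subjects, 128),
            nn.ReLU(),
            nn.Linear(128, num_vertices * 3),
        )

    def forward(self, speech_feats, subject_onehot):
        # Concatenating the label lets a single network learn a distinct
        # speaking style per training subject; at test time the label can
        # be changed to alter the style applied to an unseen face.
        x = torch.cat([speech_feats, subject_onehot], dim=-1)
        out = self.decoder(x)
        return out.view(out.shape[0], -1, 3)

model = SpeechToVertexOffsets()
feats = torch.randn(1, 256)                    # one speech-feature window
label = torch.zeros(1, 12); label[0, 3] = 1.0  # style of training subject 3
offsets = model(feats, label)                  # (1, num_vertices, 3)
```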

    An introduction to statistical parametric speech synthesis


    Expressive visual text to speech and expression adaptation using deep neural networks

    In this paper, we present an expressive visual text-to-speech (VTTS) system based on a deep neural network (DNN). Given an input text sentence and a set of expression tags, the VTTS produces not only the audio speech, but also the accompanying facial movements. The expression can be either one of the expressions in the training corpus or a blend of expressions from the training corpus. Furthermore, we present a method of adapting a previously trained DNN to include a new expression using a small amount of training data. Experiments show that the proposed DNN-based VTTS is preferred 57.9% of the time over a baseline hidden Markov model based VTTS that uses cluster adaptive training.
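    The abstract notes that the output expression can be a blend of expressions from the training corpus. One common way to realise this is to feed the DNN a per-expression weight vector rather than a single one-hot tag; the sketch below illustrates that control-vector construction under this assumption (the tag set and weighting scheme are illustrative, not necessarily the paper's mechanism).

```python
import numpy as np

# Expression tags seen in training, in a fixed order (illustrative set).
EXPRESSIONS = ["neutral", "happy", "sad", "angry"]

def expression_control(weights):
    """Build the auxiliary control vector fed to the synthesis DNN.
    A single tag gives a one-hot vector; a dict of weights gives a
    convex blend of training expressions."""
    vec = np.zeros(len(EXPRESSIONS))
    for tag, w in weights.items():
        vec[EXPRESSIONS.index(tag)] = w
    total = vec.sum()
    if total > 0:
        vec /= total  # normalise so the weights sum to one
    return vec

ctrl = expression_control({"happy": 0.7, "sad": 0.3})  # a 70/30 blend
```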

    Information theory and representation in associative word learning

    A significant portion of early language learning can be viewed as an associative learning problem. We investigate associative language learning based on the principle that words convey Shannon information about the environment. We discuss shortcomings in the representations used by previous associative word learners and propose a functional representation that not only denotes environmental categories but also serves as the basis for activity and interaction with the environment. We present experimental results with an autonomous agent acquiring language.
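    The abstract's premise is that words convey Shannon information about the environment. Below is a minimal sketch of quantifying that association as the mutual information between word occurrences and sensed environmental categories, estimated from co-occurrence counts; the toy data and representation are illustrative, not the paper's agent setup.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information I(W; C) between heard words and sensed
    categories, estimated from (word, category) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    words = Counter(w for w, _ in pairs)
    cats = Counter(c for _, c in pairs)
    mi = 0.0
    for (w, c), k in joint.items():
        p_wc = k / n
        mi += p_wc * math.log2(p_wc / ((words[w] / n) * (cats[c] / n)))
    return mi

# Toy observations: the agent hears a word while sensing a category.
samples = [("ball", "round"), ("ball", "round"), ("cup", "container"),
           ("ball", "round"), ("cup", "container"), ("ball", "container")]
print(mutual_information(samples))  # higher => words are more informative
```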

    Making Faces - State-Space Models Applied to Multi-Modal Signal Processing
