
    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and their performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than for sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters against the respective ground-truth parameters is a better indicator of subjective quality.
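    As a rough illustration of the kind of measure the abstract refers to, the sketch below computes an accumulated dynamic-time-warp cost between a synthesized and a ground-truth visual parameter trajectory. This is not the authors' code; the array shapes, the Euclidean local distance, and the length normalization are assumptions.

    import numpy as np

    def dtw_cost(synth: np.ndarray, truth: np.ndarray) -> float:
        """Accumulated DTW cost between two (frames x parameters) sequences."""
        t1, t2 = len(synth), len(truth)
        # Local frame-to-frame Euclidean distances, shape (t1, t2).
        local = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
        # Accumulated cost with the standard step pattern (match, insert, delete).
        acc = np.full((t1 + 1, t2 + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, t1 + 1):
            for j in range(1, t2 + 1):
                acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                      acc[i - 1, j],
                                                      acc[i, j - 1])
        # Normalize by sequence length so scores are comparable across utterances.
        return acc[t1, t2] / (t1 + t2)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        truth = rng.normal(size=(120, 16))   # e.g. 120 frames, 16 AAM parameters (illustrative)
        synth = truth + rng.normal(scale=0.1, size=truth.shape)
        print(f"DTW cost: {dtw_cost(synth, truth):.4f}")

    A lower cost indicates that the synthesized trajectory tracks the ground truth more closely, which the abstract reports to correlate better with subjective ratings than the usual objective measures.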

    A Practical Model for Live Speech-Driven Lip-Sync


    Audiovisual Generation of Social Attitudes from Neutral Stimuli

    The focus of this study is the generation of expressive audiovisual speech from neutral utterances for 3D virtual actors. Taking into account the segmental and suprasegmental aspects of audiovisual speech, we propose and compare several computational frameworks for the generation of expressive speech and face animation. In particular, we compare a standard frame-based conversion approach against two other methods that postulate the existence of global prosodic audiovisual patterns characteristic of social attitudes. The proposed approaches are tested on a database of "Exercises in Style" [1] performed by two semi-professional actors, and the results are evaluated using crowdsourced perceptual tests. The first test performs a qualitative validation of the animation platform, while the second is a comparative study between several expressive speech generation methods. We evaluate how the expressiveness of our audiovisual performances is perceived in comparison to resynthesized original utterances and the outputs of a purely frame-based conversion system.

    Synthesizing mood-affected signed messages: Modifications to the parametric synthesis

    This is the author's version of a work that was accepted for publication in the International Journal of Human-Computer Studies. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document, and changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in International Journal of Human-Computer Studies, 70(4), 2012, DOI: 10.1016/j.ijhcs.2011.11.003.
    This paper describes the first approach to synthesizing mood-affected signed content. The research focuses on the modifications applied to a parametric sign language synthesizer (based on phonetic descriptions of the signs). We propose modifications that allow for the synthesis of different perceived frames of mind within synthetic signed messages. Three of these proposals focus on modifications to three different phonologic parameters of the signs (the hand shape, the movement and the non-hand parameter). The other two proposals focus on the temporal aspects of the synthesis (sign speed and transition duration) and on the representation of muscular tension through inverse kinematics procedures. The resulting variations have been evaluated by Spanish deaf signers, who concluded that our system can generate the same signed message with three different frames of mind that are correctly identified by Spanish Sign Language signers.

    Making Faces - State-Space Models Applied to Multi-Modal Signal Processing


    Realistic and expressive talking head : implementation and evaluation

    [no abstract]

    Visual Speech Synthesis using Dynamic Visemes and Deep Learning Architectures

    The aim of this work is to improve on existing methods for the naturalness of visual speech synthesized automatically from a linguistic input. Firstly, and most importantly, we investigate the most suitable speech units for visual speech synthesis. We propose the use of dynamic visemes instead of phonemes or static visemes and find that dynamic visemes generate better visual speech than either phoneme or static viseme units; the best performance is obtained by a combined phoneme-dynamic viseme system. Secondly, we examine the most appropriate model among hidden Markov models (HMMs) and several deep learning models, including feedforward and recurrent structures with one-to-one, many-to-one and many-to-many architectures. Results suggest that frame-by-frame synthesis with the deep learning approaches outperforms state-based synthesis with the HMM approach, and that an encoder-decoder many-to-many architecture is better than the one-to-one and many-to-one architectures. Thirdly, we explore the importance of contextual features that include information at varying linguistic levels, from the frame level up to the utterance level. We find that frame-level information is the most valuable feature, as it avoids discontinuities in the visual feature sequence and produces a smooth and realistic animation output. Fourthly, we find that the two most common objective measures, correlation and root mean square error, do not reliably indicate the realism and naturalness of human-perceived quality. We introduce an alternative objective measure and show that the global variance is a better indicator of human perception of quality. Finally, we propose a novel method to convert a given text input and phoneme transcription into a dynamic viseme transcription when a reference dynamic viseme sequence is not available. Subjective preference tests confirm that our proposed method is able to produce animations that are statistically indistinguishable from animations produced using reference data.
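    As a rough illustration of the global-variance measure the abstract points to, the sketch below compares the per-dimension variance of a synthesized visual parameter trajectory against that of the ground truth. This is not the thesis code; the array shapes and the ratio summary are assumptions.

    import numpy as np

    def global_variance(params: np.ndarray) -> np.ndarray:
        """Per-dimension variance over all frames of a (frames x parameters) sequence."""
        return params.var(axis=0)

    def gv_ratio(synth: np.ndarray, truth: np.ndarray) -> float:
        """Mean ratio of synthesized to ground-truth global variance.

        Values well below 1 indicate over-smoothed, 'flat' trajectories, which
        viewers tend to perceive as unnatural even when RMSE is low.
        """
        return float(np.mean(global_variance(synth) / global_variance(truth)))

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        truth = rng.normal(size=(200, 30))   # e.g. 200 frames, 30 visual parameters (illustrative)
        synth = 0.5 * truth                  # an over-smoothed synthetic trajectory
        print(f"GV ratio: {gv_ratio(synth, truth):.2f}")  # ~0.25, flags over-smoothing

    The point of such a measure is that a low-variance output can score well on RMSE while still looking unnaturally damped, which is consistent with the abstract's finding that global variance tracks perceived quality better than RMSE or correlation.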

    Higher level techniques for the artistic rendering of images and video

    EThOS - Electronic Theses Online Service, United Kingdom