Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation) takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de. Comment: To appear in CVPR 2019.
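The conditioning mechanism the abstract describes can be illustrated with a toy sketch: audio features for a frame are concatenated with a one-hot subject (speaking-style) label and mapped to per-vertex displacements on an identity template mesh. All dimensions, names, and the linear mapping here are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

# Illustrative sketch only: speech-driven facial animation with
# subject-label conditioning, loosely inspired by the VOCA description.
# Dimensions and the linear map are assumptions, not from the paper.

N_VERTICES = 5023      # FLAME-style mesh size (assumption)
AUDIO_DIM = 29         # e.g. per-frame speech-feature size (assumption)
N_SUBJECTS = 12        # matches the 12 speakers in the dataset

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(AUDIO_DIM + N_SUBJECTS, N_VERTICES * 3))

def animate_frame(audio_feat, subject_id, template_verts):
    """Map one audio frame + speaking-style label to displaced vertices."""
    style = np.zeros(N_SUBJECTS)
    style[subject_id] = 1.0                   # one-hot style conditioning
    x = np.concatenate([audio_feat, style])
    offsets = (x @ W).reshape(N_VERTICES, 3)  # per-vertex displacement
    return template_verts + offsets           # animate the identity mesh

template = np.zeros((N_VERTICES, 3))          # stand-in identity shape
frame = animate_frame(rng.normal(size=AUDIO_DIM), subject_id=3,
                      template_verts=template)
print(frame.shape)                            # (5023, 3)
```

Because style enters only as a one-hot input, changing `subject_id` at inference time changes the speaking style without retraining, which is the animator control the abstract mentions.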
Predicting Head Pose from Speech with a Conditional Variational Autoencoder
Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to how acceptable we, as human observers, find an animation. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek a transformation from the speech mode to predict the head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced dialogue. Natural, expressive, emotive, and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. Recently, Long Short-Term Memory (LSTM) networks have become an important tool for modelling speech and natural language tasks. We employ Deep Bi-Directional LSTMs (BLSTMs), capable of learning long-term structure in language, to model the relationship that speech has with rigid head motion. We then extend our model by conditioning on prior motion. Finally, we introduce a generative head motion model, conditioned on audio features using a Conditional Variational Autoencoder (CVAE). Each approach mitigates the problems of the one-to-many mapping that a speech-to-head-pose model must accommodate.
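The CVAE's handling of the one-to-many mapping can be sketched at sampling time: a latent variable is drawn from the prior and decoded together with the audio condition, so one audio frame yields a distribution of plausible poses rather than a single point estimate. The dimensions, names, and linear decoder below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sketch of the CVAE *sampling* path for head-pose
# prediction: draw z from the prior, decode jointly with audio features.
# Names and dimensions are assumptions, not from the paper.

LATENT_DIM, AUDIO_DIM, POSE_DIM = 8, 26, 3   # pose = pitch, yaw, roll

rng = np.random.default_rng(1)
W_dec = rng.normal(0, 0.1, size=(LATENT_DIM + AUDIO_DIM, POSE_DIM))

def sample_pose(audio_feat, n_samples=5):
    """Decode several latent samples for one audio frame: the one-to-many
    mapping yields a spread of plausible head poses, not a point."""
    z = rng.normal(size=(n_samples, LATENT_DIM))   # z ~ N(0, I) prior
    cond = np.tile(audio_feat, (n_samples, 1))     # condition on audio
    return np.concatenate([z, cond], axis=1) @ W_dec

poses = sample_pose(rng.normal(size=AUDIO_DIM))
print(poses.shape)    # (5, 3): five plausible poses for the same audio
```

The spread across the five samples comes entirely from the latent draw, which is how a CVAE represents the residual variability in head motion that the audio alone cannot determine.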
Expressive visual text to speech and expression adaptation using deep neural networks
In this paper, we present an expressive visual text to speech system (VTTS) based on a deep neural network (DNN). Given an input text sentence and a set of expression tags, the VTTS is able to produce not only the audio speech, but also the accompanying facial movements. The expressions can either be one of the expressions in the training corpus or a blend of expressions from the training corpus. Furthermore, we present a method of adapting a previously trained DNN to include a new expression using a small amount of training data. Experiments show that the proposed DNN-based VTTS is preferred by 57.9% over the baseline hidden Markov model based VTTS, which uses cluster adaptive training.
A longitudinal study of audiovisual speech perception by hearing-impaired children with cochlear implants
The present study investigated the development of audiovisual speech perception skills in children who are prelingually deaf and received cochlear implants. We analyzed results from the Pediatric Speech Intelligibility (Jerger, Lewis, Hawkins, & Jerger, 1980) test of audiovisual spoken word and sentence recognition skills obtained from a large group of young children with cochlear implants enrolled in a longitudinal study, from pre-implantation to 3 years post-implantation. The results revealed better performance under the audiovisual presentation condition compared with auditory-alone and visual-alone conditions. Performance in all three conditions improved over time following implantation. The results also revealed differential effects of early sensory and linguistic experience. Children from oral communication (OC) education backgrounds performed better overall than children from total communication (TC) backgrounds. Finally, children in the early-implanted group performed better than children in the late-implanted group in the auditory-alone presentation condition after 2 years of cochlear implant use, whereas children in the late-implanted group performed better than children in the early-implanted group in the visual-alone condition. The results of the present study suggest that measures of audiovisual speech perception may provide new methods to assess hearing, speech, and language development in young children with cochlear implants.
Visual to auditory silent matching task in adults who do and do not stutter
The purpose of the present study was to investigate the role of phonological working memory in adults who do and do not stutter through a visual to auditory silent matching task. This task also explored the possible relationship between auditory processing and its ability to affect performance on the task. Participants were 13 adults who stutter (mean age = 28 years), matched in age, gender, handedness, and education level with 13 adults who do not stutter (mean age = 28 years). For the nonvocal visual to auditory task, participants silently read an initial target nonword and matched that target nonword to four subsequent auditory nonword choices. The participants completed this task for 4-syllable and 7-syllable nonwords (N = 8 per set). Results indicated that adults who stutter were significantly less accurate than adults who do not stutter at both syllable lengths. Our present findings support previous research that suggests less efficient phonological working memory in adults who stutter.