
    Capture, Learning, and Synthesis of 3D Speaking Styles

    Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation), takes any speech signal as input - even speech in languages other than English - and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de. Comment: To appear in CVPR 2019.
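
    The abstract describes a speech-driven regressor that separates identity (a static template) from speech-driven motion (per-vertex offsets), conditioned on a subject label for speaking style. The sketch below is a minimal, hedged illustration of that idea, not the authors' architecture; the class name, layer sizes, and feature dimensions (e.g. 29-dimensional per-frame speech features, 5023 vertices) are assumptions.

```python
# Hedged sketch of an identity/motion-factorized audio-to-vertex regressor,
# loosely in the spirit of VOCA. Names and dimensions are illustrative only.
import torch
import torch.nn as nn

class SpeechToVertexOffsets(nn.Module):
    def __init__(self, audio_feat_dim=29, num_subjects=12, num_vertices=5023):
        super().__init__()
        # Subject embedding acts as the speaking-style conditioning label.
        self.subject_embed = nn.Embedding(num_subjects, 8)
        self.encoder = nn.Sequential(
            nn.Linear(audio_feat_dim + 8, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Decoder predicts per-vertex offsets added to a static identity template.
        self.decoder = nn.Linear(128, num_vertices * 3)

    def forward(self, audio_feats, subject_id, identity_template):
        # audio_feats: (B, audio_feat_dim) per-frame speech features
        # subject_id: (B,) integer style labels
        # identity_template: (B, num_vertices, 3) neutral face of the target identity
        style = self.subject_embed(subject_id)
        h = self.encoder(torch.cat([audio_feats, style], dim=-1))
        offsets = self.decoder(h).view(-1, identity_template.shape[1], 3)
        # Motion is applied on top of any identity template, which is what lets
        # such a model generalize to unseen subjects without retargeting.
        return identity_template + offsets
```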

    Predicting Head Pose from Speech with a Conditional Variational Autoencoder

    Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to the degree we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek a transformation from the speech mode to predict the head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced dialogue. Natural, expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. Recently, Long Short-Term Memory (LSTM) networks have become an important tool for modelling speech and natural language tasks. We employ Deep Bi-Directional LSTMs (BLSTMs), capable of learning long-term structure in language, to model the relationship that speech has with rigid head motion. We then extend our model by conditioning with prior motion. Finally, we introduce a generative head motion model, conditioned on audio features using a Conditional Variational Autoencoder (CVAE). Each approach mitigates the problems of the one-to-many mapping that a speech-to-head-pose model must accommodate.
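
    The key point of the CVAE variant is handling the one-to-many mapping: the same speech can plausibly accompany different head motions, so sampling a latent variable yields multiple valid poses. Below is a minimal, hedged sketch of an audio-conditioned CVAE for head-pose angles; dimensions and names are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: a conditional VAE mapping audio features to head-pose angles.
# Different latent samples z give different plausible poses for the same audio.
import torch
import torch.nn as nn

class AudioConditionedCVAE(nn.Module):
    def __init__(self, audio_dim=128, pose_dim=3, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(audio_dim + pose_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(audio_dim + latent_dim, 64), nn.ReLU(),
            nn.Linear(64, pose_dim))

    def forward(self, audio, pose):
        # Training: encode (audio, observed pose) into a latent distribution.
        h = self.enc(torch.cat([audio, pose], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.dec(torch.cat([audio, z], dim=-1))
        return recon, mu, logvar  # train with reconstruction + KL losses

    def sample(self, audio):
        # Test time: draw z from the prior; each draw gives a different motion.
        z = torch.randn(audio.shape[0], self.to_mu.out_features)
        return self.dec(torch.cat([audio, z], dim=-1))
```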

    Expressive visual text to speech and expression adaptation using deep neural networks

    In this paper, we present an expressive visual text-to-speech system (VTTS) based on a deep neural network (DNN). Given an input text sentence and a set of expression tags, the VTTS is able to produce not only the audio speech, but also the accompanying facial movements. The expressions can either be one of the expressions in the training corpus or a blend of expressions from the training corpus. Furthermore, we present a method of adapting a previously trained DNN to include a new expression using a small amount of training data. Experiments show that the proposed DNN-based VTTS is preferred by 57.9% over the baseline hidden Markov model-based VTTS, which uses cluster adaptive training.
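
    The abstract describes a DNN that maps linguistic features plus an expression specification (a corpus expression or a blend of them) to both acoustic and visual parameters, with adaptation to a new expression from little data. The sketch below is a hedged, simplified illustration of that setup; the feature dimensions, the ExpressiveVTTS name, and the adaptation recipe (freezing the shared network and learning only a new expression embedding) are assumptions, not the paper's method.

```python
# Hedged sketch: joint acoustic+visual parameter regression conditioned on a
# blendable expression vector. Dimensions and names are illustrative only.
import torch
import torch.nn as nn

class ExpressiveVTTS(nn.Module):
    def __init__(self, ling_dim=300, num_expressions=6, acoustic_dim=60, visual_dim=100):
        super().__init__()
        # One learned embedding row per training-corpus expression.
        self.expr_embed = nn.Parameter(torch.randn(num_expressions, 16))
        self.net = nn.Sequential(
            nn.Linear(ling_dim + 16, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, acoustic_dim + visual_dim))

    def forward(self, ling_feats, expr_weights):
        # expr_weights: (B, num_expressions); one-hot selects a corpus expression,
        # soft weights blend several expressions, as the abstract describes.
        expr = expr_weights @ self.expr_embed
        out = self.net(torch.cat([ling_feats, expr], dim=-1))
        return out  # split into acoustic and visual parameter streams downstream

# Possible adaptation to a new expression with little data (an assumption here):
# freeze self.net and self.expr_embed, add one new embedding row as a parameter,
# and optimize only that row on the small adaptation set.
```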

    A longitudinal study of audiovisual speech perception by hearing-impaired children with cochlear implants

    The present study investigated the development of audiovisual speech perception skills in children who are prelingually deaf and received cochlear implants. We analyzed results from the Pediatric Speech Intelligibility (Jerger, Lewis, Hawkins, & Jerger, 1980) test of audiovisual spoken word and sentence recognition skills obtained from a large group of young children with cochlear implants enrolled in a longitudinal study, from pre-implantation to 3 years post-implantation. The results revealed better performance under the audiovisual presentation condition compared with auditory-alone and visual-alone conditions. Performance in all three conditions improved over time following implantation. The results also revealed differential effects of early sensory and linguistic experience. Children from oral communication (OC) education backgrounds performed better overall than children from total communication (TC) backgrounds. Finally, children in the early-implanted group performed better than children in the late-implanted group in the auditory-alone presentation condition after 2 years of cochlear implant use, whereas children in the late-implanted group performed better than children in the early-implanted group in the visual-alone condition. The results of the present study suggest that measures of audiovisual speech perception may provide new methods to assess hearing, speech, and language development in young children with cochlear implants.