Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Authors: Liu Yu, Liu Ziwei, Luo Ping, Wang Xiaogang, Zhou Hang
Publication date: 23/04/2019

Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct specific face appearance model on specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representation. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has an advantage where both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.

Comment: AAAI Conference on Artificial Intelligence (AAAI 2019) Oral Presentation. Code, models, and video results are available on our webpage: https://liuziwei7.github.io/projects/TalkingFace.htm

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Capture, Learning, and Synthesis of 3D Speaking Styles

Authors: Black Michael J., Bolkart Timo, Cudeiro Daniel, Laidlaw Cassidy, Ranjan Anurag
Publication date: 01/01/2019

Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input - even speech in languages other than English - and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.

Comment: To appear in CVPR 201

arXiv.org e-Print Archive

Crossref

MPG.PuRe

A Photo-realistic Voice-bot

Author: Alexander Jorge
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019

Technology is at the point where systems are capable of synthesizing video of human actors that is indistinguishable from video in which the actor is present. This research investigates whether it is possible to use this technology to create a system that generates video of a human actor who can interact with a user through speech in real time while remaining indistinguishable from a real human actor: in other words, a photo-realistic voicebot. The work discusses the motivations and ethics, and also presents and tests a prototype system. The prototype aims to take advantage of the latest real-time video manipulation software to create a natural-sounding conversation with an artificially synthesized video.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Data-driven Communicative Behaviour Generation: A Survey

Authors: Aly A, Belpaeme T, Oralbayeva N, Sandygulova A
Publication venue: Association for Computing Machinery (ACM)
Publication date: 02/05/2024

The development of data-driven behaviour generating systems has recently become the focus of considerable attention in the fields of human–agent interaction and human–robot interaction. Although rule-based approaches were dominant for years, these proved inflexible and expensive to develop. The difficulty of developing production rules, as well as the need for manual configuration to generate artificial behaviours, places a limit on how complex and diverse rule-based behaviours can be. In contrast, actual human–human interaction data collected using tracking and recording devices makes humanlike multimodal co-speech behaviour generation possible using machine learning and specifically, in recent years, deep learning. This survey provides an overview of the state of the art of deep learning-based co-speech behaviour generation models and offers an outlook for future research in this area.

Plymouth Electronic Archive and Research Library