3,164 research outputs found
LaughTalk: Expressive 3D Talking Head Generation with Laughter
Laughter is a unique expression, essential to affirmative social interactions
of humans. Although current 3D talking head generation methods produce
convincing verbal articulations, they often fail to capture the vitality and
subtleties of laughter and smiles despite their importance in social context.
In this paper, we introduce a novel task to generate 3D talking heads capable
of both articulate speech and authentic laughter. Our newly curated dataset
comprises 2D laughing videos paired with pseudo-annotated and human-validated
3D FLAME parameters and vertices. Given our proposed dataset, we present a
strong baseline with a two-stage training scheme: the model first learns to
talk and then acquires the ability to express laughter. Extensive experiments
demonstrate that our method performs favorably compared to existing approaches
in both talking head generation and expressing laughter signals. We further
explore potential applications on top of our proposed method for rigging
realistic avatars.Comment: Accepted to WACV202
HeadOn: Real-time Reenactment of Human Portrait Videos
We propose HeadOn, the first real-time source-to-target reenactment approach
for complete human portrait videos that enables transfer of torso and head
motion, face expression, and eye gaze. Given a short RGB-D video of the target
actor, we automatically construct a personalized geometry proxy that embeds a
parametric head, eye, and kinematic torso model. A novel real-time reenactment
algorithm employs this proxy to photo-realistically map the captured motion
from the source actor to the target actor. On top of the coarse geometric
proxy, we propose a video-based rendering technique that composites the
modified target portrait video via view- and pose-dependent texturing, and
creates photo-realistic imagery of the target actor under novel torso and head
poses, facial expressions, and gaze directions. To this end, we propose a
robust tracking of the face and torso of the source actor. We extensively
evaluate our approach and show significant improvements in enabling much
greater flexibility in creating realistic reenacted output videos.Comment: Video: https://www.youtube.com/watch?v=7Dg49wv2c_g Presented at
Siggraph'1
Recommended from our members
Highly automated method for facial expression synthesis
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.The synthesis of realistic facial expressions has been an unexplored area for computer graphics scientists. Over the last three decades, several different construction methods have been formulated in order to obtain natural graphic results. Despite these advancements, though, current techniques still require costly resources, heavy user intervention and specific training and outcomes are still not completely realistic. This thesis, therefore, aims to achieve an automated synthesis that will produce realistic facial expressions at a low cost.
This thesis, proposes a highly automated approach for achieving a realistic facial
expression synthesis, which allows for enhanced performance in speed (3 minutes
processing time maximum) and quality with a minimum of user intervention. It will also demonstrate a highly technical and automated method of facial feature detection, by allowing users to obtain their desired facial expression synthesis with minimal
physical input. Moreover, it will describe a novel approach to the normalization of the
illumination settings values between source and target images, thereby allowing the
algorithm to work accurately, even in different lighting conditions.
Finally, we will present the results obtained from the proposed techniques, together with our conclusions, at the end of the paper
Audio-driven Robot Upper-body Motion Synthesis
Body language is an important aspect of human communication, which an effective human-robot interaction interface should mimic well. The currently available robotic platforms are limited in their ability to automatically generate behaviours that align with their speech. In this paper, we developed a neural network based system that takes audio from a user as an input and generates upper-body gestures including head, hand and hip movements of the user on a humanoid robot, namely, Softbank Robotics’ Pepper. The developed system was evaluated quantitatively as well as qualitatively using web-surveys when driven by natural speech and synthetic speech. We particularly compared the impact of generic and person-specific neural network models on the quality of synthesised movements. We further investigated the relationships between quantitative and qualitative evaluations and examined how the speaker’s personality traits affect the synthesised movements
Generation of realistic human behaviour
As the use of computers and robots in our everyday lives increases so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another. The thesis begins by tackling the problem of speech-driven facial animation and proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person, whose lips move in accordance with the driving audio. Particular focus is also placed on a) generating spontaneous expression such as blinks, b) achieving audio-visual synchrony and c) transferring or producing natural head motion. The second problem addressed in this thesis is that of video-driven speech reconstruction, which aims at converting a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real-time allows its use in hearing assistive devices and telecommunications. The final work proposed in this thesis is a generic domain translation system, that can be used for any translation problem including those mapping across different modalities. The framework is made up of two networks performing translations in opposite directions and can be successfully applied to solve diverse sets of translation problems, including speech-driven animation and video-driven speech reconstruction.Open Acces
- …