HeadOn: Real-time Reenactment of Human Portrait Videos
We propose HeadOn, the first real-time source-to-target reenactment approach
for complete human portrait videos that enables transfer of torso and head
motion, face expression, and eye gaze. Given a short RGB-D video of the target
actor, we automatically construct a personalized geometry proxy that embeds a
parametric head, eye, and kinematic torso model. A novel real-time reenactment
algorithm employs this proxy to photo-realistically map the captured motion
from the source actor to the target actor. On top of the coarse geometric
proxy, we propose a video-based rendering technique that composites the
modified target portrait video via view- and pose-dependent texturing, and
creates photo-realistic imagery of the target actor under novel torso and head
poses, facial expressions, and gaze directions. To this end, we propose a
robust tracking of the face and torso of the source actor. We extensively
evaluate our approach and show significant improvements in enabling much
greater flexibility in creating realistic reenacted output videos.
Comment: Video: https://www.youtube.com/watch?v=7Dg49wv2c_g Presented at
Siggraph'1
Audiovisual Generation of Social Attitudes from Neutral Stimuli
The focus of this study is the generation of expressive audiovisual speech from neutral utterances for 3D virtual actors. Taking into account the segmental and suprasegmental aspects of audiovisual speech, we propose and compare several computational frameworks for the generation of expressive speech and face animation. We notably compare a standard frame-based conversion approach with two other methods that postulate the existence of global prosodic audiovisual patterns that are characteristic of social attitudes. The proposed approaches are tested on a database of "Exercises in Style" [1] performed by two semi-professional actors, and results are evaluated using crowdsourced perceptual tests. The first test performs a qualitative validation of the animation platform, while the second is a comparative study between several expressive speech generation methods. We evaluate how the expressiveness of our audiovisual performances is perceived in comparison to resynthesized original utterances and the outputs of a purely frame-based conversion system.
Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding
Modeling virtual agents with behavior style is one factor for personalizing
human-agent interaction. We propose an efficient yet effective machine learning
approach to synthesize gestures driven by prosodic features and text in the
style of different speakers including those unseen during training. Our model
performs zero shot multimodal style transfer driven by multimodal data from the
PATS database containing videos of various speakers. We view style as being
pervasive while speaking; it colors the expressivity of communicative behaviors,
while speech content is carried by multimodal signals and text. This
disentanglement of content and style allows us to directly infer the
style embedding even of a speaker whose data are not part of the training phase,
without requiring any further training or fine-tuning. The first goal of our
model is to generate the gestures of a source speaker based on the content of
two audio and text modalities. The second goal is to condition the source
speaker predicted gestures on the multimodal behavior style embedding of a
target speaker. The third goal is to allow zero shot style transfer of speakers
unseen during training without retraining the model. Our system consists of:
(1) a speaker style encoder network that learns to generate a fixed-dimensional
speaker style embedding from a target speaker's multimodal data and (2) a
sequence to sequence synthesis network that synthesizes gestures based on the
content of the input modalities of a source speaker and conditioned on the
speaker style embedding. We show that our model can synthesize gestures of
a source speaker and transfer the knowledge of target speaker style variability
to the gesture generation task in a zero shot setup. We convert the 2D gestures
to 3D poses and produce 3D animations. We conduct objective and subjective
evaluations to validate our approach and compare it with a baseline.
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Generating talking head videos from a face image and a piece of speech
audio still poses many challenges, i.e., unnatural head movement, distorted
expression, and identity modification. We argue that these issues are mainly
because of learning from the coupled 2D motion fields. On the other hand,
explicitly using 3D information also suffers from problems of stiff expression and
incoherent video. We present SadTalker, which generates 3D motion coefficients
(head pose, expression) of the 3DMM from audio and implicitly modulates a novel
3D-aware face render for talking head generation. To learn the realistic motion
coefficients, we explicitly model the connections between audio and different
types of motion coefficients individually. Precisely, we present ExpNet to
learn the accurate facial expression from audio by distilling both coefficients
and 3D-rendered faces. As for the head pose, we design PoseVAE via a
conditional VAE to synthesize head motion in different styles. Finally, the
generated 3D motion coefficients are mapped to the unsupervised 3D keypoints
space of the proposed face render to synthesize the final video. We conduct
extensive experiments to show the superiority of our method in terms of motion and
video quality.
Comment: Project page: https://sadtalker.github.i
A High-Fidelity Open Embodied Avatar with Lip Syncing and Expression Capabilities
Embodied avatars as virtual agents have many applications and provide
benefits over disembodied agents, allowing non-verbal social and interactional
cues to be leveraged, in a similar manner to how humans interact with each
other. We present an open embodied avatar built upon the Unreal Engine that can
be controlled via a simple Python programming interface. The avatar has lip
syncing (phoneme control), head gesture and facial expression (using either
facial action units or cardinal emotion categories) capabilities. We release
code and models to illustrate how the avatar can be controlled like a puppet or
used to create a simple conversational agent using public application
programming interfaces (APIs). GitHub link:
https://github.com/danmcduff/AvatarSim
Comment: International Conference on Multimodal Interaction (ICMI 2019
ChatAnything: Facetime Chat with LLM-Enhanced Personas
In this technical report, we target generating anthropomorphized personas for
LLM-based characters in an online manner, including visual appearance,
personality and tones, with only text descriptions. To achieve this, we first
leverage the in-context learning capability of LLMs for personality generation
by carefully designing a set of system prompts. We then propose two novel
concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for
diverse voice and appearance generation. For MoV, we utilize the text-to-speech
(TTS) algorithms with a variety of pre-defined tones and select the most
matching one based on the user-provided text description automatically. For
MoD, we combine the recent popular text-to-image generation techniques and
talking head algorithms to streamline the process of generating talking
objects. We term the whole framework ChatAnything. With it, users can
animate anything with any anthropomorphic persona using just
a few text inputs. However, we have observed that the anthropomorphic objects
produced by current generative models are often undetectable by pre-trained
face landmark detectors, leading to failure of the face motion generation, even
if these faces possess human-like appearances, because such images are rarely
seen during training (i.e., they are OOD samples). To address this issue, we
incorporate pixel-level guidance to infuse human face landmarks during the
image generation phase. To benchmark these metrics, we have built an evaluation
dataset. Based on it, we verify that the detection rate of the face landmark is
significantly increased from 57.0% to 92.5%, thus allowing automatic face
animation based on generated speech content. The code and more results can be
found at https://chatanything.github.io/
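The tone-selection step of MoV described above can be illustrated with a minimal sketch. The abstract does not say how the "most matching" tone is chosen, so this example assumes a simple bag-of-words cosine similarity between the user's description and a short description of each tone. The tone names, their descriptions, and the function `select_tone` are hypothetical and are not taken from the ChatAnything code.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical tone catalogue: names and descriptions are illustrative only.
TONES = {
    "warm_female": "warm gentle friendly female voice",
    "deep_male": "deep calm serious male voice",
    "child": "bright playful high-pitched child voice",
}

def bag_of_words(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tone(description, tones=TONES):
    """Return the name of the tone whose description best matches the text.

    Assumption: word overlap stands in for whatever matching the authors use.
    """
    query = bag_of_words(description)
    return max(tones, key=lambda name: cosine(query, bag_of_words(tones[name])))

print(select_tone("a playful child with a bright voice"))
```

A real system would more plausibly compare learned text embeddings; word overlap is used here only to keep the sketch self-contained.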
FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
Speech-driven 3D facial animation synthesis has been a challenging task both
in industry and research. Recent methods mostly focus on deterministic deep
learning methods, meaning that given a speech input, the output is always the
same. However, in reality, the non-verbal facial cues that reside throughout
the face are non-deterministic in nature. In addition, the majority of
approaches focus on 3D vertex-based datasets, and methods that are compatible
with existing facial animation pipelines with rigged characters are scarce. To
address these issues, we present FaceDiffuser, a non-deterministic deep
learning model to generate speech-driven facial animations that is trained with
both 3D vertex-based and blendshape-based datasets. Our method is based on the
diffusion technique and uses the pre-trained large speech representation model
HuBERT to encode the audio input. To the best of our knowledge, we are the
first to employ the diffusion method for the task of speech-driven 3D facial
animation synthesis. We have run extensive objective and subjective analyses
and show that our approach achieves better or comparable results in comparison
to the state-of-the-art methods. We also introduce a new in-house dataset
based on a blendshape-based rigged character. We recommend watching the
accompanying supplementary video. The code and the dataset will be publicly
available.
Comment: Pre-print of the paper accepted at ACM SIGGRAPH MIG 202
Characterization of Audiovisual Dramatic Attitudes
In this work we explore the capability of audiovisual parameters (such as voice frequency, rhythm, head motion or facial expressions) to discriminate among different dramatic attitudes. We extract the audiovisual parameters from an acted corpus of attitudes and structure them as frame-, syllable-, and sentence-level features. Using Linear Discriminant Analysis classifiers, we show that sentence-level features present a higher discriminating rate among the attitudes and are less dependent on the speaker than frame- and syllable-level features. We also compare the classification results with the perceptual evaluation tests, showing that voice frequency is correlated to the perceptual results for all attitudes, while other features, such as head motion, contribute differently, depending both on the attitude and the speaker.
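The three feature levels mentioned above can be pictured with a small sketch. The abstract does not state which statistics are computed at each level, so the per-syllable mean and the sentence-level mean and range used here are assumptions, as are the function name `structure_features` and the input layout (one value per frame plus syllable boundaries as frame indices).

```python
from statistics import mean

def structure_features(frames, syllable_bounds):
    """Group one per-frame parameter track into three feature levels.

    frames: per-frame values of one parameter (e.g. voice frequency in Hz).
    syllable_bounds: (start, end) frame indices per syllable, end exclusive.
    The chosen statistics (mean, range) are illustrative assumptions.
    """
    syllable_level = [mean(frames[start:end]) for start, end in syllable_bounds]
    sentence_level = {
        "mean": mean(frames),
        "range": max(frames) - min(frames),
    }
    return {
        "frame": list(frames),
        "syllable": syllable_level,
        "sentence": sentence_level,
    }

# Toy utterance: four frames split into two syllables of two frames each.
features = structure_features([100, 110, 120, 130], [(0, 2), (2, 4)])
print(features["syllable"], features["sentence"])
```

Feature vectors built this way, one per level, are the kind of input a classifier such as Linear Discriminant Analysis could then be trained on.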