FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
Speech-driven 3D facial animation synthesis has been a challenging task in both industry and research. Recent methods mostly rely on deterministic deep learning, meaning that for a given speech input the output is always the same. In reality, however, the non-verbal facial cues that reside throughout the face are non-deterministic in nature. In addition, the majority of approaches focus on 3D vertex-based datasets, and methods compatible with existing facial animation pipelines that use rigged characters are scarce. To address these issues, we present FaceDiffuser, a non-deterministic deep learning model for generating speech-driven facial animation that is trained on both 3D vertex- and blendshape-based datasets. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. To the best of our knowledge, we are the first to employ the diffusion method for the task of speech-driven 3D facial animation synthesis. We have run extensive objective and subjective analyses and show that our approach achieves results that are better than or comparable to the state-of-the-art methods. We also introduce a new in-house dataset based on a blendshape-based rigged character. We recommend watching the accompanying supplementary video. The code and the dataset will be publicly available.
Comment: Pre-print of the paper accepted at ACM SIGGRAPH MIG 202
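The following is a minimal sketch of the general idea the abstract describes: a denoiser conditioned on pre-extracted HuBERT-style audio features and trained with a standard DDPM-style noise-prediction objective on per-frame blendshape coefficients. All module names, dimensions, and the training step are illustrative assumptions, not the FaceDiffuser implementation.

```python
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    # Predicts the noise added to an animation sequence, conditioned on
    # frame-aligned audio features and a diffusion timestep.
    def __init__(self, audio_dim=768, anim_dim=52, hidden=256, steps=1000):
        super().__init__()
        self.time_embed = nn.Embedding(steps, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.anim_proj = nn.Linear(anim_dim, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, anim_dim)

    def forward(self, noisy_anim, audio_feats, t):
        # noisy_anim: (B, T, anim_dim), audio_feats: (B, T, audio_dim), t: (B,)
        h = self.anim_proj(noisy_anim) + self.audio_proj(audio_feats)
        h = h + self.time_embed(t).unsqueeze(1)
        h, _ = self.backbone(h)
        return self.head(h)  # predicted noise

def training_step(model, anim, audio_feats, alphas_cumprod):
    # Standard DDPM objective: corrupt the clean animation at a random
    # timestep and ask the model to recover the injected noise.
    b = anim.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (b,))
    a = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(anim)
    noisy = a.sqrt() * anim + (1 - a).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, audio_feats, t), noise)

# Toy usage with random tensors standing in for HuBERT features and blendshapes.
model = AudioConditionedDenoiser()
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
loss = training_step(model, torch.randn(2, 100, 52), torch.randn(2, 100, 768), alphas_cumprod)
```

At sampling time, such a model would be run over the usual reverse diffusion schedule with the audio features held fixed, which is what makes the output non-deterministic across different noise seeds.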
Pose-Guided Human Animation from a Single Image in the Wild
We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses. Existing pose transfer methods exhibit significant visual artifacts when applied to a novel scene, resulting in temporal inconsistency and failures to preserve the identity and textures of the person. To address these limitations, we design a compositional neural network that predicts the silhouette, garment labels, and textures. Each modular network is explicitly dedicated to a subtask that can be learned from synthetic data. At inference time, we utilize the trained network to produce a unified representation of appearance and its labels in UV coordinates, which remains constant across poses. The unified representation provides incomplete yet strong guidance for generating the appearance in response to the pose change. We use the trained network to complete the appearance and render it with the background. With these strategies, we are able to synthesize human animations that preserve the identity and appearance of the person in a temporally coherent way without any fine-tuning of the network on the testing scene. Experiments show that our method outperforms the state of the art in terms of synthesis quality, temporal coherence, and generalization ability.
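Below is an illustrative sketch of a compositional pipeline in the spirit of the abstract above: separate modules for the silhouette, garment labels, and texture, with visible appearance warped into a pose-invariant UV map and then completed. The module names, layer sizes, and sampling grid are hypothetical placeholders, not the paper's actual network.

```python
import torch
import torch.nn as nn

def head(in_ch, out_ch, hidden=32):
    # Tiny stand-in for a full encoder-decoder: conv -> ReLU -> conv.
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
        nn.Conv2d(hidden, out_ch, 3, padding=1),
    )

class PoseTransferPipeline(nn.Module):
    def __init__(self, n_labels=8):
        super().__init__()
        self.silhouette_net = head(3 + 3, 1)        # image + pose map -> person mask
        self.label_net = head(3 + 1, n_labels)      # image + mask -> garment labels
        self.texture_net = head(3 + n_labels, 3)    # partial UV texture -> completed texture

    def forward(self, image, pose_map, uv_lookup):
        # image, pose_map: (B, 3, H, W); uv_lookup: (B, H_uv, W_uv, 2) sampling grid
        mask = torch.sigmoid(self.silhouette_net(torch.cat([image, pose_map], 1)))
        labels = self.label_net(torch.cat([image, mask], 1)).softmax(dim=1)
        # Warp the visible appearance and labels into the pose-invariant UV
        # space, then complete the missing texels with the texture module.
        uv_partial = nn.functional.grid_sample(image, uv_lookup, align_corners=False)
        uv_labels = nn.functional.grid_sample(labels, uv_lookup, align_corners=False)
        uv_full = self.texture_net(torch.cat([uv_partial, uv_labels], 1))
        return mask, labels, uv_full

# Toy usage with random tensors; a real pipeline would derive uv_lookup from the pose.
net = PoseTransferPipeline()
mask, labels, uv_tex = net(torch.randn(1, 3, 128, 128),
                           torch.randn(1, 3, 128, 128),
                           torch.rand(1, 256, 256, 2) * 2 - 1)
```

The key property this structure mirrors is that the completed UV texture is computed once per identity and stays fixed across the pose sequence, which is what supports temporal coherence.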
TapMo: Shape-aware Motion Generation of Skeleton-free Characters
Previous motion generation methods are limited to pre-rigged 3D human models, hindering their application to the animation of various non-rigged characters. In this work, we present TapMo, a Text-driven Animation Pipeline for synthesizing Motion in a broad spectrum of skeleton-free 3D characters. The pivotal innovation in TapMo is its use of shape deformation-aware features as a condition to guide the diffusion model, thereby enabling the generation of mesh-specific motions for various characters. Specifically, TapMo comprises two main components: the Mesh Handle Predictor and the Shape-aware Diffusion Module. The Mesh Handle Predictor predicts skinning weights and clusters mesh vertices into adaptive handles for deformation control, which eliminates the need for traditional skeletal rigging. The Shape-aware Diffusion Module synthesizes motion with mesh-specific adaptations. This module employs text-guided motions and the mesh features extracted during the first stage, preserving the geometric integrity of the animations by accounting for the character's shape and deformation. Trained in a weakly-supervised manner, TapMo can accommodate a multitude of non-human meshes, both with and without associated text motions. We demonstrate the effectiveness and generalizability of TapMo through rigorous qualitative and quantitative experiments. Our results reveal that TapMo consistently outperforms existing auto-animation methods, delivering superior-quality animations for both seen and unseen heterogeneous 3D characters.
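A minimal sketch of the handle-based deformation idea described above: per-vertex soft weights over K learned handles, followed by linear blending of per-handle rigid transforms. The layer sizes, handle count, and function names are assumptions for illustration, not TapMo's actual architecture.

```python
import torch
import torch.nn as nn

class HandlePredictor(nn.Module):
    # Maps each vertex position to soft skinning weights over K handles,
    # replacing a hand-authored skeletal rig.
    def __init__(self, num_handles=16, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_handles),
        )

    def forward(self, verts):
        # verts: (B, V, 3) -> weights: (B, V, K)
        return self.mlp(verts).softmax(dim=-1)

def deform(verts, weights, rotations, translations):
    # Linear blend skinning with per-handle rigid transforms.
    # rotations: (B, K, 3, 3), translations: (B, K, 3)
    per_handle = torch.einsum('bkij,bvj->bvki', rotations, verts) + translations.unsqueeze(1)
    return torch.einsum('bvk,bvki->bvi', weights, per_handle)

# Usage: the weights come from the predictor; in a full pipeline the per-handle
# transforms would come from the motion diffusion stage (identity here).
verts = torch.randn(1, 1000, 3)
weights = HandlePredictor()(verts)
R = torch.eye(3).repeat(1, 16, 1, 1)
t = torch.zeros(1, 16, 3)
deformed = deform(verts, weights, R, t)  # (1, 1000, 3)
```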
Learning Speech-driven 3D Conversational Gestures from Video
We propose the first approach to automatically and jointly synthesize the synchronous 3D conversational body and hand gestures, as well as the 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem, since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation, as well as dense 3D face performance capture, to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.
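The sketch below shows one common way to realize the adversarial plausibility term described above: a discriminator scores whether a motion sequence is plausible given the paired audio features, and a non-saturating GAN loss trains both sides. Dimensions and module names are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class GestureDiscriminator(nn.Module):
    # Scores the realism of a body-motion sequence conditioned on its audio.
    def __init__(self, motion_dim=63, audio_dim=64, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(motion_dim + audio_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, motion, audio):
        # motion: (B, T, motion_dim), audio: (B, T, audio_dim)
        h, _ = self.encoder(torch.cat([motion, audio], dim=-1))
        return self.score(h[:, -1])  # one realism logit per sequence

def adversarial_losses(disc, real_motion, fake_motion, audio):
    # Discriminator: real paired sequences -> 1, generated ones -> 0.
    # Generator: fool the discriminator into scoring its output as real.
    bce = nn.functional.binary_cross_entropy_with_logits
    real = disc(real_motion, audio)
    fake = disc(fake_motion.detach(), audio)
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    g_loss = bce(disc(fake_motion, audio), torch.ones_like(real))
    return d_loss, g_loss
```

Conditioning the discriminator on the audio is what lets it penalize gestures that are individually plausible but mistimed with respect to the speech.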