Speech-Driven 3D Face Animation with Composite and Regional Facial Movements
Speech-driven 3D face animation poses significant challenges due to the
intricacy and variability inherent in human facial movements. This paper
emphasizes the importance of considering both the composite and regional
natures of facial movements in speech-driven 3D face animation. The composite
nature pertains to how speech-independent factors globally modulate
speech-driven facial movements along the temporal dimension. Meanwhile, the
regional nature alludes to the notion that facial movements are not globally
correlated but are actuated by local musculature along the spatial dimension.
It is thus indispensable to incorporate both natures for engendering vivid
animation. To address the composite nature, we introduce an adaptive modulation
module that employs arbitrary facial movements to dynamically adjust
speech-driven facial movements across frames on a global scale. To accommodate
the regional nature, our approach ensures that each constituent of the facial
features for every frame focuses on the local spatial movements of 3D faces.
Moreover, we present a non-autoregressive backbone for translating audio to 3D
facial movements, which maintains high-frequency nuances of facial movements
and facilitates efficient inference. Comprehensive experiments and user studies
demonstrate that our method surpasses contemporary state-of-the-art approaches
both qualitatively and quantitatively.

Comment: Accepted by MM 2023, 9 pages, 7 figures. arXiv admin note: text overlap with arXiv:2303.0979
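The abstract does not detail the modulation mechanism, so the following is a minimal PyTorch sketch of FiLM-style adaptive modulation under the assumption that speech-independent features predict a per-frame scale and shift applied globally to the speech-driven features; the class and dimension names are illustrative, not the authors' API.

```python
# A minimal sketch of per-frame adaptive modulation (FiLM-style), assuming
# speech-independent motion features drive a global scale/shift. Names are
# illustrative assumptions, not the paper's actual module.
import torch
import torch.nn as nn

class AdaptiveModulation(nn.Module):
    def __init__(self, feat_dim: int, style_dim: int):
        super().__init__()
        # Predict a per-frame scale (gamma) and shift (beta) from the
        # speech-independent motion features.
        self.to_gamma = nn.Linear(style_dim, feat_dim)
        self.to_beta = nn.Linear(style_dim, feat_dim)

    def forward(self, speech_feats, style_feats):
        # speech_feats: (batch, frames, feat_dim) speech-driven features
        # style_feats:  (batch, frames, style_dim) speech-independent features
        gamma = self.to_gamma(style_feats)
        beta = self.to_beta(style_feats)
        # Globally modulate every frame's speech-driven features.
        return (1.0 + gamma) * speech_feats + beta

mod = AdaptiveModulation(feat_dim=256, style_dim=64)
out = mod(torch.randn(2, 100, 256), torch.randn(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 256])
```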
Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control
Different people have different facial expressions while speaking
emotionally. A realistic facial animation system should consider such
identity-specific speaking styles and facial idiosyncrasies to achieve a high degree of naturalness and plausibility. Existing approaches to personalized speech-driven 3D facial animation either use one-hot identity labels or rely on person-specific models, which limits their scalability. We
present a personalized speech-driven expressive 3D facial animation synthesis
framework that models identity-specific facial motion as latent representations (called styles) and synthesizes novel animations given a speech input with
the target style for various emotion categories. Our framework is trained in an
end-to-end fashion and has a non-autoregressive encoder-decoder architecture
with three main components: expression encoder, speech encoder and expression
decoder. Since expressive facial motion includes both identity-specific style and speech-related content information, the expression encoder first disentangles facial motion sequences into style and content representations. Then, both the speech encoder and the expression decoder take the extracted style information as input to update transformer layer weights during the training phase. Our speech encoder also extracts phoneme labels and duration information to achieve better synchrony within the non-autoregressive synthesis mechanism. Through detailed experiments, we demonstrate that
our approach produces temporally coherent facial expressions from input speech
while preserving the speaking styles of the target identities.

Comment: 8 pages
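One plausible reading of "update transformer layer weights" is a style-conditioned low-rank weight delta, sketched below in PyTorch. This is an assumption for illustration; the paper's actual conditioning mechanism may differ, and all names (StyleConditionedLinear, rank) are hypothetical.

```python
# A hedged sketch of conditioning a linear layer's weights on a style code.
# This low-rank formulation is one possible mechanism, not the paper's.
import torch
import torch.nn as nn

class StyleConditionedLinear(nn.Module):
    def __init__(self, dim: int, style_dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        # Low-rank weight delta predicted from the style embedding.
        self.to_u = nn.Linear(style_dim, dim * rank)
        self.to_v = nn.Linear(style_dim, rank * dim)
        self.rank, self.dim = rank, dim

    def forward(self, x, style):
        # x: (batch, frames, dim); style: (batch, style_dim)
        u = self.to_u(style).view(-1, self.dim, self.rank)
        v = self.to_v(style).view(-1, self.rank, self.dim)
        delta = torch.bmm(u, v)  # (batch, dim, dim) per-identity delta
        return self.base(x) + torch.einsum('btd,bde->bte', x, delta)

layer = StyleConditionedLinear(dim=128, style_dim=32)
y = layer(torch.randn(2, 50, 128), torch.randn(2, 32))
print(y.shape)  # torch.Size([2, 50, 128])
```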
The Effect of Real-Time Constraints on Automatic Speech Animation
Machine learning has previously been applied successfully to speech-driven facial animation. To account for carry-over and anticipatory coarticulation, a common approach is to predict the facial pose using a symmetric window of acoustic speech that includes both past and future context. Using future context limits this approach for animating the faces of characters in real-time and networked applications, such as online gaming. An acceptable latency for conversational speech is 200ms, and typically network transmission times will consume a significant part of this. Consequently, we consider asymmetric windows by investigating the extent to which decreasing the future context affects the quality of predicted animation using both deep neural networks (DNNs) and bi-directional LSTM recurrent neural networks (BiLSTMs). Specifically, we investigate future contexts from 170ms (fully-symmetric) to 0ms (fully-asymmetric).
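As a rough illustration of the windowing described above, the sketch below builds symmetric and asymmetric context windows over acoustic frames, assuming a 10 ms frame step so that 170 ms of context corresponds to 17 frames; the function and variable names are illustrative.

```python
# A minimal sketch of (a)symmetric context windows over acoustic frames,
# assuming ~10 ms per frame. Names are illustrative.
import numpy as np

def context_windows(feats: np.ndarray, past: int, future: int) -> np.ndarray:
    """Stack past/future frames around each frame.

    feats:  (frames, feat_dim) acoustic features (e.g. filterbanks)
    past:   past context frames (carry-over coarticulation)
    future: future context frames (anticipatory coarticulation);
            0 gives a fully-asymmetric, zero-lookahead window.
    """
    padded = np.pad(feats, ((past, future), (0, 0)), mode='edge')
    windows = [padded[i:i + past + future + 1].ravel()
               for i in range(len(feats))]
    return np.stack(windows)

feats = np.random.randn(100, 40)                   # 100 frames, 40-dim
sym = context_windows(feats, past=17, future=17)   # ~170 ms both sides
asym = context_windows(feats, past=17, future=0)   # no future context
print(sym.shape, asym.shape)  # (100, 1400) (100, 720)
```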
PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features
Speech-driven 3D facial animation has improved considerably in recent years, yet most related works utilize only the acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues carry non-trivial information. Therefore, we present a novel framework, namely PMMTalk, using
complementary Pseudo Multi-Modal features for improving the accuracy of facial
animation. The framework entails three modules: PMMTalk encoder, cross-modal
alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder
employs the off-the-shelf talking head generation architecture and speech
recognition technology to extract visual and textual information from speech,
respectively. Subsequently, the cross-modal alignment module aligns the
audio-image-text features at the temporal and semantic levels. Then the PMMTalk decoder is employed to predict lip-synced facial blendshape coefficients. Contrary to
prior methods, PMMTalk only requires an additional random reference face image
but yields more accurate results. Additionally, it is artist-friendly as it
seamlessly integrates into standard animation production workflows by
introducing facial blendshape coefficients. Finally, given the scarcity of 3D
talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual
Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies
show that our approach outperforms the state of the art. We recommend watching the supplementary video.
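The abstract specifies alignment "at temporal and semantic levels" without formulas; the sketch below shows only the temporal plumbing, resampling image and text feature streams to the audio frame rate before fusion, under the assumption of fixed-rate streams. The real alignment module is presumably learned; all names are illustrative.

```python
# A hedged sketch of the temporal half of cross-modal alignment: resample
# visual/textual feature streams to the audio frame rate, then concatenate.
import torch
import torch.nn.functional as F

def align_to_audio(feats: torch.Tensor, n_audio_frames: int) -> torch.Tensor:
    # feats: (batch, frames, dim) at the source modality's frame rate.
    x = feats.transpose(1, 2)  # (batch, dim, frames)
    x = F.interpolate(x, size=n_audio_frames, mode='linear',
                      align_corners=False)
    return x.transpose(1, 2)   # (batch, n_audio_frames, dim)

audio = torch.randn(2, 120, 256)   # 120 audio frames
image = torch.randn(2, 30, 512)    # 30 video frames (talking-head branch)
text = torch.randn(2, 12, 128)     # 12 token features (ASR branch)
fused = torch.cat([audio,
                   align_to_audio(image, 120),
                   align_to_audio(text, 120)], dim=-1)
print(fused.shape)  # torch.Size([2, 120, 896])
```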
Speech Driven Expressive Animations
Current state-of-the-art lip-sync facial animation systems use vision-based performance capture methods which are highly resource-consuming. These techniques lack scalability and post-hoc customizability, whilst simpler and more automated alternatives often lack expressiveness. We propose an extension for a deep learning based speech-driven lip-sync facial synthesis system that allows for expressiveness and manual tweaking in the emotion space. Our model generates expressive animations by mapping recorded speech features into facial rig parameters. Our architecture consists of a conditional Variational Autoencoder conditioned on speech, whose latent space controls the facial expression during inference and is driven by predictions from a Speech Emotion Recognition module. This approach, to the best of our knowledge, has not been tried before in the literature. The results show that our Speech Emotion Recognition (SER) model is able to make meaningful predictions and generalize to unseen game speech utterances. Our user study shows that participants significantly prefer our model's animations when compared to animations generated from random emotions and a baseline neutral-emotion model.
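A minimal sketch of the described architecture follows: a conditional VAE whose decoder is conditioned on speech features and whose latent vector carries the expression. Dimensions and module names are assumptions for illustration; the SER module that supplies z at inference is only referenced in a comment.

```python
# A minimal sketch of a speech-conditioned VAE whose latent space carries
# expression, per the abstract's outline. Dimensions are assumptions.
import torch
import torch.nn as nn

class SpeechConditionedVAE(nn.Module):
    def __init__(self, rig_dim=50, speech_dim=128, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(rig_dim + speech_dim, 256),
                                 nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + speech_dim, 256), nn.ReLU(),
            nn.Linear(256, rig_dim))

    def forward(self, rig, speech):
        h = self.enc(torch.cat([rig, speech], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(torch.cat([z, speech], dim=-1)), mu, logvar

    def infer(self, z, speech):
        # At inference, z can come from an SER module's prediction
        # (or manual tweaking) to control the expressed emotion.
        return self.dec(torch.cat([z, speech], dim=-1))

vae = SpeechConditionedVAE()
recon, mu, logvar = vae(torch.randn(4, 50), torch.randn(4, 128))
print(recon.shape)  # torch.Size([4, 50])
```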
Speaker-independent speech animation using perceptual loss functions and synthetic data
We propose a real-time speaker-independent speech-to-facial animation system that predicts lip and jaw movements on a reference face for audio speech taken from any speaker. Our approach is motivated by two key observations: 1) Speaker-independent facial animation can be generated from phoneme labels, but to perform this automatically a speech recogniser is needed which, due to contextual look-ahead, introduces too much time lag. 2) Audio-driven speech animation can be performed in real-time but requires large, multi-speaker audio-visual speech datasets, of which there are few. We adopt a novel three-stage training procedure that leverages the advantages of each approach. First we train a phoneme-to-visual speech model from a large single-speaker audio-visual dataset. Next, we use this model to generate the synthetic visual component of a large multi-speaker audio dataset of which the video is not available. Finally, we learn an audio-to-visual speech mapping using the synthetic visual features as the target. Furthermore, we increase the realism of the predicted facial animation by introducing two perceptually-based loss functions that aim to improve mouth closures and openings. The proposed method and loss functions are evaluated objectively using mean square error, global variance and a new metric that measures the extent of mouth opening. Subjective tests show that our approach produces facial animation comparable to that produced from phoneme sequences and that improved mouth closures, particularly for bilabial closures, are achieved.
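The abstract names a mouth-opening metric and perceptual losses for closures and openings but gives no formulas, so the following vertical lip-distance version is an assumed stand-in, not the authors' definition; the landmark indices are hypothetical.

```python
# A hedged sketch of a mouth-opening term: vertical inner-lip distance,
# penalised directly so that closures (openings near zero) are respected.
# Landmark indices and shapes are illustrative assumptions.
import torch

UPPER_LIP, LOWER_LIP = 13, 17   # hypothetical inner-lip landmark indices

def mouth_opening(landmarks: torch.Tensor) -> torch.Tensor:
    # landmarks: (batch, frames, n_points, 2) predicted 2D lip shape
    return (landmarks[..., LOWER_LIP, 1]
            - landmarks[..., UPPER_LIP, 1]).abs()

def mouth_opening_loss(pred, target):
    # Plain MSE over all points under-weights the opening; this term
    # penalises the opening error itself.
    return (mouth_opening(pred) - mouth_opening(target)).abs().mean()

pred = torch.randn(2, 100, 38, 2)
target = torch.randn(2, 100, 38, 2)
print(mouth_opening_loss(pred, target).item())
```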
Prediction of Head Motion from Speech Waveforms with a Canonical-Correlation-Constrained Autoencoder
This study investigates the direct use of speech waveforms to predict head
motion for speech-driven head-motion synthesis, whereas the use of spectral
features such as MFCC as basic input features together with additional features
such as energy and F0 is common in the literature. We show that, rather than combining different features that originate from waveforms, it is more effective to use waveforms directly to predict the corresponding head motion. The challenge with the waveform-based approach is that waveforms contain a large amount of information irrelevant to predicting head motion, which hinders the
training of neural networks. To overcome the problem, we propose a
canonical-correlation-constrained autoencoder (CCCAE), where hidden layers are
trained to not only minimise the error but also maximise the canonical
correlation with head motion. Compared with an MFCC-based system, the proposed
system shows comparable performance in objective evaluation, and better performance in subjective evaluation.

Comment: head motion synthesis, speech-driven animation, deep canonically correlated autoencoder
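As a sketch of the CCCAE objective, the code below combines reconstruction error with a term that rewards correlation between the bottleneck embedding and head motion. The paper uses canonical correlation; the simple per-dimension correlation here is a simplified stand-in, and all dimensions and names are illustrative.

```python
# A hedged sketch of a correlation-constrained autoencoder loss: minimise
# reconstruction error while maximising correlation with head motion.
# Per-dimension correlation is a simplified stand-in for CCA.
import torch
import torch.nn as nn

def correlation_term(z: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
    # z, motion: (samples, dim) with matching dims after projection.
    z = (z - z.mean(0)) / (z.std(0) + 1e-8)
    m = (motion - motion.mean(0)) / (motion.std(0) + 1e-8)
    return (z * m).mean(0).sum() / z.shape[1]  # mean per-dim correlation

class CCAConstrainedAE(nn.Module):
    def __init__(self, in_dim=1600, bottleneck=64, motion_dim=6):
        super().__init__()
        # in_dim: flattened window of waveform-derived input (assumption).
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.Tanh(),
                                 nn.Linear(256, bottleneck))
        self.dec = nn.Sequential(nn.Linear(bottleneck, 256), nn.Tanh(),
                                 nn.Linear(256, in_dim))
        self.proj = nn.Linear(bottleneck, motion_dim)  # align dims

    def loss(self, wave_feats, head_motion, alpha=0.5):
        z = self.enc(wave_feats)
        rec_loss = (self.dec(z) - wave_feats).pow(2).mean()
        corr = correlation_term(self.proj(z), head_motion)
        return rec_loss - alpha * corr  # min error, max correlation

model = CCAConstrainedAE()
print(model.loss(torch.randn(32, 1600), torch.randn(32, 6)).item())
```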
EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation
Speech-driven 3D face animation aims to generate realistic facial expressions
that match the speech content and emotion. However, existing methods often
neglect emotional facial expressions or fail to disentangle them from speech
content. To address this issue, this paper proposes an end-to-end neural
network to disentangle different emotions in speech so as to generate rich 3D
facial expressions. Specifically, we introduce the emotion disentangling
encoder (EDE) to disentangle the emotion and content in the speech by
cross-reconstructed speech signals with different emotion labels. Then an
emotion-guided feature fusion decoder is employed to generate a 3D talking face
with enhanced emotion. The decoder is driven by the disentangled identity,
emotional, and content embeddings so as to generate controllable personal and
emotional styles. Finally, considering the scarcity of the 3D emotional talking
face data, we resort to the supervision of facial blendshapes, which enables
the reconstruction of plausible 3D faces from 2D emotional data, and contribute
a large-scale 3D emotional talking face dataset (3D-ETF) to train the network.
Our experiments and user studies demonstrate that our approach outperforms
state-of-the-art methods and exhibits more diverse facial movements. We
recommend watching the supplementary video:
https://ziqiaopeng.github.io/emotalk

Comment: Accepted by ICCV 2023
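The cross-reconstruction idea behind the emotion disentangling encoder can be sketched as follows: two clips exchange emotion embeddings and the decoder must reconstruct the corresponding target. The encoders and decoder below are placeholders, not the paper's networks, and the blendshape dimension (52) is an assumption.

```python
# A minimal sketch of emotion/content disentanglement by cross-
# reconstruction: content from clip A, emotion from clip B, supervised by
# the matching target. All modules and dims are illustrative placeholders.
import torch
import torch.nn as nn

content_enc = nn.GRU(80, 64, batch_first=True)   # speech content branch
emotion_enc = nn.GRU(80, 16, batch_first=True)   # speech emotion branch
decoder = nn.Linear(64 + 16, 52)                 # to blendshape coefficients

def cross_recon_loss(speech_a, speech_b, target_ab):
    # speech_a provides content, speech_b provides emotion; target_ab is
    # ground truth for "content of a spoken with emotion of b".
    c, _ = content_enc(speech_a)                 # (batch, frames, 64)
    e, _ = emotion_enc(speech_b)
    e = e.mean(dim=1, keepdim=True).expand(-1, c.shape[1], -1)
    pred = decoder(torch.cat([c, e], dim=-1))
    return (pred - target_ab).pow(2).mean()

loss = cross_recon_loss(torch.randn(2, 100, 80), torch.randn(2, 100, 80),
                        torch.randn(2, 100, 52))
print(loss.item())
```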