Articulatory features for speech-driven head motion synthesis
This study investigates the use of articulatory features for speech-driven head motion synthesis, as opposed to the prosodic features such as F0 and energy that have mainly been used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that is expected to have a close link with head movement. Head and articulatory movements were acquired by electromagnetic articulography (EMA) and recorded synchronously with speech. The measured articulatory data were compared with articulatory features predicted from speech by an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech from 12 people shows that the articulatory features are more strongly correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that head motion synthesised from articulatory features achieves higher correlation with the original head motion than head motion synthesised from prosodic features alone. Index Terms: head motion synthesis, articulatory features, canonical correlation analysis, acoustic-to-articulatory mapping
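As a rough illustration of the correlation analysis described in this abstract, the sketch below computes canonical correlations between two time-aligned feature matrices with scikit-learn; the array shapes, feature names, and random data are placeholders, not the paper's setup.

```python
# Minimal sketch: canonical correlation between speech-derived features and
# head rotation, in the spirit of the CCA comparison described above.
# Feature extraction and alignment are assumed to have been done already;
# shapes and names here are illustrative only.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
articulatory = rng.standard_normal((1000, 14))   # e.g. EMA coil trajectories per frame
head_rotation = rng.standard_normal((1000, 3))   # pitch, yaw, roll per frame

cca = CCA(n_components=3)
U, V = cca.fit_transform(articulatory, head_rotation)

# Canonical correlations: per-component correlation of the projected variates.
corrs = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(3)]
print("canonical correlations:", np.round(corrs, 3))
```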
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
Synthesizing realistic videos according to a given speech is still an open
challenge. Previous works have been plagued by issues such as inaccurate lip
shape generation and poor image quality. The key reason is that the input
speech mainly drives the motion and appearance of only limited facial areas
(e.g., the lip area). Therefore, directly learning a mapping function from speech
to the entire head image is prone to ambiguity, particularly when using a short
video for training. We thus propose a decomposition-synthesis-composition
framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive
and speech-insensitive motion/appearance to facilitate effective learning from
limited training data, resulting in the generation of natural-looking videos.
First, given a fixed head pose (i.e., canonical space), we present a
speech-driven implicit model for lip image generation which concentrates on
learning speech-sensitive motion and appearance. Next, to model the major
speech-insensitive motion (i.e., head movement), we introduce a geometry-aware
mutual explicit mapping (GAMEM) module that establishes geometric mappings
between different head poses. This allows us to paste lip images generated in
the canonical space onto head images with arbitrary poses and synthesize
talking videos with natural head movements. In addition, a Blend-Net and a
contrastive sync loss are introduced to enhance the overall synthesis
performance. Quantitative and qualitative results on three benchmarks
demonstrate that our model can be trained on a video of just a few minutes in
length and achieve state-of-the-art performance in both visual quality and
speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip
Rule-based lip-syncing algorithm for virtual character in voice chatbot
Virtual characters have changed the way we interact with computers. The underlying key to a believable virtual character is accurate real-time synchronization between the visual (lip movements) and the audio (speech). This work develops a 3D model for the virtual character and implements a rule-based lip-syncing algorithm for the virtual character's lip movements. We use the Jacob voice chatbot as the platform for the design and implementation of the virtual character; audio-driven articulation and manual mapping methods are therefore considered suitable for real-time applications such as Jacob. We evaluate the proposed virtual character using the hedonic motivation system adoption model (HMSAM) with 70 users. The HMSAM score for behavioral intention to use is 91.74%, and for immersion it is 72.95%. The average score across all aspects of the HMSAM is 85.50%. The rule-based lip-syncing algorithm accurately synchronizes the lip movements with the Jacob voice chatbot's speech in real time.
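A toy sketch of what a rule-based lip-syncing step can look like, assuming phoneme labels with durations arrive from the chatbot's TTS pipeline; the phoneme set, viseme names, and timings are illustrative and not taken from the paper.

```python
# Toy rule-based lip-sync step: map phonemes from the TTS stream to viseme
# (mouth-shape) identifiers and keyframe start times. All labels, shapes,
# and durations below are illustrative placeholders.
PHONEME_TO_VISEME = {
    "AA": "open",      "IY": "wide",      "UW": "round",
    "M": "closed",     "B": "closed",     "P": "closed",
    "F": "teeth_lip",  "V": "teeth_lip",  "sil": "rest",
}

def phonemes_to_keyframes(phonemes):
    """phonemes: list of (label, duration_ms) -> list of (viseme, start_ms)."""
    keyframes, t = [], 0
    for label, duration_ms in phonemes:
        keyframes.append((PHONEME_TO_VISEME.get(label, "rest"), t))
        t += duration_ms
    return keyframes

print(phonemes_to_keyframes([("HH", 60), ("AA", 120), ("M", 90), ("sil", 100)]))
```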
A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild
In this work, we investigate the problem of lip-syncing a talking face video
of an arbitrary identity to match a target speech segment. Current works excel
at producing accurate lip movements on a static image or videos of specific
people seen during the training phase. However, they fail to accurately morph
the lip movements of arbitrary identities in dynamic, unconstrained talking
face videos, resulting in significant parts of the video being out-of-sync with
the new audio. We identify key reasons for this failure and resolve them by
learning from a powerful lip-sync discriminator. Next, we propose new,
rigorous evaluation benchmarks and metrics to accurately measure lip
synchronization in unconstrained videos. Extensive quantitative evaluations on
our challenging benchmarks show that the lip-sync accuracy of the videos
generated by our Wav2Lip model is almost as good as real synced videos. We
provide a demo video clearly showing the substantial impact of our Wav2Lip
model and evaluation benchmarks on our website:
\url{cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild}.
The code and models are released at this GitHub repository:
\url{github.com/Rudrabha/Wav2Lip}. You can also try out the interactive demo at
this link: \url{bhaasha.iiit.ac.in/lipsync}.
Comment: 9 pages (including references), 3 figures. Accepted in ACM Multimedia, 2020.
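A hedged sketch of how a frozen lip-sync expert can supervise a generator, in the spirit of the approach described above: the expert scores audio/lip-window agreement and the generator is penalised for low scores. The encoder interfaces, tensor shapes, and loss form are assumptions rather than the released Wav2Lip code.

```python
# Sketch: a pre-trained sync expert (frozen audio and video encoders, assumed
# to have requires_grad=False) supervises generated lip frames by pushing the
# audio/video cosine agreement towards 1. Interfaces here are assumptions.
import torch
import torch.nn.functional as F

def expert_sync_loss(audio_encoder, video_encoder,
                     mel_window: torch.Tensor,
                     generated_lip_frames: torch.Tensor) -> torch.Tensor:
    """mel_window: (B, 1, 80, T) mel chunk; generated_lip_frames: (B, C, H, W)."""
    with torch.no_grad():                      # real audio needs no gradient
        a = F.normalize(audio_encoder(mel_window), dim=-1)
    # Gradients flow through the generated frames into the (frozen) video encoder.
    v = F.normalize(video_encoder(generated_lip_frames), dim=-1)
    sync_prob = (a * v).sum(dim=-1).clamp(0, 1)           # cosine agreement in [0, 1]
    # Penalise low sync probability for every generated window.
    return F.binary_cross_entropy(sync_prob, torch.ones_like(sync_prob))
```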
Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape
The creation of lifelike speech-driven 3D facial animation requires a natural
and precise synchronization between audio input and facial expressions.
However, existing works still fail to render shapes with flexible head poses
and natural facial details (e.g., wrinkles). This limitation is mainly due to
two aspects: 1) Collecting a training set with detailed 3D facial shapes is
highly expensive. This scarcity of detailed shape annotations hinders the
training of models with expressive facial animation. 2) Compared to mouth
movement, head pose is much less correlated with speech content. Consequently,
modeling mouth movement and head pose concurrently limits the controllability
of facial movement. To address these challenges, we
introduce VividTalker, a new framework designed to facilitate speech-driven 3D
facial animation characterized by flexible head pose and natural facial
details. Specifically, we explicitly disentangle facial animation into head
pose and mouth movement and encode them separately into discrete latent spaces.
Then, these attributes are generated through an autoregressive process
leveraging a window-based Transformer architecture. To augment the richness of
3D facial animation, we construct a new 3D dataset with detailed shapes and
learn to synthesize facial details in line with speech content. Extensive
quantitative and qualitative experiments demonstrate that VividTalker
outperforms state-of-the-art methods, resulting in vivid and realistic
speech-driven 3D facial animation.
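A schematic sketch of the kind of pipeline this abstract outlines: motion is represented as discrete codebook indices and generated autoregressively from audio features with a windowed Transformer decoder. Module names, dimensions, and the greedy decoding are assumptions for illustration only.

```python
# Schematic sketch (not the paper's code): autoregressive generation of discrete
# motion codes from audio features using a windowed Transformer decoder.
import torch
import torch.nn as nn

class WindowedMotionGenerator(nn.Module):
    def __init__(self, n_codes=512, d_model=256, window=32):
        super().__init__()
        self.window = window
        self.token_emb = nn.Embedding(n_codes, d_model)
        self.audio_proj = nn.Linear(80, d_model)          # mel features -> model dim
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_codes)

    @torch.no_grad()
    def generate(self, audio_feats, start_token=0):
        """audio_feats: (B, T, 80). Returns (B, T) discrete motion codes."""
        B, T, _ = audio_feats.shape
        tokens = torch.full((B, 1), start_token, dtype=torch.long,
                            device=audio_feats.device)
        memory = self.audio_proj(audio_feats)
        for _ in range(T):
            ctx = tokens[:, -self.window:]                # causal window of past codes
            h = self.decoder(self.token_emb(ctx), memory)
            next_code = self.head(h[:, -1]).argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_code], dim=1)
        return tokens[:, 1:]
```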
RADIO: Reference-Agnostic Dubbing Video Synthesis
One of the most challenging problems in audio-driven talking head generation
is achieving high-fidelity detail while ensuring precise synchronization. Given
only a single reference image, extracting meaningful identity attributes
becomes even more challenging, often causing the network to mirror the facial
and lip structures too closely. To address these issues, we introduce RADIO, a
framework engineered to yield high-quality dubbed videos regardless of the pose
or expression in reference images. The key is to modulate the decoder layers
using a latent space composed of audio and reference features. Additionally, we
incorporate ViT blocks into the decoder to emphasize high-fidelity details,
especially in the lip region. Our experimental results demonstrate that RADIO
displays high synchronization without the loss of fidelity. Especially in harsh
scenarios where the reference frame deviates significantly from the ground
truth, our method outperforms state-of-the-art methods, highlighting its
robustness. Pre-trained models and code will be made public after the review.
Comment: Under review.
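One generic way to "modulate the decoder layers using a latent space composed of audio and reference features" is FiLM/AdaIN-style conditioning; the sketch below shows that pattern. The block structure, dimensions, and fusion by concatenation are assumptions, not RADIO's architecture.

```python
# Generic FiLM/AdaIN-style modulation of a decoder block by an audio+reference
# latent. All sizes and the concatenation fusion are illustrative assumptions.
import torch
import torch.nn as nn

class ModulatedDecoderBlock(nn.Module):
    def __init__(self, channels=256, audio_dim=128, ref_dim=256):
        super().__init__()
        self.to_scale_shift = nn.Linear(audio_dim + ref_dim, 2 * channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm2d(channels, affine=False)

    def forward(self, feat, audio_emb, ref_emb):
        """feat: (B, C, H, W); audio_emb: (B, audio_dim); ref_emb: (B, ref_dim)."""
        latent = torch.cat([audio_emb, ref_emb], dim=-1)
        scale, shift = self.to_scale_shift(latent).chunk(2, dim=-1)
        h = self.norm(self.conv(feat))
        # Broadcast the per-channel modulation over the spatial dimensions.
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```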
DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
Diffusion models have shown remarkable success in a variety of downstream
generative tasks, yet remain under-explored in the important and challenging
expressive talking head generation. In this work, we propose a DreamTalk
framework to fill this gap, which employs meticulous design to unlock the
potential of diffusion models in generating expressive talking heads.
Specifically, DreamTalk consists of three crucial components: a denoising
network, a style-aware lip expert, and a style predictor. The diffusion-based
denoising network is able to consistently synthesize high-quality audio-driven
face motions across diverse expressions. To enhance the expressiveness and
accuracy of lip motions, we introduce a style-aware lip expert that can guide
lip-sync while being mindful of the speaking styles. To eliminate the need for
expression reference video or text, an extra diffusion-based style predictor is
utilized to predict the target expression directly from the audio. By this
means, DreamTalk can harness powerful diffusion models to generate expressive
faces effectively and reduce the reliance on expensive style references.
Experimental results demonstrate that DreamTalk is capable of generating
photo-realistic talking faces with diverse speaking styles and achieving
accurate lip motions, surpassing existing state-of-the-art counterparts.
Comment: Project Page: https://dreamtalk-project.github.io
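For readers unfamiliar with the diffusion machinery this abstract relies on, the sketch below shows a single standard DDPM reverse step for a face-motion tensor conditioned on audio and style embeddings; the denoiser signature and noise schedule are placeholders, not DreamTalk's implementation.

```python
# Standard DDPM reverse (denoising) step for conditioned face-motion generation.
# `denoiser` predicts the noise added at step t given audio/style conditions;
# its interface and the beta schedule are assumptions for illustration.
import torch

def ddpm_reverse_step(denoiser, x_t, t, audio_cond, style_cond, betas):
    """x_t: noisy motion tensor; t: integer timestep; betas: (T,) noise schedule."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    eps_hat = denoiser(x_t, t, audio_cond, style_cond)    # predicted noise
    coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_hat) / torch.sqrt(alphas[t]) # DDPM posterior mean
    if t == 0:
        return mean
    noise = torch.randn_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise            # sigma_t^2 = beta_t choice
```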