Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
Humans involuntarily tend to infer parts of the conversation from lip
movements when the speech is absent or corrupted by external noise. In this
work, we explore the task of lip to speech synthesis, i.e., learning to
generate natural speech given only the lip movements of a speaker.
Acknowledging the importance of contextual and speaker-specific cues for
accurate lip-reading, we take a different path from existing works. We focus on
learning accurate lip sequences to speech mappings for individual speakers in
unconstrained, large vocabulary settings. To this end, we collect and release a
large-scale benchmark dataset, the first of its kind, specifically to train and
evaluate the single-speaker lip to speech task in natural settings. We propose
a novel approach with key design choices to achieve accurate, natural lip to
speech synthesis in such unconstrained scenarios for the first time. Extensive
evaluation using quantitative and qualitative metrics, together with human
evaluation, shows that our method is four times more intelligible than previous
works in this
space. Please check out our demo video for a quick overview of the paper,
method, and qualitative results.
https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be
Comment: 10 pages (including references), 5 figures, Accepted in CVPR, 202
DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
Diffusion models have shown remarkable success in a variety of downstream
generative tasks, yet remain under-explored for the important and challenging
task of expressive talking head generation. In this work, we propose DreamTalk,
a framework that fills this gap through a careful design that unlocks the
potential of diffusion models in generating expressive talking heads.
Specifically, DreamTalk consists of three crucial components: a denoising
network, a style-aware lip expert, and a style predictor. The diffusion-based
denoising network is able to consistently synthesize high-quality audio-driven
face motions across diverse expressions. To enhance the expressiveness and
accuracy of lip motions, we introduce a style-aware lip expert that can guide
lip-sync while being mindful of the speaking styles. To eliminate the need for
expression reference video or text, an extra diffusion-based style predictor is
utilized to predict the target expression directly from the audio. In this
way, DreamTalk can harness powerful diffusion models to generate expressive
faces effectively and reduce the reliance on expensive style references.
Experimental results demonstrate that DreamTalk is capable of generating
photo-realistic talking faces with diverse speaking styles and achieving
accurate lip motions, surpassing existing state-of-the-art counterparts.
Comment: Project Page: https://dreamtalk-project.github.i
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion
Speech-driven 3D facial animation has gained significant attention for its
ability to create realistic and expressive facial animations in 3D space based
on speech. Learning-based methods have shown promising progress in achieving
accurate facial motion synchronized with speech. However, the one-to-many nature
of speech-to-3D facial synthesis has not been fully explored: while the lips
synchronize accurately with the speech content, other facial attributes beyond
speech-related motions are variable with respect to the speech. To account for
the potential variance in the facial attributes for a single speech input, we
propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
DF-3DFace captures the complex one-to-many relationships between speech and 3D
face based on diffusion. It concurrently achieves aligned lip motion by
exploiting audio-mesh synchronization and masked conditioning. Furthermore, the
proposed method jointly models identity and pose in addition to facial motions
so that it can generate 3D face animation without requiring a reference
identity mesh and produce natural head poses. We contribute a new large-scale
3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in
identities, poses, and facial motions of 3D face mesh. Extensive experiments
demonstrate that our method successfully generates highly variable facial
shapes and motions from speech and simultaneously achieves more realistic
facial animation than the state-of-the-art methods.
ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment
The objective of stylized speech-driven facial animation is to create
animations that encapsulate specific emotional expressions. Existing methods
often depend on pre-established emotional labels or facial expression
templates, which may limit the necessary flexibility for accurately conveying
user intent. In this research, we introduce a technique that enables the
control of arbitrary styles by leveraging natural language as emotion prompts.
This technique presents benefits in terms of both flexibility and
user-friendliness. To realize this objective, we initially construct a
Text-Expression Alignment Dataset (TEAD), wherein each facial expression is
paired with several prompt-like descriptions. We propose an innovative automatic
annotation method, supported by Large Language Models (LLMs), to expedite the
dataset construction, thereby eliminating the substantial expense of manual
annotation. Following this, we utilize TEAD to train a CLIP-based model, termed
ExpCLIP, which encodes text and facial expressions into semantically aligned
style embeddings. The embeddings are subsequently integrated into the facial
animation generator to yield expressive and controllable facial animations.
Given the limited diversity of facial emotions in existing speech-driven facial
animation training data, we further introduce an effective Expression Prompt
Augmentation (EPA) mechanism to enable the animation generator to support
unprecedented richness in style control. Comprehensive experiments illustrate
that our method accomplishes expressive facial animation generation and offers
enhanced flexibility in effectively conveying the desired style.
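The abstract does not spell out ExpCLIP's training objective. As a rough illustration of what "CLIP-based" alignment of text and expression embeddings usually involves, the sketch below computes the symmetric contrastive loss used by CLIP on paired embeddings. The function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions for illustration, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_cross_entropy(logits, targets):
    """Mean softmax cross-entropy over rows, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logits)


def clip_style_loss(text_emb, expr_emb, temperature=0.07):
    """Symmetric contrastive loss: the i-th text should match the i-th expression.

    The temperature value is an assumed hyperparameter (hypothetical here).
    """
    text = [l2_normalize(v) for v in text_emb]
    expr = [l2_normalize(v) for v in expr_emb]
    # Similarity matrix: rows index text prompts, columns index expressions.
    logits = [
        [sum(a * b for a, b in zip(t, e)) / temperature for e in expr]
        for t in text
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    targets = list(range(len(text)))
    text_to_expr = mean_cross_entropy(logits, targets)
    expr_to_text = mean_cross_entropy(logits_transposed, targets)
    return 0.5 * (text_to_expr + expr_to_text)


# Toy check: correctly paired embeddings should score a lower loss than swapped ones.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(text_embeddings, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_style_loss(text_embeddings, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each prompt toward its paired expression and away from the other expressions in the batch, which is one way the two modalities could end up in a shared style space.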
GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance
Although existing speech-driven talking face generation methods achieve
significant progress, they are far from real-world application due to the
avatar-specific training demand and unstable lip movements. To address the
above issues, we propose GSmoothFace, a novel two-stage generalized talking
face generation model guided by a fine-grained 3D face model, which can
synthesize smooth lip dynamics while preserving the speaker's identity. Our
proposed GSmoothFace model mainly consists of the Audio to Expression
Prediction (A2EP) module and the Target Adaptive Face Translation (TAFT)
module. Specifically, we first develop the A2EP module to predict expression
parameters synchronized with the driving speech. It uses a transformer to
capture the long-term audio context and learns the parameters from the
fine-grained 3D facial vertices, resulting in accurate and smooth
lip-synchronization performance. Afterward, the well-designed TAFT module,
empowered by Morphology Augmented Face Blending (MAFB), takes the predicted
expression parameters and target video as inputs to modify the facial region of
the target video without distorting the background content. The TAFT
effectively exploits the identity appearance and background context in the
target video, which makes it possible to generalize to different speakers
without retraining. Both quantitative and qualitative experiments confirm the
superiority of our method in terms of realism, lip synchronization, and visual
quality. See the project page for code and data, and to request pre-trained models:
https://zhanghm1995.github.io/GSmoothFace
PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features
Speech-driven 3D facial animation has improved considerably in recent years, yet
most related works use only the acoustic modality and neglect the influence of
visual and textual cues, leading to unsatisfactory results in terms of
precision and coherence. We argue that visual and textual cues carry non-trivial
information. Therefore, we present a novel framework, namely PMMTalk, using
complementary Pseudo Multi-Modal features for improving the accuracy of facial
animation. The framework entails three modules: PMMTalk encoder, cross-modal
alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder
employs the off-the-shelf talking head generation architecture and speech
recognition technology to extract visual and textual information from speech,
respectively. Subsequently, the cross-modal alignment module aligns the
audio-image-text features at temporal and semantic levels. The PMMTalk decoder
is then employed to predict lip-syncing facial blendshape coefficients. Contrary to
prior methods, PMMTalk only requires an additional random reference face image
but yields more accurate results. Additionally, it is artist-friendly as it
seamlessly integrates into standard animation production workflows by
introducing facial blendshape coefficients. Finally, given the scarcity of 3D
talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual
Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies
show that our approach outperforms the state of the art. We recommend watching
the supplementary video.
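Blendshape coefficients fit animation workflows because a rig can turn them into geometry with a weighted sum. The sketch below shows the standard linear blendshape model (neutral mesh plus weighted per-vertex offsets). The function name, the two-vertex toy mesh, and the "jaw_open" shape are hypothetical and are not taken from PMMTalk.

```python
def apply_blendshapes(neutral, deltas, weights):
    """Return deformed vertices: neutral + sum_k weights[k] * deltas[k].

    neutral: list of (x, y, z) vertex positions for the rest pose.
    deltas:  one list of per-vertex (dx, dy, dz) offsets per blendshape.
    weights: one coefficient per blendshape (e.g. predicted for one frame).
    """
    deformed = [list(vertex) for vertex in neutral]
    for shape_delta, weight in zip(deltas, weights):
        for index, offset in enumerate(shape_delta):
            for axis in range(3):
                deformed[index][axis] += weight * offset[axis]
    return deformed


# Toy mesh (hypothetical): vertex 0 is fixed, vertex 1 stands in for the chin.
neutral_mesh = [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
jaw_open = [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]  # moves the chin downward

# A coefficient of 0.5 opens the jaw halfway for this frame.
frame_vertices = apply_blendshapes(neutral_mesh, [jaw_open], [0.5])
```

Because the model is linear in the coefficients, a per-frame coefficient sequence predicted from speech can be keyframed directly on an existing rig.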
An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education
The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are also discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard of hearing students, and helping blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with natural recorded speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners who are unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.
DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser
Speech-driven 3D facial animation has been an attractive task in both
academia and industry. Traditional methods mostly focus on learning a
deterministic mapping from speech to animation. Recent approaches have begun to
account for the non-deterministic nature of speech-driven 3D face animation and
employ diffusion models for the task. However, personalizing facial
animation and accelerating animation generation are still two major limitations
of existing diffusion-based methods. To address the above limitations, we
propose DiffusionTalker, a diffusion-based method that utilizes contrastive
learning to personalize 3D facial animation and knowledge distillation to
accelerate 3D animation generation. Specifically, to enable personalization, we
introduce a learnable talking identity to aggregate knowledge in audio
sequences. The proposed identity embeddings extract customized facial cues
across different people in a contrastive learning manner. During inference,
users can obtain personalized facial animation based on input audio, reflecting
a specific talking style. For acceleration, we distill a trained diffusion
model with hundreds of steps into a lightweight model with 8 steps.
Extensive experiments are conducted to demonstrate that our method outperforms
state-of-the-art methods. The code will be released.