Rule-based lip-syncing algorithm for virtual character in voice chatbot
Virtual characters have changed the way we interact with computers. The underlying key to a believable virtual character is accurate real-time synchronization between the visual (lip movements) and the audio (speech). This work develops a 3D model for the virtual character and implements a rule-based lip-syncing algorithm for the virtual character's lip movements. We use the Jacob voice chatbot as the platform for the design and implementation of the virtual character; audio-driven articulation and manual mapping methods are considered suitable for real-time applications such as Jacob. We evaluate the proposed virtual character using the hedonic motivation system adoption model (HMSAM) with 70 users. The HMSAM score for behavioral intention to use is 91.74%, and the score for immersion is 72.95%. The average score across all aspects of the HMSAM is 85.50%. The rule-based lip-syncing algorithm accurately synchronizes the lip movements with the Jacob voice chatbot's speech in real time.
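A rule-based lip-syncing pass of this kind reduces to a lookup from phoneme classes to a small set of mouth shapes, applied as timed cues. The sketch below is a minimal Python illustration under assumed inputs; the phoneme labels, viseme names, and timing scheme are illustrative assumptions, not the mapping used by the Jacob chatbot.

```python
# Minimal sketch of a rule-based phoneme-to-viseme lookup for real-time lip-sync.
# The phoneme classes, viseme names, and timing scheme are illustrative assumptions,
# not the rules used in the paper.

# Rule table: each phoneme class maps to one mouth shape (viseme).
PHONEME_TO_VISEME = {
    "AA": "open",        # as in "father"
    "IY": "wide",        # as in "see"
    "UW": "round",       # as in "boot"
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
    "SIL": "rest",       # silence
}

def visemes_for_utterance(phoneme_stream):
    """Convert (phoneme, start_sec, end_sec) triples into timed viseme cues."""
    cues = []
    for phoneme, start, end in phoneme_stream:
        viseme = PHONEME_TO_VISEME.get(phoneme, "rest")  # unknown phonemes fall back to rest
        cues.append({"viseme": viseme, "start": start, "end": end})
    return cues

if __name__ == "__main__":
    # Hypothetical timing for the word "map": M-AA-P followed by silence.
    stream = [("M", 0.00, 0.08), ("AA", 0.08, 0.22), ("P", 0.22, 0.30), ("SIL", 0.30, 0.40)]
    for cue in visemes_for_utterance(stream):
        print(cue)
```

In a real-time setting, the cues would be consumed by the animation engine as the audio plays, switching the character's mouth blend shape at each cue boundary.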
Realistic Lip Syncing for Virtual Character Using Common Viseme Set
Speech is one of the most important methods of interaction between humans; therefore, much avatar research focuses on this area. Creating animated speech requires a facial model capable of representing the myriad shapes the human face exhibits during speech, as well as a method to produce the correct shape at the correct time. One of the main challenges is to create precise lip movements for the avatar and synchronize them with recorded audio. This paper proposes a new lip synchronization algorithm for realistic applications, which can be employed to generate facial movements synchronized with audio produced from natural speech or through a text-to-speech engine. The method requires an animator to construct animations using a canonical set of visemes for all pairwise combinations of a reduced phoneme set. These animations are then stitched together smoothly to construct the final animation.
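The stitching step described here amounts to blending pre-built viseme animations at their boundaries. A minimal sketch follows, assuming each clip is a NumPy array of per-frame vertex offsets and using simple linear cross-fading; the blend window and data layout are assumptions, not details from the paper.

```python
# Minimal sketch of stitching per-viseme animation snippets into one sequence
# by cross-fading their boundary frames. Frame rate, blend window, and the
# representation of a frame as a NumPy vertex-offset array are assumptions.
import numpy as np

def crossfade(clip_a, clip_b, blend_frames=4):
    """Linearly blend the tail of clip_a into the head of clip_b."""
    blend_frames = min(blend_frames, len(clip_a), len(clip_b))
    out = list(clip_a[:-blend_frames])
    for i in range(blend_frames):
        w = (i + 1) / (blend_frames + 1)  # blend weight ramps toward clip_b
        out.append((1 - w) * clip_a[-blend_frames + i] + w * clip_b[i])
    out.extend(clip_b[blend_frames:])
    return np.stack(out)

def stitch(clips, blend_frames=4):
    """Stitch an ordered list of viseme clips into one smooth animation."""
    result = clips[0]
    for nxt in clips[1:]:
        result = crossfade(result, nxt, blend_frames)
    return result

if __name__ == "__main__":
    # Two hypothetical 10-frame clips over 5 mouth vertices (offsets only).
    a = np.linspace(0, 1, 10)[:, None] * np.ones((10, 5))
    b = np.linspace(1, 0, 10)[:, None] * np.ones((10, 5))
    print(stitch([a, b]).shape)  # (16, 5) with a 4-frame cross-fade
```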
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
We consider the task of animating 3D facial geometry from speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one
mapping from speech signal to 3D face meshes on small datasets with limited
speakers. While these models can achieve high-quality lip articulation for
speakers in the training set, they are unable to capture the full and diverse
distribution of 3D facial motions that accompany speech in the real world.
Importantly, the relationship between speech and facial motion is one-to-many,
containing both inter-speaker and intra-speaker variations and necessitating a
probabilistic approach. In this paper, we identify and address key challenges
that have so far limited the development of probabilistic models: lack of
datasets and metrics that are suitable for training and evaluating them, as
well as the difficulty of designing a model that generates diverse results
while remaining faithful to a strong conditioning signal such as speech. We first
propose large-scale benchmark datasets and metrics suitable for probabilistic
modeling. Then, we demonstrate a probabilistic model that achieves both
diversity and fidelity to speech, outperforming other methods across the
proposed benchmarks. Finally, we showcase useful applications of probabilistic
models trained on these large-scale datasets: we can generate diverse
speech-driven 3D facial motion that matches unseen speaker styles extracted
from reference clips; and our synthetic meshes can be used to improve the
performance of downstream audio-visual models.
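The one-to-many argument above implies the generator should be sampled rather than regressed. A minimal PyTorch-style sketch of drawing several motion samples for one audio clip, in the spirit of a conditional VAE decoder, is given below; the module sizes, feature dimensions, and names are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of sampling diverse facial-motion outputs for a single speech input.
# Dimensions, layers, and names are assumptions, not the proposed model.
import torch
import torch.nn as nn

class MotionDecoder(nn.Module):
    def __init__(self, audio_dim=128, latent_dim=32, vertex_dim=5023 * 3):
        super().__init__()
        # Decoder maps (audio feature, latent sample) -> per-frame vertex offsets.
        self.net = nn.Sequential(
            nn.Linear(audio_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, vertex_dim),
        )

    def forward(self, audio_feat, z):
        # audio_feat: (T, audio_dim), z: (T, latent_dim)
        return self.net(torch.cat([audio_feat, z], dim=-1))

if __name__ == "__main__":
    torch.manual_seed(0)
    T = 30                                # frames in the clip
    decoder = MotionDecoder()
    audio_feat = torch.randn(T, 128)      # stand-in for encoded speech features
    # Different latent samples yield different, equally plausible motions for
    # the same audio -- the one-to-many behaviour discussed in the abstract.
    for sample_idx in range(3):
        z = torch.randn(T, 32)
        motion = decoder(audio_feat, z)
        print(sample_idx, motion.shape)   # torch.Size([30, 15069])
```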
Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape
The creation of lifelike speech-driven 3D facial animation requires a natural
and precise synchronization between audio input and facial expressions.
However, existing works still fail to render shapes with flexible head poses
and natural facial details (e.g., wrinkles). This limitation is mainly due to
two aspects: 1) Collecting training set with detailed 3D facial shapes is
highly expensive. This scarcity of detailed shape annotations hinders the
training of models with expressive facial animation. 2) Compared to mouth
movement, head pose is much less correlated with speech content.
Consequently, modeling mouth movement and head pose jointly results in a
lack of facial movement controllability. To address these challenges, we
introduce VividTalker, a new framework designed to facilitate speech-driven 3D
facial animation characterized by flexible head pose and natural facial
details. Specifically, we explicitly disentangle facial animation into head
pose and mouth movement and encode them separately into discrete latent spaces.
Then, these attributes are generated through an autoregressive process
leveraging a window-based Transformer architecture. To augment the richness of
3D facial animation, we construct a new 3D dataset with detailed shapes and
learn to synthesize facial details in line with speech content. Extensive
quantitative and qualitative experiments demonstrate that VividTalker
outperforms state-of-the-art methods, resulting in vivid and realistic
speech-driven 3D facial animation.
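The disentanglement described here (head pose and mouth motion encoded into separate discrete latent spaces, then predicted autoregressively over windows) can be illustrated with a small vector-quantization step. The sketch below shows only the nearest-codebook-entry lookup for two separate codebooks; codebook sizes and feature dimensions are assumptions, not VividTalker's settings.

```python
# Minimal sketch of encoding head pose and mouth motion into *separate* discrete
# latent spaces via nearest-neighbour vector quantization. Codebook sizes and
# feature dimensions are illustrative assumptions.
import torch

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # features: (T, D), codebook: (K, D)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(dim=-1)  # (T, K)
    indices = dists.argmin(dim=-1)                                            # (T,) discrete codes
    return indices, codebook[indices]  # codes and their quantized vectors

if __name__ == "__main__":
    torch.manual_seed(0)
    T = 20
    pose_feats = torch.randn(T, 16)       # stand-in head-pose features
    mouth_feats = torch.randn(T, 64)      # stand-in mouth-motion features
    pose_codebook = torch.randn(128, 16)  # separate codebooks keep the two
    mouth_codebook = torch.randn(512, 64) # attributes disentangled
    pose_codes, _ = quantize(pose_feats, pose_codebook)
    mouth_codes, _ = quantize(mouth_feats, mouth_codebook)
    # A window-based autoregressive Transformer would then predict these code
    # sequences from audio, one window at a time.
    print(pose_codes[:5], mouth_codes[:5])
```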
Emotional Speech-Driven Animation with Content-Emotion Disentanglement
To be widely adopted, 3D facial avatars must be animated easily,
realistically, and directly from speech signals. While the best recent methods
generate 3D animations that are synchronized with the input audio, they largely
ignore the impact of emotions on facial expressions. Realistic facial animation
requires lip-sync together with the natural expression of emotion. To that end,
we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which
generates 3D talking-head avatars that maintain lip-sync from speech while
enabling explicit control over the expression of emotion. To achieve this, we
supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion.
These losses are based on two key observations: (1) deformations of the face
due to speech are spatially localized around the mouth and have high temporal
frequency, whereas (2) facial expressions may deform the whole face and occur
over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss
to preserve the speech-dependent content, while supervising emotion at the
sequence level. Furthermore, we employ a content-emotion exchange mechanism in
order to supervise different emotions on the same audio, while maintaining the
lip motion synchronized with the speech. To employ deep perceptual losses
without getting undesirable artifacts, we devise a motion prior in the form of
a temporal VAE. Due to the absence of high-quality aligned emotional 3D face
datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted
from an emotional video dataset (i.e., MEAD). Extensive qualitative and
perceptual evaluations demonstrate that EMOTE produces speech-driven facial
animations with better lip-sync than state-of-the-art methods trained on the
same data, while offering additional, high-quality emotional control.Comment: SIGGRAPH Asia 2023 Conference Pape
Implementation of a Rule-based Lip-syncing Algorithm for the Virtual Character of the Jacob Chatbot
Jacob is a voice chatbot application that provides information about the Informatics Dual Degree Program at Universitas Multimedia Nusantara. Jacob already offers voice-based chat and facial recognition, but it does not yet have a virtual character. A virtual character can increase the behavioral intention to use and the immersion of Jacob's users. This research implements a rule-based lip-syncing algorithm for Jacob's virtual character. The Jacob virtual character application is built in Unity using the C# programming language. Evaluation is carried out with a questionnaire based on the Hedonic Motivation System Adoption Model (HMSAM). The evaluation results show a behavioral intention to use level of 90.63% and an immersion level of 74.00%.
Speech-Driven 3D Face Animation with Composite and Regional Facial Movements
Speech-driven 3D face animation poses significant challenges due to the
intricacy and variability inherent in human facial movements. This paper
emphasizes the importance of considering both the composite and regional
natures of facial movements in speech-driven 3D face animation. The composite
nature pertains to how speech-independent factors globally modulate
speech-driven facial movements along the temporal dimension. Meanwhile, the
regional nature alludes to the notion that facial movements are not globally
correlated but are actuated by local musculature along the spatial dimension.
It is thus indispensable to incorporate both natures for engendering vivid
animation. To address the composite nature, we introduce an adaptive modulation
module that employs arbitrary facial movements to dynamically adjust
speech-driven facial movements across frames on a global scale. To accommodate
the regional nature, our approach ensures that each constituent of the facial
features for every frame focuses on the local spatial movements of 3D faces.
Moreover, we present a non-autoregressive backbone for translating audio to 3D
facial movements, which maintains high-frequency nuances of facial movements
and facilitates efficient inference. Comprehensive experiments and user studies
demonstrate that our method surpasses contemporary state-of-the-art approaches
both qualitatively and quantitatively. Comment: Accepted by MM 2023, 9 pages, 7 figures. arXiv admin note: text
overlap with arXiv:2303.0979
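The "composite" modulation described in this abstract is essentially a learned, frame-wise scale-and-shift applied to speech-driven features, conditioned on speech-independent input. The sketch below shows that idea in isolation; the dimensions and the style source are assumptions, not the paper's module.

```python
# Minimal sketch of an adaptive modulation step: speech-independent style features
# produce per-frame scale and shift that globally modulate speech-driven features.
# Dimensions and the style source are illustrative assumptions.
import torch
import torch.nn as nn

class AdaptiveModulation(nn.Module):
    def __init__(self, feat_dim=64, style_dim=16):
        super().__init__()
        # One linear layer predicts both scale (gamma) and shift (beta) per frame.
        self.to_scale_shift = nn.Linear(style_dim, 2 * feat_dim)

    def forward(self, speech_feats, style_feats):
        # speech_feats: (T, feat_dim) speech-driven features
        # style_feats:  (T, style_dim) speech-independent modulation source
        gamma, beta = self.to_scale_shift(style_feats).chunk(2, dim=-1)
        return (1 + gamma) * speech_feats + beta  # globally modulate each frame

if __name__ == "__main__":
    torch.manual_seed(0)
    mod = AdaptiveModulation()
    out = mod(torch.randn(30, 64), torch.randn(30, 16))
    print(out.shape)  # torch.Size([30, 64])
```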