Rule-based lip-syncing algorithm for virtual character in voice chatbot
Virtual characters have changed the way we interact with computers. The underlying key to a believable virtual character is accurate real-time synchronization between the visual (lip movements) and the audio (speech). This work develops a 3D model for the virtual character and implements a rule-based lip-syncing algorithm for the virtual character's lip movements. We use the Jacob voice chatbot as the platform for the design and implementation of the virtual character; audio-driven articulation and manual mapping methods are considered suitable for real-time applications such as Jacob. We evaluate the proposed virtual character using the hedonic motivation system adoption model (HMSAM) with 70 users. The HMSAM score for behavioral intention to use is 91.74%, and the score for immersion is 72.95%. The average score across all aspects of the HMSAM is 85.50%. The rule-based lip-syncing algorithm accurately synchronizes the lip movements with the Jacob voice chatbot's speech in real time.
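A rule-based lip-syncing pass of this kind reduces to a lookup from phoneme classes to a small set of mouth shapes, applied as timed cues. The sketch below is a minimal Python illustration under assumed inputs; the phoneme labels, viseme names, and timing scheme are illustrative assumptions, not the mapping used by the Jacob chatbot.

```python
# Minimal sketch of a rule-based phoneme-to-viseme lookup for real-time lip-sync.
# The phoneme classes, viseme names, and timing scheme are illustrative assumptions,
# not the rules used in the paper.

# Rule table: each phoneme class maps to one mouth shape (viseme).
PHONEME_TO_VISEME = {
    "AA": "open",        # as in "father"
    "IY": "wide",        # as in "see"
    "UW": "round",       # as in "boot"
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
    "SIL": "rest",       # silence
}

def visemes_for_utterance(phoneme_stream):
    """Convert (phoneme, start_sec, end_sec) triples into timed viseme cues."""
    cues = []
    for phoneme, start, end in phoneme_stream:
        viseme = PHONEME_TO_VISEME.get(phoneme, "rest")  # unknown phonemes fall back to rest
        cues.append({"viseme": viseme, "start": start, "end": end})
    return cues

if __name__ == "__main__":
    # Hypothetical timing for the word "map": M-AA-P followed by silence.
    stream = [("M", 0.00, 0.08), ("AA", 0.08, 0.22), ("P", 0.22, 0.30), ("SIL", 0.30, 0.40)]
    for cue in visemes_for_utterance(stream):
        print(cue)
```

In a real-time setting, the cues would be consumed by the animation engine as the audio plays, switching the character's mouth blend shape at each cue boundary.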
Realistic Lip Syncing for Virtual Character Using Common Viseme Set
Speech is one of the most important methods of interaction between humans; therefore, much avatar research focuses on this area. Creating animated speech requires a facial model capable of representing the myriad shapes the human face exhibits during speech, as well as a method to produce the correct shape at the correct time. One of the main challenges is to create precise lip movements for the avatar and synchronize them with recorded audio. This paper proposes a new lip synchronization algorithm for realistic applications, which can be employed to generate facial movements synchronized with audio produced from natural speech or through a text-to-speech engine. The method requires an animator to construct animations using a canonical set of visemes for all pairwise combinations of a reduced phoneme set. These animations are then stitched together smoothly to construct the final animation.
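The stitching step described here amounts to blending pre-built viseme animations at their boundaries. A minimal sketch follows, assuming each clip is a NumPy array of per-frame vertex offsets and using simple linear cross-fading; the blend window and data layout are assumptions, not details from the paper.

```python
# Minimal sketch of stitching per-viseme animation snippets into one sequence
# by cross-fading their boundary frames. Frame rate, blend window, and the
# representation of a frame as a NumPy vertex-offset array are assumptions.
import numpy as np

def crossfade(clip_a, clip_b, blend_frames=4):
    """Linearly blend the tail of clip_a into the head of clip_b."""
    blend_frames = min(blend_frames, len(clip_a), len(clip_b))
    out = list(clip_a[:-blend_frames])
    for i in range(blend_frames):
        w = (i + 1) / (blend_frames + 1)  # blend weight ramps toward clip_b
        out.append((1 - w) * clip_a[-blend_frames + i] + w * clip_b[i])
    out.extend(clip_b[blend_frames:])
    return np.stack(out)

def stitch(clips, blend_frames=4):
    """Stitch an ordered list of viseme clips into one smooth animation."""
    result = clips[0]
    for nxt in clips[1:]:
        result = crossfade(result, nxt, blend_frames)
    return result

if __name__ == "__main__":
    # Two hypothetical 10-frame clips over 5 mouth vertices (offsets only).
    a = np.linspace(0, 1, 10)[:, None] * np.ones((10, 5))
    b = np.linspace(1, 0, 10)[:, None] * np.ones((10, 5))
    print(stitch([a, b]).shape)  # (16, 5) with a 4-frame cross-fade
```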
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
We consider the task of animating 3D facial geometry from speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one
mapping from speech signal to 3D face meshes on small datasets with limited
speakers. While these models can achieve high-quality lip articulation for
speakers in the training set, they are unable to capture the full and diverse
distribution of 3D facial motions that accompany speech in the real world.
Importantly, the relationship between speech and facial motion is one-to-many,
containing both inter-speaker and intra-speaker variations and necessitating a
probabilistic approach. In this paper, we identify and address key challenges
that have so far limited the development of probabilistic models: lack of
datasets and metrics that are suitable for training and evaluating them, as
well as the difficulty of designing a model that generates diverse results
while remaining faithful to a strong conditioning signal such as speech. We first
propose large-scale benchmark datasets and metrics suitable for probabilistic
modeling. Then, we demonstrate a probabilistic model that achieves both
diversity and fidelity to speech, outperforming other methods across the
proposed benchmarks. Finally, we showcase useful applications of probabilistic
models trained on these large-scale datasets: we can generate diverse
speech-driven 3D facial motion that matches unseen speaker styles extracted
from reference clips; and our synthetic meshes can be used to improve the
performance of downstream audio-visual models.
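The one-to-many argument above implies the generator should be sampled rather than regressed. A minimal PyTorch-style sketch of drawing several motion samples for one audio clip, in the spirit of a conditional VAE decoder, is given below; the module sizes, feature dimensions, and names are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of sampling diverse facial-motion outputs for a single speech input.
# Dimensions, layers, and names are assumptions, not the proposed model.
import torch
import torch.nn as nn

class MotionDecoder(nn.Module):
    def __init__(self, audio_dim=128, latent_dim=32, vertex_dim=5023 * 3):
        super().__init__()
        # Decoder maps (audio feature, latent sample) -> per-frame vertex offsets.
        self.net = nn.Sequential(
            nn.Linear(audio_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, vertex_dim),
        )

    def forward(self, audio_feat, z):
        # audio_feat: (T, audio_dim), z: (T, latent_dim)
        return self.net(torch.cat([audio_feat, z], dim=-1))

if __name__ == "__main__":
    torch.manual_seed(0)
    T = 30                                # frames in the clip
    decoder = MotionDecoder()
    audio_feat = torch.randn(T, 128)      # stand-in for encoded speech features
    # Different latent samples yield different, equally plausible motions for
    # the same audio -- the one-to-many behaviour discussed in the abstract.
    for sample_idx in range(3):
        z = torch.randn(T, 32)
        motion = decoder(audio_feat, z)
        print(sample_idx, motion.shape)   # torch.Size([30, 15069])
```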
Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape
The creation of lifelike speech-driven 3D facial animation requires a natural
and precise synchronization between audio input and facial expressions.
However, existing works still fail to render shapes with flexible head poses
and natural facial details (e.g., wrinkles). This limitation is mainly due to
two aspects: 1) Collecting training set with detailed 3D facial shapes is
highly expensive. This scarcity of detailed shape annotations hinders the
training of models with expressive facial animation. 2) Compared to mouth
movement, head pose is much less correlated with speech content.
Consequently, modeling mouth movement and head pose jointly results in a
lack of facial movement controllability. To address these challenges, we
introduce VividTalker, a new framework designed to facilitate speech-driven 3D
facial animation characterized by flexible head pose and natural facial
details. Specifically, we explicitly disentangle facial animation into head
pose and mouth movement and encode them separately into discrete latent spaces.
Then, these attributes are generated through an autoregressive process
leveraging a window-based Transformer architecture. To augment the richness of
3D facial animation, we construct a new 3D dataset with detailed shapes and
learn to synthesize facial details in line with speech content. Extensive
quantitative and qualitative experiments demonstrate that VividTalker
outperforms state-of-the-art methods, resulting in vivid and realistic
speech-driven 3D facial animation.
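The disentanglement described here (head pose and mouth motion encoded into separate discrete latent spaces, then predicted autoregressively over windows) can be illustrated with a small vector-quantization step. The sketch below shows only the nearest-codebook-entry lookup for two separate codebooks; codebook sizes and feature dimensions are assumptions, not VividTalker's settings.

```python
# Minimal sketch of encoding head pose and mouth motion into *separate* discrete
# latent spaces via nearest-neighbour vector quantization. Codebook sizes and
# feature dimensions are illustrative assumptions.
import torch

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # features: (T, D), codebook: (K, D)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(dim=-1)  # (T, K)
    indices = dists.argmin(dim=-1)                                            # (T,) discrete codes
    return indices, codebook[indices]  # codes and their quantized vectors

if __name__ == "__main__":
    torch.manual_seed(0)
    T = 20
    pose_feats = torch.randn(T, 16)       # stand-in head-pose features
    mouth_feats = torch.randn(T, 64)      # stand-in mouth-motion features
    pose_codebook = torch.randn(128, 16)  # separate codebooks keep the two
    mouth_codebook = torch.randn(512, 64) # attributes disentangled
    pose_codes, _ = quantize(pose_feats, pose_codebook)
    mouth_codes, _ = quantize(mouth_feats, mouth_codebook)
    # A window-based autoregressive Transformer would then predict these code
    # sequences from audio, one window at a time.
    print(pose_codes[:5], mouth_codes[:5])
```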
Emotional Speech-Driven Animation with Content-Emotion Disentanglement
To be widely adopted, 3D facial avatars must be animated easily,
realistically, and directly from speech signals. While the best recent methods
generate 3D animations that are synchronized with the input audio, they largely
ignore the impact of emotions on facial expressions. Realistic facial animation
requires lip-sync together with the natural expression of emotion. To that end,
we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which
generates 3D talking-head avatars that maintain lip-sync from speech while
enabling explicit control over the expression of emotion. To achieve this, we
supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion.
These losses are based on two key observations: (1) deformations of the face
due to speech are spatially localized around the mouth and have high temporal
frequency, whereas (2) facial expressions may deform the whole face and occur
over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss
to preserve the speech-dependent content, while supervising emotion at the
sequence level. Furthermore, we employ a content-emotion exchange mechanism in
order to supervise different emotions on the same audio, while maintaining the
lip motion synchronized with the speech. To employ deep perceptual losses
without getting undesirable artifacts, we devise a motion prior in the form of
a temporal VAE. Due to the absence of high-quality aligned emotional 3D face
datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted
from an emotional video dataset (i.e., MEAD). Extensive qualitative and
perceptual evaluations demonstrate that EMOTE produces speech-driven facial
animations with better lip-sync than state-of-the-art methods trained on the
same data, while offering additional, high-quality emotional control.Comment: SIGGRAPH Asia 2023 Conference Pape
Implementation of a Rule-based Lip-syncing Algorithm for the Virtual Character of the Jacob Chatbot
Jacob is a voice chatbot application that provides information about the Informatics Dual Degree Program at Universitas Multimedia Nusantara. Jacob already offers voice-based chat and facial recognition, but it does not yet have a virtual character. A virtual character can increase the behavioral intention to use and the immersion of Jacob's users. This research implements a rule-based lip-syncing algorithm for Jacob's virtual character. The Jacob virtual character application is built in Unity using the C# programming language. Evaluation is carried out with a questionnaire based on the Hedonic Motivation System Adoption Model (HMSAM). The evaluation results show a behavioral intention to use level of 90.63% and an immersion level of 74.00%.
Speech-Driven 3D Face Animation with Composite and Regional Facial Movements
Speech-driven 3D face animation poses significant challenges due to the
intricacy and variability inherent in human facial movements. This paper
emphasizes the importance of considering both the composite and regional
natures of facial movements in speech-driven 3D face animation. The composite
nature pertains to how speech-independent factors globally modulate
speech-driven facial movements along the temporal dimension. Meanwhile, the
regional nature alludes to the notion that facial movements are not globally
correlated but are actuated by local musculature along the spatial dimension.
It is thus indispensable to incorporate both natures for engendering vivid
animation. To address the composite nature, we introduce an adaptive modulation
module that employs arbitrary facial movements to dynamically adjust
speech-driven facial movements across frames on a global scale. To accommodate
the regional nature, our approach ensures that each constituent of the facial
features for every frame focuses on the local spatial movements of 3D faces.
Moreover, we present a non-autoregressive backbone for translating audio to 3D
facial movements, which maintains high-frequency nuances of facial movements
and facilitates efficient inference. Comprehensive experiments and user studies
demonstrate that our method surpasses contemporary state-of-the-art approaches
both qualitatively and quantitatively. Comment: Accepted by MM 2023, 9 pages, 7 figures. arXiv admin note: text
overlap with arXiv:2303.0979
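The "composite" modulation described in this abstract is essentially a learned, frame-wise scale-and-shift applied to speech-driven features, conditioned on speech-independent input. The sketch below shows that idea in isolation; the dimensions and the style source are assumptions, not the paper's module.

```python
# Minimal sketch of an adaptive modulation step: speech-independent style features
# produce per-frame scale and shift that globally modulate speech-driven features.
# Dimensions and the style source are illustrative assumptions.
import torch
import torch.nn as nn

class AdaptiveModulation(nn.Module):
    def __init__(self, feat_dim=64, style_dim=16):
        super().__init__()
        # One linear layer predicts both scale (gamma) and shift (beta) per frame.
        self.to_scale_shift = nn.Linear(style_dim, 2 * feat_dim)

    def forward(self, speech_feats, style_feats):
        # speech_feats: (T, feat_dim) speech-driven features
        # style_feats:  (T, style_dim) speech-independent modulation source
        gamma, beta = self.to_scale_shift(style_feats).chunk(2, dim=-1)
        return (1 + gamma) * speech_feats + beta  # globally modulate each frame

if __name__ == "__main__":
    torch.manual_seed(0)
    mod = AdaptiveModulation()
    out = mod(torch.randn(30, 64), torch.randn(30, 16))
    print(out.shape)  # torch.Size([30, 64])
```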