Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation

Author: Gan Yuan
Sun Lingyun
Yang Yi
Yang Zongxin
Yue Xihang
Publication date: 10/09/2023

Audio-driven talking-head synthesis is a popular research topic for virtual human-related applications. However, the inflexibility and inefficiency of existing methods, which necessitate expensive end-to-end training to transfer emotions from guidance videos to talking-head predictions, are significant limitations. In this work, we propose the Emotional Adaptation for Audio-driven Talking-head (EAT) method, which transforms emotion-agnostic talking-head models into emotion-controllable ones in a cost-effective and efficient manner through parameter-efficient adaptations. Our approach utilizes a pretrained emotion-agnostic talking-head transformer and introduces three lightweight adaptations (the Deep Emotional Prompts, Emotional Deformation Network, and Emotional Adaptation Module) from different perspectives to enable precise and realistic emotion controls. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks, including LRW and MEAD. Additionally, our parameter-efficient adaptations exhibit remarkable generalization ability, even in scenarios where emotional training videos are scarce or nonexistent. Project website: https://yuangan.github.io/eat/
Comment: Accepted to ICCV 2023.

arXiv.org e-Print Archive

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Author: Bo Liefeng
Cao Xun
Gao Daiheng
Ji Xinya
Sun Xusen
Zhang Bang
Zhang Longhao
Zhang Peng
Zhou Kangneng
Zhu Hao
Publication date: 06/12/2023

Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made in lip-sync, expressive facial expressions, natural head pose generation, and high video quality. However, no model has yet led or tied on all these metrics due to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that supports generating high-visual-quality talking head videos with all the above properties. Specifically, in the first stage, we map the audio to mesh by learning two motions: non-rigid expression motion and rigid head motion. For expression motion, both blendshapes and vertices are adopted as the intermediate representation to maximize the representation ability of the model. For natural head motion, a novel learnable head pose codebook with a two-phase training mechanism is proposed. In the second stage, we propose a dual-branch motion-vae and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame. Extensive experiments show that the proposed VividTalk can generate high-visual-quality talking head videos with lip-sync and realism enhanced by a large margin, and that it outperforms previous state-of-the-art works in objective and subjective comparisons.
Comment: 10 pages, 8 figures

arXiv.org e-Print Archive

CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Author: Cheng Ning
Deng Yimin
Liang Ziqi
Wang Jianzong
Xiao Jing
Zhang Xulong
Publication date: 14/11/2023

This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and an image of a person as reference, and synthesizes a photo-realistic talking video of that person with head poses controlled by a short video clip and with a proper eye-blinking embedding. Both head pose and eye blinking are important cues for deepfake detection. Implicit control of poses by video has already been achieved by state-of-the-art work. According to recent research, eye blinking has a weak correlation with the input audio, which means that extracting eye blinks from audio and generating them are possible. Hence, we propose a GAN-based architecture that extracts eye-blink features from the input audio and the reference video, respectively, employs contrastive training between them, and then embeds them into the concatenated identity and pose features to generate talking face images. Experimental results show that the proposed method can generate photo-realistic talking faces with synchronized lip motion, natural head poses, and blinking eyes.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)

arXiv.org e-Print Archive

LaughTalk: Expressive 3D Talking Head Generation with Laughter

Author: Hong Da Hye
Hyun Lee
Ju Janghoon
Nam Suekyeong
Oh Tae-Hyun
Sung-Bin Kim
Publication date: 02/11/2023

Laughter is a unique expression, essential to affirmative social interactions of humans. Although current 3D talking head generation methods produce convincing verbal articulations, they often fail to capture the vitality and subtleties of laughter and smiles despite their importance in social context. In this paper, we introduce a novel task to generate 3D talking heads capable of both articulate speech and authentic laughter. Our newly curated dataset comprises 2D laughing videos paired with pseudo-annotated and human-validated 3D FLAME parameters and vertices. Given our proposed dataset, we present a strong baseline with a two-stage training scheme: the model first learns to talk and then acquires the ability to express laughter. Extensive experiments demonstrate that our method performs favorably compared to existing approaches in both talking head generation and expressing laughter signals. We further explore potential applications on top of our proposed method for rigging realistic avatars.
Comment: Accepted to WACV202

arXiv.org e-Print Archive

Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition

Author: Chen Xiaokang
He Dongliang
Hu Tianshu
Liu Jingtuo
Tang Jiaxiang
Wang Jingdong
Wang Kaisiyuan
Zeng Gang
Zhou Hang
Publication date: 22/11/2022

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, their slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesis of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency under the premise of good rendering quality. Extensive experiments demonstrate that our method can generate realistic, audio-lip-synchronized talking portrait videos, while also being highly efficient compared to previous methods.
Comment: Project page: https://me.kiui.moe/radnerf
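To illustrate why decomposing into low-dimensional grids can help, the following is a minimal counting sketch and not the authors' implementation. It compares the number of cells in one hypothetical joint grid over three spatial and two audio coordinates with the three smaller grids named in the abstract (a 3D spatial grid, a 2D audio grid, and a 2D torso grid). The shared per-axis resolution of 64, the dense (non-hashed) layout, and the joint 5-D baseline are assumptions made only for this illustration.

```python
def dense_grid_cells(resolution: int, dims: int) -> int:
    """Number of cells in a dense grid with `dims` axes of equal resolution."""
    return resolution ** dims


def decomposed_cells(resolution: int) -> int:
    """Total cells across the three low-dimensional grids from the abstract."""
    head_spatial = dense_grid_cells(resolution, 3)  # 3D spatial grid (head)
    head_audio = dense_grid_cells(resolution, 2)    # 2D audio grid (head)
    torso = dense_grid_cells(resolution, 2)         # 2D grid (torso)
    return head_spatial + head_audio + torso


# Hypothetical per-axis resolution, chosen only for illustration.
R = 64

# Assumed baseline: one joint grid over 3 spatial + 2 audio coordinates.
joint = dense_grid_cells(R, 5)
split = decomposed_cells(R)

print(f"joint 5-D grid cells: {joint}")
print(f"decomposed grid cells: {split}")
print(f"reduction factor: {joint / split:.1f}x")
```

Under these assumptions the joint grid grows with the fifth power of the resolution, while the decomposed grids grow with at most the third power, which is the intuition behind storing the head and torso in separate low-dimensional grids.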

arXiv.org e-Print Archive

Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation

Author: Hong Fa-Ting
Xu Dan
Publication date: 20/07/2023

Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target-driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation. Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on VoxCeleb1 and CelebV datasets. Please check our project page: https://github.com/harlanhong/ICCV2023-MCNET
Comment: Accepted by ICCV2023, update the reference and figure

arXiv.org e-Print Archive

An efficient virtual patient image model: interview training in pharmacy

Author: Park Mira
Summons Peter
Publication venue: Science and Engineering Research Support Society
Publication date: 01/01/2013

This paper presents the development of a virtual patient simulation using a 3D talking head and its use by pharmacy students as a training aid for patient consultation. The paper concentrates on the virtual patient modeling, its synthesis with a speech engine and facial expression interaction. The virtual patient model is developed in three stages: building a personalized 3D face model; animation of the face model; and speech-driven face synthesis. The model is used in conjunction with a training artificial intelligence module that creates several scenarios in which the student's oral interview ability is assessed. The final evaluation phase is a randomized controlled trial at three partner universities: The University of Newcastle, Monash University and Charles Sturt University. It shows the potential to revolutionize the way pharmacy students' training is conducted.

University of Newcastle's Digital Repository

An interactive speech training system with virtual reality articulation for Mandarin-speaking hearing impaired children

Author: Liu X
Ng ML
Wang L
Wu X
Yan N
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2013

The present project involved the development of a novel interactive speech training system based on virtual reality articulation and examination of the efficacy of the system for hearing impaired (HI) children. Twenty meaningful Mandarin words were presented to the HI children via a 3-D talking head during articulation training. Electromagnetic Articulography (EMA) and graphic transform technology were used to depict movements of various articulators. In addition, speech corpora were organized in listening and speaking training modules of the system to help improve language skills of the HI children. Accuracy of virtual reality articulatory movement was evaluated through a series of experiments. Finally, a pilot test was performed to train two HI children using the system. Preliminary results showed improvement in speech production by the HI children, and the system was recognized as acceptable and interesting for children. It can be concluded that the training system is effective and valid in articulation training for HI children. © 2013 IEEE.

HKU Scholars Hub