Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation
Audio-driven talking-head synthesis is a popular research topic for virtual
human-related applications. However, the inflexibility and inefficiency of
existing methods, which necessitate expensive end-to-end training to transfer
emotions from guidance videos to talking-head predictions, are significant
limitations. In this work, we propose the Emotional Adaptation for Audio-driven
Talking-head (EAT) method, which transforms emotion-agnostic talking-head
models into emotion-controllable ones in a cost-effective and efficient manner
through parameter-efficient adaptations. Our approach utilizes a pretrained
emotion-agnostic talking-head transformer and introduces three lightweight
adaptations (the Deep Emotional Prompts, Emotional Deformation Network, and
Emotional Adaptation Module) from different perspectives to enable precise and
realistic emotion controls. Our experiments demonstrate that our approach
achieves state-of-the-art performance on widely-used benchmarks, including LRW
and MEAD. Additionally, our parameter-efficient adaptations exhibit remarkable
generalization ability, even in scenarios where emotional training videos are
scarce or nonexistent. Project website: https://yuangan.github.io/eat/
Comment: Accepted to ICCV 2023. Project page: https://yuangan.github.io/eat
VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior
Audio-driven talking head generation has drawn much attention in recent
years, and many efforts have been made in lip-sync, expressive facial
expressions, natural head pose generation, and high video quality. However, no
model has yet led or tied on all these metrics due to the one-to-many mapping
between audio and motion. In this paper, we propose VividTalk, a two-stage
generic framework that supports generating high-visual quality talking head
videos with all the above properties. Specifically, in the first stage, we map
the audio to mesh by learning two motions, including non-rigid expression
motion and rigid head motion. For expression motion, both blendshape and vertex
are adopted as the intermediate representation to maximize the representation
ability of the model. For natural head motion, a novel learnable head pose
codebook with a two-phase training mechanism is proposed. In the second stage,
we propose a dual-branch motion-VAE and a generator to transform the meshes
into dense motion and synthesize high-quality video frame-by-frame. Extensive
experiments show that the proposed VividTalk can generate high-visual quality
talking head videos with lip-sync and realism enhanced by a large margin, and
outperforms previous state-of-the-art works in objective and subjective
comparisons.
Comment: 10 pages, 8 figures
CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding
This paper proposes a talking face generation method named "CP-EB" that takes
an audio signal as input and a person image as reference, to synthesize a
photo-realistic talking video of that person, with head poses controlled by a
short video clip and with proper eye-blinking embedding. Both head pose and eye
blinking are important cues for deepfake detection.
The implicit control of poses by video has already been achieved by
state-of-the-art work. According to recent research, eye blinking has a weak
correlation with the input audio, which means that extracting eye blinks from
audio and generating them are possible. Hence, we propose a GAN-based
architecture that extracts eye-blink features from the input audio and the
reference video respectively, employs contrastive training between them, and
then embeds them into the concatenated features of identity and poses to
generate talking face images. Experimental results show that the proposed
method can generate photo-realistic talking faces with synchronized lip
motion, natural head poses and blinking eyes.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and
Distributed Processing with Applications (IEEE ISPA 2023)
LaughTalk: Expressive 3D Talking Head Generation with Laughter
Laughter is a unique expression, essential to affirmative social interactions
of humans. Although current 3D talking head generation methods produce
convincing verbal articulations, they often fail to capture the vitality and
subtleties of laughter and smiles despite their importance in social context.
In this paper, we introduce a novel task to generate 3D talking heads capable
of both articulate speech and authentic laughter. Our newly curated dataset
comprises 2D laughing videos paired with pseudo-annotated and human-validated
3D FLAME parameters and vertices. Given our proposed dataset, we present a
strong baseline with a two-stage training scheme: the model first learns to
talk and then acquires the ability to express laughter. Extensive experiments
demonstrate that our method performs favorably compared to existing approaches
in both talking head generation and expressing laughter signals. We further
explore potential applications on top of our proposed method for rigging
realistic avatars.
Comment: Accepted to WACV202
Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition
While dynamic Neural Radiance Fields (NeRF) have shown success in
high-fidelity 3D modeling of talking portraits, the slow training and inference
speed severely obstruct their potential usage. In this paper, we propose an
efficient NeRF-based framework that enables real-time synthesizing of talking
portraits and faster convergence by leveraging the recent success of grid-based
NeRF. Our key insight is to decompose the inherently high-dimensional talking
portrait representation into three low-dimensional feature grids. Specifically,
a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D
spatial grid and a 2D audio grid. The torso is handled with another 2D grid in
a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency
under the premise of good rendering quality. Extensive experiments demonstrate
that our method can generate realistic and audio-lips synchronized talking
portrait videos, while also being highly efficient compared to previous
methods.
Comment: Project page: https://me.kiui.moe/radnerf
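A rough back-of-the-envelope sketch of why splitting one high-dimensional representation into three low-dimensional feature grids can save memory. The abstract gives no grid sizes, so the resolution of 64 and the idea of a single dense 5-D grid (three spatial axes plus two audio axes) as the point of comparison are assumptions made purely for illustration; the function names are hypothetical and not part of the project's code.

```python
# Illustrative arithmetic only: the dense 5-D baseline and the resolution
# of 64 are assumptions for this sketch, not figures from the abstract.

def grid_cells(resolution: int, dims: int) -> int:
    """Number of cells in a dense grid with `dims` axes of equal resolution."""
    return resolution ** dims


def decomposed_cells(resolution: int) -> int:
    """Cells in three low-dimensional grids: 3D spatial, 2D audio, 2D torso."""
    return grid_cells(resolution, 3) + 2 * grid_cells(resolution, 2)


resolution = 64
dense = grid_cells(resolution, 5)          # one joint spatial-audio grid
decomposed = decomposed_cells(resolution)  # three separate small grids
print(f"dense 5-D grid cells:   {dense}")
print(f"decomposed grid cells:  {decomposed}")
```

Under these assumed sizes the decomposed layout stores several thousand times fewer cells than the joint grid, which is the kind of saving that makes grid-based encodings practical for real-time use.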
Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation
Talking head video generation aims to animate a human face in a still image
with dynamic poses and expressions using motion information derived from a
target-driving video, while maintaining the person's identity in the source
image. However, dramatic and complex motions in the driving video cause
ambiguous generation, because the still source image cannot provide sufficient
appearance information for occluded regions or delicate expression variations,
which produces severe artifacts and significantly degrades the generation
quality. To tackle this problem, we propose to learn a global facial
representation space, and design a novel implicit identity representation
conditioned memory compensation network, coined as MCNet, for high-fidelity
talking head generation. Specifically, we devise a network module to learn a
unified spatial facial meta-memory bank from all training samples, which can
provide rich facial structure and appearance priors to compensate warped source
facial features for the generation. Furthermore, we propose an effective query
mechanism based on implicit identity representations learned from the discrete
keypoints of the source image. It can greatly facilitate the retrieval of more
correlated information from the memory bank for the compensation. Extensive
experiments demonstrate that MCNet can learn representative and complementary
facial memory, and can clearly outperform previous state-of-the-art talking
head generation methods on VoxCeleb1 and CelebV datasets. Please check our
project page: https://github.com/harlanhong/ICCV2023-MCNET
Comment: Accepted by ICCV2023, update the reference and figure
An efficient virtual patient image model: interview training in pharmacy
This paper presents the development of a virtual patient simulation by a 3D talking head and its use by pharmacy students as a training aid for patient consultation. The paper concentrates on the virtual patient modeling, its synthesis with a speech engine and facial expression interaction. The virtual patient model is developed in three stages: building a personalized 3D face model; animation of the face model; and speech-driven face synthesis. The model is used in conjunction with a training artificial intelligence module that creates several scenarios in which the student's oral interview ability is assessed. The final evaluation phase is a randomized controlled trial at three partner universities: The University of Newcastle, Monash University and Charles Sturt University. It shows the potential to revolutionize the way pharmacy students’ training is conducted.
An interactive speech training system with virtual reality articulation for Mandarin-speaking hearing impaired children
The present project involved the development of a novel interactive speech training system based on virtual reality articulation and examination of the efficacy of the system for hearing impaired (HI) children. Twenty meaningful Mandarin words were presented to the HI children via a 3-D talking head during articulation training. Electromagnetic Articulography (EMA) and graphic transform technology were used to depict movements of various articulators. In addition, speech corpuses were organized in listening and speaking training modules of the system to help improve language skills of the HI children. Accuracy of virtual reality articulatory movement was evaluated through a series of experiments. Finally, a pilot test was performed to train two HI children using the system. Preliminary results showed improvement in speech production by the HI children, and the system was recognized as acceptable and interesting for children. It can be concluded that the training system is effective and valid in articulation training for HI children. © 2013 IEEE.