SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
The challenge of talking face generation from speech lies in aligning two
different modalities, audio and video, such that the mouth region
corresponds to the input audio. Previous methods either exploit audio-visual
representation learning or leverage intermediate structural information such as
landmarks and 3D models. However, they struggle to synthesize fine details of
the lips varying at the phoneme level as they do not sufficiently provide
visual information of the lips at the video synthesis step. To overcome this
limitation, our work proposes Audio-Lip Memory that brings in visual
information of the mouth region corresponding to input audio and enforces
fine-grained audio-visual coherence. It stores lip motion features from
sequential ground truth images in the value memory and aligns them with
corresponding audio features so that they can be retrieved using audio input at
inference time. Therefore, using the retrieved lip motion features as visual
hints, it can easily correlate audio with visual dynamics in the synthesis
step. By analyzing the memory, we demonstrate that unique lip features are
stored in each memory slot at the phoneme level, capturing subtle lip motion
based on memory addressing. In addition, we introduce a visual-visual
synchronization loss, which enhances lip-syncing performance when used alongside
the audio-visual synchronization loss in our model. Extensive experiments are
performed to verify that our method generates high-quality video with mouth
shapes that best align with the input audio, outperforming previous
state-of-the-art methods. Comment: Accepted at AAAI 2022 (Oral).
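As an illustration of the memory-retrieval idea above, here is a minimal PyTorch sketch (not the paper's code): a key memory is addressed by the audio feature via softmax attention, and the aligned lip-motion feature is read out of a value memory. The slot count, feature dimension, and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioLipMemory(nn.Module):
    # Hypothetical sketch of an audio-addressable lip-motion memory;
    # slot count and dimensions are illustrative, not the paper's values.
    def __init__(self, num_slots: int = 96, dim: int = 512):
        super().__init__()
        self.key_memory = nn.Parameter(torch.randn(num_slots, dim))    # addressed by audio
        self.value_memory = nn.Parameter(torch.randn(num_slots, dim))  # stores lip-motion features

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # Cosine-similarity addressing over memory slots.
        sim = F.normalize(audio_feat, dim=-1) @ F.normalize(self.key_memory, dim=-1).T
        weights = F.softmax(sim, dim=-1)     # (B, num_slots) addressing weights
        # Retrieved lip-motion feature: weighted sum over value slots.
        return weights @ self.value_memory   # (B, dim)
```

During training the value slots would be aligned with lip features extracted from ground-truth frames; at inference only the audio input is needed to retrieve the visual hint.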
CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding
This paper proposes a talking face generation method named "CP-EB" that takes
an audio signal as input and a person image as reference, and synthesizes a
photo-realistic talking video with head poses controlled by a short
video clip and proper eye-blinking embedding. Notably, both head pose and eye
blinking are important cues for deep-fake detection.
Implicit control of head pose by a driving video has already been achieved by
state-of-the-art work. Recent research shows that eye blinking is only weakly
correlated with the input audio, which suggests that blink features can be
extracted from audio and generated separately. Hence, we propose a GAN-based
architecture that extracts eye-blink features from the input audio and the
reference video respectively, trains them with a contrastive objective, and then
embeds them into the concatenated identity and pose features to generate talking
face images. Experimental results show that the proposed method generates
photo-realistic talking faces with synchronized lip motions, natural head poses
and blinking eyes. Comment: Accepted by the 21st IEEE International Symposium on Parallel and
Distributed Processing with Applications (IEEE ISPA 2023).
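The contrastive training between audio-derived and video-derived blink features could be instantiated, for example, as a symmetric InfoNCE loss over a batch; the function below is a hedged sketch, and the temperature value and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def blink_contrastive_loss(audio_blink: torch.Tensor,
                           video_blink: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    # audio_blink, video_blink: (B, D) blink embeddings from the two branches.
    a = F.normalize(audio_blink, dim=-1)
    v = F.normalize(video_blink, dim=-1)
    logits = a @ v.T / temperature                      # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    # Pull matched audio/video blink features together, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```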
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel-jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thorough experiments on several datasets and real-world samples demonstrate
that our method obtains significantly better results than state-of-the-art
methods in both quantitative and qualitative comparisons.
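One plausible reading of the dynamically adjustable pixel-wise loss is an attention-weighted reconstruction term, as in the sketch below; the L1 base loss and the attention-map convention are assumptions, not the paper's exact formulation.

```python
import torch

def dynamic_pixelwise_loss(pred: torch.Tensor,
                           target: torch.Tensor,
                           attention: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    # pred, target: (B, C, H, W) generated and ground-truth frames.
    # attention: (B, 1, H, W) map in [0, 1], high on audiovisual-correlated
    # regions such as the mouth; it re-weights the per-pixel error.
    weight = attention.expand_as(pred)
    return (weight * (pred - target).abs()).sum() / (weight.sum() + eps)
```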
That's What I Said: Fully-Controllable Talking Face Generation
The goal of this paper is to synthesise talking faces with controllable
facial motions. To achieve this goal, we propose two key ideas. The first is to
establish a canonical space where every face has the same motion patterns but
different identities. The second is to navigate a multimodal motion space that
only represents motion-related features while eliminating identity information.
To disentangle identity and motion, we introduce an orthogonality constraint
between the two different latent spaces. From this, our method can generate
natural-looking talking faces with fully controllable facial attributes and
accurate lip synchronisation. Extensive experiments demonstrate that our method
achieves state-of-the-art results in terms of both visual quality and lip-sync
score. To the best of our knowledge, we are the first to develop a talking face
generation framework that can accurately manifest full target facial motions
including lip, head pose, and eye movements in the generated video without any
additional supervision beyond RGB video with audio.
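The orthogonality constraint between the identity and motion latent spaces described above admits several instantiations; one simple, purely illustrative version penalizes the batch cross-correlation between the two codes:

```python
import torch

def orthogonality_loss(identity_z: torch.Tensor, motion_z: torch.Tensor) -> torch.Tensor:
    # identity_z: (B, Di), motion_z: (B, Dm); center each latent space.
    i = identity_z - identity_z.mean(dim=0, keepdim=True)
    m = motion_z - motion_z.mean(dim=0, keepdim=True)
    # Cross-correlation between the two spaces; driving it toward zero
    # discourages identity information from leaking into the motion code.
    cross = (i.T @ m) / identity_z.size(0)   # (Di, Dm)
    return cross.pow(2).mean()
```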
FONT: Flow-guided One-shot Talking Head Generation with Natural Head Motions
One-shot talking head generation has received growing attention in recent
years, with various creative and practical applications. An ideal, natural and
vivid generated talking head video should contain natural head pose changes.
However, it is challenging to infer head pose sequences from the driving audio,
since there is a natural gap between the audio and visual modalities. In this work, we
propose a Flow-guided One-shot model that achieves NaTural head motions (FONT)
over generated talking heads. Specifically, the head pose prediction module is
designed to generate head pose sequences from the source face and driving
audio. We add a random sampling operation and a structural similarity
constraint to model the diversity of the one-to-many mapping between the
audio and visual modalities, thus predicting natural head poses. Then we develop a
keypoint predictor that produces unsupervised keypoints from the source face,
driving audio and pose sequences to describe the facial structure information.
Finally, a flow-guided occlusion-aware generator is employed to produce
photo-realistic talking head videos from the estimated keypoints and source
face. Extensive experimental results demonstrate that FONT generates talking heads
with natural head poses and synchronized mouth shapes, outperforming other
compared methods. Comment: Accepted by ICME 2023.
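The head pose prediction module with its random sampling operation might look roughly like the sketch below; the recurrent backbone, layer sizes, and 6-DoF pose output are assumptions, and the structural similarity constraint would be applied as an additional training loss on the predicted pose sequences.

```python
import torch
import torch.nn as nn

class HeadPosePredictor(nn.Module):
    # Hypothetical sketch: a per-clip noise code is mixed with audio and
    # source-face features so the same audio can yield diverse head poses.
    def __init__(self, audio_dim=256, face_dim=256, noise_dim=64, pose_dim=6):
        super().__init__()
        self.noise_dim = noise_dim
        self.rnn = nn.GRU(audio_dim + face_dim + noise_dim, 256, batch_first=True)
        self.head = nn.Linear(256, pose_dim)

    def forward(self, audio_seq: torch.Tensor, face_feat: torch.Tensor) -> torch.Tensor:
        # audio_seq: (B, T, audio_dim); face_feat: (B, face_dim).
        B, T, _ = audio_seq.shape
        # Random sampling: one noise code per clip models the one-to-many mapping.
        z = torch.randn(B, 1, self.noise_dim, device=audio_seq.device).expand(B, T, self.noise_dim)
        face = face_feat.unsqueeze(1).expand(B, T, face_feat.size(-1))
        h, _ = self.rnn(torch.cat([audio_seq, face, z], dim=-1))
        return self.head(h)   # (B, T, pose_dim) head-pose sequence
```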