549 research outputs found
Hierarchical Cross-Modal Talking Face Generationwith Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. We, humans, are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid those pixel jittering
problems and to enforce the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thoughtful experiments on several datasets and real-world samples demonstrate
significantly better results obtained by our method than the state-of-the-art
methods in both quantitative and qualitative comparisons
HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods
Talking Face Generation (TFG) aims to reconstruct facial movements to achieve
high natural lip movements from audio and facial features that are under
potential connections. Existing TFG methods have made significant advancements
to produce natural and realistic images. However, most work rarely takes visual
quality into consideration. It is challenging to ensure lip synchronization
while avoiding visual quality degradation in cross-modal generation methods. To
address this issue, we propose a universal High-Definition Teeth Restoration
Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth
regions at an extremely fast speed while maintaining synchronization, and
temporal consistency. In particular, we propose a Fine-Grained Feature Fusion
(FGFF) module to effectively capture fine texture feature information around
teeth and surrounding regions, and use these features to fine-grain the feature
map to enhance the clarity of teeth. Extensive experiments show that our method
can be adapted to arbitrary TFG methods without suffering from lip
synchronization and frame coherence. Another advantage of HDTR-Net is its
real-time generation ability. Also under the condition of high-definition
restoration of talking face video synthesis, its inference speed is
faster than the current state-of-the-art face restoration based on
super-resolution.Comment: 15pages, 6 figures, PRCV202
- …