352 research outputs found
Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis
Existing automated dubbing methods are usually designed for Professionally
Generated Content (PGC) production, which requires massive training data and
training time to learn a person-specific audio-video mapping. In this paper, we
investigate an audio-driven dubbing method that is more feasible for User
Generated Content (UGC) production. There are two unique challenges to design a
method for UGC: 1) the appearances of speakers are diverse and arbitrary as the
method needs to generalize across users; 2) the available video data of one
speaker are very limited. In order to tackle the above challenges, we first
introduce a new Style Translation Network to integrate the speaking style of
the target and the speaking content of the source via a cross-modal AdaIN
module. It enables our model to quickly adapt to a new speaker. Then, we
further develop a semi-parametric video renderer, which takes full advantage of
the limited training data of the unseen speaker via a video-level
retrieve-warp-refine pipeline. Finally, we propose a temporal regularization
for the semi-parametric renderer, generating more continuous videos. Extensive
experiments show that our method generates videos that accurately preserve
various speaking styles, yet with considerably lower amount of training data
and training time in comparison to existing methods. Besides, our method
achieves a faster testing speed than most recent methods.Comment: TCSVT 202
RADIO: Reference-Agnostic Dubbing Video Synthesis
One of the most challenging problems in audio-driven talking head generation
is achieving high-fidelity detail while ensuring precise synchronization. Given
only a single reference image, extracting meaningful identity attributes
becomes even more challenging, often causing the network to mirror the facial
and lip structures too closely. To address these issues, we introduce RADIO, a
framework engineered to yield high-quality dubbed videos regardless of the pose
or expression in reference images. The key is to modulate the decoder layers
using latent space composed of audio and reference features. Additionally, we
incorporate ViT blocks into the decoder to emphasize high-fidelity details,
especially in the lip region. Our experimental results demonstrate that RADIO
displays high synchronization without the loss of fidelity. Especially in harsh
scenarios where the reference frame deviates significantly from the ground
truth, our method outperforms state-of-the-art methods, highlighting its
robustness. Pre-trained model and codes will be made public after the review.Comment: Under revie
Neural Voice Puppetry: Audio-driven Facial Reenactment
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches. Neural Voice Puppetry has a variety of use-cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples. Our method is not only more general than existing works since we are generic to the input person, but we also show superior visual and lip sync quality compared to photo-realistic audio- and video-driven reenactment techniques
Nonparallel Emotional Speech Conversion
We propose a nonparallel data-driven emotional speech conversion method. It
enables the transfer of emotion-related characteristics of a speech signal
while preserving the speaker's identity and linguistic content. Most existing
approaches require parallel data and time alignment, which is not available in
most real applications. We achieve nonparallel training based on an
unsupervised style transfer technique, which learns a translation model between
two distributions instead of a deterministic one-to-one mapping between paired
examples. The conversion model consists of an encoder and a decoder for each
emotion domain. We assume that the speech signal can be decomposed into an
emotion-invariant content code and an emotion-related style code in latent
space. Emotion conversion is performed by extracting and recombining the
content code of the source speech and the style code of the target emotion. We
tested our method on a nonparallel corpora with four emotions. Both subjective
and objective evaluations show the effectiveness of our approach.Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation
available at http://www.jian-gao.org/emoga
VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
We present VideoReTalking, a new system to edit the faces of a real-world
talking head video according to input audio, producing a high-quality and
lip-syncing output video even with a different emotion. Our system disentangles
this objective into three sequential tasks: (1) face video generation with a
canonical expression; (2) audio-driven lip-sync; and (3) face enhancement for
improving photo-realism. Given a talking-head video, we first modify the
expression of each frame according to the same expression template using the
expression editing network, resulting in a video with the canonical expression.
This video, together with the given audio, is then fed into the lip-sync
network to generate a lip-syncing video. Finally, we improve the photo-realism
of the synthesized faces through an identity-aware face enhancement network and
post-processing. We use learning-based approaches for all three steps and all
our modules can be tackled in a sequential pipeline without any user
intervention. Furthermore, our system is a generic approach that does not need
to be retrained to a specific person. Evaluations on two widely-used datasets
and in-the-wild examples demonstrate the superiority of our framework over
other state-of-the-art methods in terms of lip-sync accuracy and visual
quality.Comment: Accepted by SIGGRAPH Asia 2022 Conference Proceedings. Project page:
https://vinthony.github.io/video-retalking
OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions
One-shot talking head generation has no explicit head movement reference,
thus it is difficult to generate talking heads with head motions. Some existing
works only edit the mouth area and generate still talking heads, leading to
unreal talking head performance. Other works construct one-to-one mapping
between audio signal and head motion sequences, introducing ambiguity
correspondences into the mapping since people can behave differently in head
motions when speaking the same content. This unreasonable mapping form fails to
model the diversity and produces either nearly static or even exaggerated head
motions, which are unnatural and strange. Therefore, the one-shot talking head
generation task is actually a one-to-many ill-posed problem and people present
diverse head motions when speaking. Based on the above observation, we propose
OSM-Net, a \textit{one-to-many} one-shot talking head generation network with
natural head motions. OSM-Net constructs a motion space that contains rich and
various clip-level head motion features. Each basis of the space represents a
feature of meaningful head motion in a clip rather than just a frame, thus
providing more coherent and natural motion changes in talking heads. The
driving audio is mapped into the motion space, around which various motion
features can be sampled within a reasonable range to achieve the one-to-many
mapping. Besides, the landmark constraint and time window feature input improve
the accurate expression feature extraction and video generation. Extensive
experiments show that OSM-Net generates more natural realistic head motions
under reasonable one-to-many mapping paradigm compared with other methods.Comment: Paper Under Revie
- …