VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
The goal of this work is to reconstruct speech from a silent talking face
video. Recent studies have shown impressive performance on synthesizing speech
from silent talking face videos. However, they have not explicitly accounted for the varying identity characteristics of different speakers, which pose a challenge for video-to-speech synthesis and become even more critical in unseen-speaker settings. Our approach is to separate the speech content and the
visage-style from a given silent talking face video. By guiding the model to
focus on modeling the two representations independently, we can obtain highly intelligible speech from the model even when the input video comes from an unseen subject. To this end, we introduce a speech-visage selection module that
separates the speech content and the speaker identity from the visual features
of the input video. The disentangled representations are jointly incorporated
to synthesize speech through a visage-style based synthesizer, which generates speech by coating the visage-style onto the maintained speech content. The proposed framework can therefore synthesize speech with the correct content even from the silent talking face video of an unseen subject. We validate the effectiveness of the proposed framework on the
GRID, TCD-TIMIT volunteer, and LRW datasets.
Comment: Accepted by ECCV 2022
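As a rough illustration of the selection-and-synthesis idea described above, the sketch below splits per-frame visual features into a speech-content code and a time-pooled visage-style code, then re-applies the style to the content while decoding mel-spectrogram frames. This is not the authors' implementation: all module names, feature dimensions, and the AdaIN-style modulation are assumptions made for illustration.

```python
# Minimal sketch (not the paper's code) of speech-visage feature selection:
# visual features from a silent talking-face video are split into a
# speech-content stream and a visage-style stream, then re-combined by a
# style-conditioned decoder. Dimensions and module choices are assumptions.
import torch
import torch.nn as nn


class SpeechVisageSelector(nn.Module):
    """Splits per-frame visual features into content and style codes."""

    def __init__(self, feat_dim=512, content_dim=256, style_dim=128):
        super().__init__()
        self.content_head = nn.Linear(feat_dim, content_dim)  # speech content
        self.style_head = nn.Linear(feat_dim, style_dim)      # speaker visage-style

    def forward(self, visual_feats):                          # (B, T, feat_dim)
        content = self.content_head(visual_feats)             # (B, T, content_dim)
        style = self.style_head(visual_feats).mean(dim=1)     # (B, style_dim), time-pooled
        return content, style


class VisageStyleSynthesizer(nn.Module):
    """Generates mel-spectrogram frames by coating content with style."""

    def __init__(self, content_dim=256, style_dim=128, mel_dim=80):
        super().__init__()
        self.affine = nn.Linear(style_dim, 2 * content_dim)   # AdaIN-style scale/shift
        self.decoder = nn.GRU(content_dim, mel_dim, batch_first=True)

    def forward(self, content, style):
        gamma, beta = self.affine(style).chunk(2, dim=-1)      # (B, content_dim) each
        normed = nn.functional.layer_norm(content, content.shape[-1:])
        styled = gamma.unsqueeze(1) * normed + beta.unsqueeze(1)
        mel, _ = self.decoder(styled)                          # (B, T, mel_dim)
        return mel


if __name__ == "__main__":
    feats = torch.randn(2, 75, 512)            # dummy visual features: 2 clips, 75 frames
    content, style = SpeechVisageSelector()(feats)
    mel = VisageStyleSynthesizer()(content, style)
    print(mel.shape)                           # torch.Size([2, 75, 80])
```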
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion
Speech-driven 3D facial animation has gained significant attention for its
ability to create realistic and expressive facial animations in 3D space based
on speech. Learning-based methods have shown promising progress in achieving
accurate facial motion synchronized with speech. However, the one-to-many nature of speech-to-3D facial synthesis has not been fully explored: while the lips accurately synchronize with the speech content, facial attributes beyond speech-related motions can vary for the same speech. To account for
the potential variation of facial attributes for a single speech input, we propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis framework.
DF-3DFace captures the complex one-to-many relationships between speech and 3D
face based on diffusion. It concurrently achieves aligned lip motion by
exploiting audio-mesh synchronization and masked conditioning. Furthermore, the
proposed method jointly models identity and pose in addition to facial motions
so that it can generate 3D face animation without requiring a reference
identity mesh and produce natural head poses. We contribute a new large-scale
3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in identity, pose, and facial motion of 3D face meshes. Extensive experiments
demonstrate that our method successfully generates highly variable facial
shapes and motions from speech and simultaneously achieves more realistic
facial animation than the state-of-the-art methods.
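To make the diffusion formulation above concrete, here is a minimal sketch of one training step: Gaussian noise is added to ground-truth vertex sequences and a denoiser conditioned on audio features (with the audio condition randomly dropped, one illustrative reading of the masked conditioning mentioned above) learns to predict that noise. The vertex count, feature dimensions, noise schedule, and masking rate are assumptions, not details from the paper or the 3D-HDTF dataset.

```python
# Minimal sketch (not the DF-3DFace code) of a diffusion training step for
# speech-conditioned 3D face meshes. All hyperparameters are assumptions.
import torch
import torch.nn as nn


class MeshDenoiser(nn.Module):
    def __init__(self, n_verts=5023, audio_dim=768, hidden=512, n_steps=1000):
        super().__init__()
        self.in_proj = nn.Linear(n_verts * 3, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.time_embed = nn.Embedding(n_steps, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.out_proj = nn.Linear(hidden, n_verts * 3)

    def forward(self, noisy_verts, t, audio, cond_mask):
        # noisy_verts: (B, T, n_verts*3), audio: (B, T, audio_dim), cond_mask: (B, 1, 1)
        h = self.in_proj(noisy_verts) + self.audio_proj(audio) * cond_mask
        h = h + self.time_embed(t).unsqueeze(1)
        h, _ = self.backbone(h)
        return self.out_proj(h)                               # predicted noise


def diffusion_training_step(model, verts, audio, n_steps=1000):
    b = verts.size(0)
    t = torch.randint(0, n_steps, (b,))                       # random diffusion timestep
    beta = torch.linspace(1e-4, 0.02, n_steps)                # linear noise schedule
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)[t].view(b, 1, 1)
    noise = torch.randn_like(verts)
    noisy = alpha_bar.sqrt() * verts + (1 - alpha_bar).sqrt() * noise
    cond_mask = (torch.rand(b, 1, 1) > 0.1).float()           # drop audio condition ~10% of the time
    pred = model(noisy, t, audio, cond_mask)
    return nn.functional.mse_loss(pred, noise)


if __name__ == "__main__":
    model = MeshDenoiser(n_verts=100)                          # tiny mesh for the dummy run
    verts = torch.randn(2, 30, 100 * 3)                        # 2 clips, 30 frames of vertices
    audio = torch.randn(2, 30, 768)                            # dummy audio features
    print(diffusion_training_step(model, verts, audio).item())
```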
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
The challenge of talking face generation from speech lies in aligning two different modalities, audio and video, such that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual
representation learning or leverage intermediate structural information such as
landmarks and 3D models. However, they struggle to synthesize fine lip details that vary at the phoneme level because they do not provide sufficient visual information about the lips at the video synthesis step. To overcome this limitation, our work proposes an Audio-Lip Memory that brings in visual
information of the mouth region corresponding to input audio and enforces
fine-grained audio-visual coherence. It stores lip motion features from
sequential ground truth images in the value memory and aligns them with
corresponding audio features so that they can be retrieved using audio input at
inference time. Therefore, using the retrieved lip motion features as visual
hints, it can easily correlate audio with visual dynamics in the synthesis
step. By analyzing the memory, we demonstrate that unique lip features are
stored in each memory slot at the phoneme level, capturing subtle lip motion
based on memory addressing. In addition, we introduce a visual-visual synchronization loss, which enhances lip-syncing performance when used alongside the audio-visual synchronization loss in our model. Extensive experiments are
performed to verify that our method generates high-quality video with mouth
shapes that best align with the input audio, outperforming previous
state-of-the-art methods.
Comment: Accepted at AAAI 2022 (Oral)
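A minimal sketch of the key-value memory idea described above: audio features address a learned key memory, and the resulting attention weights read lip-motion features out of a value memory, yielding visual hints from audio alone at inference time. The slot count, feature sizes, and softmax addressing are assumptions rather than the authors' exact design.

```python
# Minimal sketch (not the SyncTalkFace implementation) of an audio-lip memory:
# audio features address key slots, and the attention weights retrieve the
# corresponding lip-motion features from the value slots.
import torch
import torch.nn as nn


class AudioLipMemory(nn.Module):
    def __init__(self, n_slots=88, audio_dim=256, lip_dim=256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, audio_dim))    # addressed by audio
        self.values = nn.Parameter(torch.randn(n_slots, lip_dim))    # store lip-motion features

    def forward(self, audio_feats):                 # (B, T, audio_dim)
        logits = audio_feats @ self.keys.t()        # (B, T, n_slots) addressing scores
        weights = logits.softmax(dim=-1)            # soft memory addressing
        return weights @ self.values                # (B, T, lip_dim) retrieved visual hints


if __name__ == "__main__":
    memory = AudioLipMemory()
    audio = torch.randn(2, 25, 256)                 # dummy audio features: 2 clips, 25 frames
    lip_hints = memory(audio)
    print(lip_hints.shape)                          # torch.Size([2, 25, 256])
```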