22,544 research outputs found
Hierarchical Cross-Modal Talking Face Generationwith Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. We, humans, are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid those pixel jittering
problems and to enforce the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thoughtful experiments on several datasets and real-world samples demonstrate
significantly better results obtained by our method than the state-of-the-art
methods in both quantitative and qualitative comparisons
Lip Reading Sentences in the Wild
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem - unconstrained natural language sentences,
and in the wild videos.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation) takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de.Comment: To appear in CVPR 201
Speech-driven facial animations improve speech-in-noise comprehension of humans
Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have allowed to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield a yet higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments
- …