207 research outputs found
Vocoder-Based Speech Synthesis from Silent Videos
Both acoustic and visual information influence human perception of speech.
For this reason, the absence of audio in a video sequence results in extremely
low speech intelligibility for untrained lip readers. In this paper, we present
a way to synthesise speech from the silent video of a talker using deep
learning. The system learns a mapping function from raw video frames to
acoustic features and reconstructs the speech with a vocoder synthesis
algorithm. To improve speech reconstruction performance, our model is also
trained to predict text information in a multi-task learning fashion and it is
able to simultaneously reconstruct and recognise speech in real time. The
results in terms of estimated speech quality and intelligibility show the
effectiveness of our method, which exhibits an improvement over existing
video-to-speech approaches.
Comment: Accepted to Interspeech 202
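The multi-task training described above combines an acoustic reconstruction objective with a text-prediction objective. A minimal numpy sketch of such a combined loss is shown below; the MSE/cross-entropy pairing, the weighting scheme, and all names are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def multitask_loss(pred_acoustic, true_acoustic,
                   pred_text_logits, true_text_ids, alpha=0.5):
    """Weighted sum of an acoustic reconstruction loss (MSE) and a
    text-prediction loss (softmax cross-entropy).
    All names and the fixed weighting are illustrative assumptions."""
    # Acoustic branch: mean squared error over vocoder features.
    mse = np.mean((pred_acoustic - true_acoustic) ** 2)
    # Text branch: numerically stable log-softmax, then cross-entropy.
    logits = pred_text_logits - pred_text_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(true_text_ids)), true_text_ids])
    # Single scalar combining both tasks, as in multi-task learning.
    return alpha * mse + (1.0 - alpha) * ce
```

In practice the text branch would use a sequence loss (e.g. CTC) over frame-level predictions; the scalar weight alpha trades off the two tasks.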
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Significant progress has been made in speaker-dependent Lip-to-Speech
synthesis, which aims to generate speech from silent videos of talking faces.
Current state-of-the-art approaches primarily employ non-autoregressive
sequence-to-sequence architectures to directly predict mel-spectrograms or
audio waveforms from lip representations. We hypothesize that the direct
mel-prediction hampers training/model efficiency due to the entanglement of
speech content with ambient information and speaker characteristics. To this
end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
First, a non-autoregressive sequence-to-sequence model maps self-supervised
visual features to a representation of disentangled speech content. A vocoder
then converts the speech features into raw waveforms. Extensive evaluations
confirm the effectiveness of our setup, achieving state-of-the-art performance
on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT
datasets. Speech samples from RobustL2S can be found at
https://neha-sherin.github.io/RobustL2S
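The modular two-stage design described above can be sketched as two composed functions: a stage that maps visual features to a content representation, followed by a vocoder stage that expands content frames into waveform samples. The sketch below stands in for both learned models with linear projections; every name, shape, and the use of plain matrix products are assumptions for illustration only:

```python
import numpy as np

def visual_to_content(visual_feats, proj):
    """Stage 1 (stub): map self-supervised visual features to a
    disentangled speech-content representation. A real system would
    use a trained non-autoregressive sequence-to-sequence model."""
    return visual_feats @ proj  # (T, d_v) -> (T, d_c)

def vocode(content_feats, decoder):
    """Stage 2 (stub): expand each content frame to a block of audio
    samples and concatenate. Stands in for a neural vocoder;
    'decoder' is a hypothetical frame-to-samples matrix."""
    frames = content_feats @ decoder  # (T, d_c) -> (T, hop)
    return frames.reshape(-1)         # flatten frames to 1-D audio

# Toy end-to-end run: 10 video frames -> 800 audio samples.
rng = np.random.default_rng(0)
T, d_v, d_c, hop = 10, 16, 8, 80
content = visual_to_content(rng.normal(size=(T, d_v)),
                            rng.normal(size=(d_v, d_c)))
wav = vocode(content, rng.normal(size=(d_c, hop)))
```

The point of the modularization is that the two stages can be trained and swapped independently, since they communicate only through the intermediate content representation.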
SVTS: scalable video-to-speech synthesis
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets. To the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
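The intermediate representation in the framework above is a mel-frequency spectrogram, which the predictor estimates and the vocoder inverts. The sketch below builds a standard triangular mel filterbank, the matrix that maps a power spectrum onto the mel scale; the parameter values (80 mels, 512-point FFT, 16 kHz) are common defaults, not necessarily the paper's configuration:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels=80, n_fft=512, sr=16000):
    """Triangular mel filters mapping an (n_fft//2 + 1)-bin power
    spectrum to n_mels mel bands. Defaults are common choices,
    assumed here for illustration."""
    # Equally spaced points on the mel scale, converted back to Hz.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):   # rising slope of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb
```

A mel spectrogram is then `fb @ power_spectrum` per frame; predicting this compact representation, rather than raw waveforms, is what lets a simple feedforward model scale to large datasets.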
Leveraging audio-visual speech effectively via deep learning
The rising popularity of neural networks, combined with the recent proliferation of online audio-visual media, has led to a revolution in the way machines encode, recognize, and generate acoustic and visual speech. Despite the ubiquity of naturally paired audio-visual data, only a limited number of works have applied recent advances in deep learning to leverage the duality between audio and video within this domain. This thesis considers the use of neural networks to learn from large unlabelled datasets of audio-visual speech to enable new practical applications. We begin by training a visual speech encoder that predicts latent features extracted from the corresponding audio on a large unlabelled audio-visual corpus. We apply the trained visual encoder to improve performance on lip reading in real-world scenarios. Following this, we extend the idea of video learning from audio by training a model to synthesize raw speech directly from raw video, without the need for text transcriptions. Remarkably, we find that this framework is capable of reconstructing intelligible audio from videos of new, previously unseen speakers. We also experiment with a separate speech reconstruction framework, which leverages recent advances in sequence modeling and spectrogram inversion to improve the realism of the generated speech. We then apply our research in video-to-speech synthesis to advance the state-of-the-art in audio-visual speech enhancement, by proposing a new vocoder-based model that performs particularly well under extremely noisy scenarios. Lastly, we aim to fully realize the potential of paired audio-visual data by proposing two novel frameworks that leverage acoustic and visual speech to train two encoders that learn from each other simultaneously. We leverage these pre-trained encoders for deepfake detection, speech recognition, and lip reading, and find that they consistently yield improvements over training from scratch.