
    A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild

    In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or on videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio. We identify key reasons for this and resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as that of real synced videos. We provide a demo video clearly showing the substantial impact of our Wav2Lip model and evaluation benchmarks on our website: cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild. The code and models are released at this GitHub repository: github.com/Rudrabha/Wav2Lip. You can also try out the interactive demo at bhaasha.iiit.ac.in/lipsync. Comment: 9 pages (including references), 3 figures. Accepted at ACM Multimedia, 2020.
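    As a rough illustration of the central idea (a frozen, pre-trained lip-sync "expert" discriminator supervising the generator), the following PyTorch sketch shows one way such a sync loss could be wired up. The SyncExpert module, its shapes and the loss weighting are simplified placeholders, not the actual Wav2Lip implementation; see github.com/Rudrabha/Wav2Lip for the authors' code.

    # Minimal sketch: a frozen lip-sync "expert" scores audio/mouth-crop
    # agreement, and the generator is penalised when its frames are judged
    # out of sync. Shapes and encoders are illustrative placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SyncExpert(nn.Module):
        """Scores how well a mel-spectrogram chunk matches a window of mouth crops."""
        def __init__(self, audio_dim=80 * 16, video_dim=3 * 5 * 48 * 96, emb=256):
            super().__init__()
            self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(audio_dim, emb))
            self.video_enc = nn.Sequential(nn.Flatten(), nn.Linear(video_dim, emb))

        def forward(self, mel, frames):
            a = F.normalize(self.audio_enc(mel), dim=-1)
            v = F.normalize(self.video_enc(frames), dim=-1)
            return (a * v).sum(-1)                    # cosine similarity in [-1, 1]

    def sync_loss(expert, mel, generated_frames):
        # The expert is pre-trained on real talking-face video and kept frozen;
        # the generator is pushed towards frames the expert judges as in sync.
        prob = (expert(mel, generated_frames) + 1) / 2
        return F.binary_cross_entropy(prob.clamp(1e-6, 1 - 1e-6),
                                      torch.ones_like(prob))

    # total generator objective (weights illustrative):
    # loss = l1_reconstruction + 0.03 * sync_loss(expert, mel, generated_frames)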

    Audio-visual deep learning

    Human perception and learning are inherently multimodal: we interface with the world through multiple sensory streams, including vision, audition, touch, olfaction and taste. By contrast, automatic approaches for machine perception and learning have traditionally depended on single modalities, processing, for instance, video, audio or speech separately. The goal of this thesis is instead to utilize the natural co-occurrence of audio and visual information in videos to learn useful tasks. The thesis is structured around four main themes: (i) lip reading and Audio-Visual Speech Recognition (AVSR); (ii) audio-visual speech enhancement and separation; (iii) audio-visual sound source localization and detection; and (iv) sign language recognition.

    Lip reading is the ability to recognise speech by observing the speaker's lip movements; it is a challenging task with many important applications, including enabling speech-impaired individuals to communicate better. We build and improve on recent breakthroughs by exploring the use of Transformer-based architectures, proposing attention-based pooling mechanisms for representation aggregation, and using sub-word units instead of character tokenisation. These enhancements, combined with improvements to the training protocol, yield substantial performance boosts, resulting in state-of-the-art results on the challenging LRS2 and LRS3 datasets. Moreover, we develop a method for exploiting unlabelled speech video by distilling an Automatic Speech Recognition model into a lip-reading one. Finally, we show that it is possible to identify the spoken language just by observing a speaker's lip movements.

    Speech enhancement and separation increase the signal-to-noise ratio of noisy speech audio by filtering out interfering voices or background noise. Until recently, works in this area focused on solving the problem using the audio modality alone. We first propose tackling this problem audio-visually by conditioning on each speaker's lip movements. We then further improve this approach by making it robust to visual occlusions.

    Recent works have shown that it is possible to determine the spatial location of sound-making objects in video frames by exploiting correlations between the audio and video signals. We present a method to improve and extend these techniques by grouping heat maps into distinct object representations that can be used for various downstream tasks, without the need for face detectors. The resulting method is entirely self-supervised and can be used to extend tasks such as active speaker detection and speech separation to new domains, e.g. videos of cartoons or puppets. We then propose a method that uses similar principles to train object detection models without relying on human annotation, by deriving all the necessary supervision from audio-visual correspondence cues.

    Finally, we consider the problem of automatic sign-language recognition, which to date remains unsolved despite all the progress in related vision and natural language processing tasks. The main blocker is the scarcity of large-scale annotated sign-language datasets. We attempt to address this by using sign-interpreted TV broadcast footage combined with subtitles obtained from the corresponding audio speech. Towards this goal, we first train Transformer models to identify and temporally localize instances of signs in continuous signed videos, thus automatically generating thousands of annotations for a large sign vocabulary. We then directly tackle the problem of temporally aligning the asynchronous subtitles to the sign language footage.
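    One of the thesis contributions mentioned above, distilling an Automatic Speech Recognition model into a lip-reading one, can be sketched as follows. This is a generic knowledge-distillation sketch under assumed interfaces (asr_model and lip_reader are hypothetical modules producing per-frame token logits), not the thesis' exact training recipe.

    # Sketch of cross-modal distillation: a frozen ASR "teacher" provides soft
    # targets from the audio track, and a lip-reading "student" is trained to
    # match them from the silent video. Shapes: (batch, time, vocab).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        t = temperature
        log_p_student = F.log_softmax(student_logits / t, dim=-1)
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        # KL(teacher || student), scaled by t^2 as is standard for distillation
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

    # one training step (teacher frozen, modules hypothetical):
    # with torch.no_grad():
    #     teacher_logits = asr_model(audio)      # posteriors from paired audio
    # student_logits = lip_reader(video)         # predictions from silent video
    # loss = distillation_loss(student_logits, teacher_logits)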

    Voicevector: multimodal enrolment vectors for speaker separation

    We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) an enrolment network designed to craft speaker-specific embeddings, exploiting various combinations of audio and visual modalities; and (B) a separation network that accepts both the noisy signal and enrolment vectors as inputs, outputting the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from audio only, audio-visual data (using lip movements), or visual data alone (using lip movements from silent video); and (ii) the flexibility of conditioning the separation on multiple positive and negative enrolment vectors. We compare to previous methods and obtain superior performance.
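    A minimal sketch of the two-network split described above is given below: an enrolment network that maps whatever modalities are available (audio, lips, or both) to a speaker embedding, and a separation network conditioned on that embedding. The internals (simple linear layers, a mel-domain mask) are placeholders for illustration, not the paper's architecture.

    # Sketch of the enrolment/separation interface: either modality may be
    # missing when computing the enrolment vector, and the separation network
    # consumes the noisy signal together with that vector. Internals are toy.
    import torch
    import torch.nn as nn

    class EnrolmentNet(nn.Module):
        def __init__(self, audio_dim=80, visual_dim=512, emb=256):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, emb)
            self.visual_proj = nn.Linear(visual_dim, emb)

        def forward(self, audio_feats=None, visual_feats=None):
            parts = []                              # audio-only, visual-only, or both
            if audio_feats is not None:
                parts.append(self.audio_proj(audio_feats).mean(dim=1))
            if visual_feats is not None:
                parts.append(self.visual_proj(visual_feats).mean(dim=1))
            return torch.stack(parts).mean(dim=0)   # (batch, emb)

    class SeparationNet(nn.Module):
        def __init__(self, n_mels=80, emb=256, hidden=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_mels + emb, hidden), nn.ReLU(),
                nn.Linear(hidden, n_mels), nn.Sigmoid())   # mask over mel bins

        def forward(self, noisy_mel, enrolment):
            # broadcast the enrolment vector over time and predict a mask
            e = enrolment.unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
            return noisy_mel * self.net(torch.cat([noisy_mel, e], dim=-1))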

    My lips are concealed: audio-visual speech enhancement through obstructions

    Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements, a representation of their voice, or both. The voice representation can be obtained either (i) by enrollment, or (ii) by self-enrollment, i.e. learning the representation on-the-fly given sufficient unobstructed visual input. The model is trained by blending audios and by introducing artificial occlusions around the mouth region that prevent the visual modality from dominating. The method is speaker-independent, and we demonstrate it on real examples of speakers unheard (and unseen) during training. The method also improves over previous models, in particular for cases of occlusion in the visual modality.
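    The enrollment/self-enrollment distinction above can be sketched roughly as follows: when enough unobstructed visual input is available, the voice representation is computed on-the-fly from the segments already handled with visual guidance; otherwise a pre-enrolled representation (if any) is used. The function and its inputs (voice_encoder, visual confidence scores, the thresholds) are hypothetical placeholders, not the paper's method.

    # Rough sketch of self-enrollment: build the voice embedding from frames
    # where the mouth region is confidently visible, and fall back to a
    # pre-enrolled embedding otherwise. All names and thresholds are assumed.
    import torch

    def get_voice_embedding(voice_encoder, separated_audio, visual_conf,
                            pre_enrolled=None, conf_threshold=0.8, min_frames=25):
        """separated_audio: (time, feat); visual_conf: (time,) occlusion confidence."""
        usable = visual_conf > conf_threshold
        if usable.sum() >= min_frames:
            # self-enrollment: embed the speaker from the unobstructed segments
            return voice_encoder(separated_audio[usable])
        return pre_enrolled       # plain enrollment (may be None)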

    Reading to listen at the cocktail party: multi-modal speech separation

    The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities and solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
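    As a rough sketch of the kind of fusion described above (not the paper's architecture), the snippet below projects audio, visual and textual streams to a shared width, tags each with a learned modality embedding, and lets a Transformer encoder attend across all of them. All dimensions and module choices are illustrative assumptions.

    # Sketch of fusing audio with synchronous (lip) or asynchronous (text)
    # conditioning via a Transformer encoder over concatenated token streams.
    import torch
    import torch.nn as nn

    class MultiModalFusion(nn.Module):
        def __init__(self, d_model=256, n_heads=4, n_layers=2,
                     audio_dim=128, visual_dim=512, vocab=1000):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.visual_proj = nn.Linear(visual_dim, d_model)
            self.text_emb = nn.Embedding(vocab, d_model)
            self.modality = nn.Embedding(3, d_model)      # 0=audio, 1=visual, 2=text
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, audio, visual=None, text_tokens=None):
            streams = [self.audio_proj(audio) + self.modality.weight[0]]
            if visual is not None:
                streams.append(self.visual_proj(visual) + self.modality.weight[1])
            if text_tokens is not None:
                streams.append(self.text_emb(text_tokens) + self.modality.weight[2])
            fused = self.encoder(torch.cat(streams, dim=1))
            # keep only the audio positions for downstream mask/waveform prediction
            return fused[:, :audio.size(1)]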

    Sub-word level lip reading with visual attention

    The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper, we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
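    The attention-based pooling in contribution (1) can be illustrated with a short sketch: per-frame visual features are scored by a small learned module, and the aggregated representation is their softmax-weighted sum rather than a plain average. This is a generic formulation with assumed dimensions, not the paper's exact module.

    # Minimal sketch of attention-based pooling over per-frame visual features.
    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, frames, mask=None):
            # frames: (batch, time, dim); mask: (batch, time), True = valid frame
            scores = self.score(frames).squeeze(-1)              # (batch, time)
            if mask is not None:
                scores = scores.masked_fill(~mask, float("-inf"))
            weights = torch.softmax(scores, dim=-1)
            return torch.einsum("bt,btd->bd", weights, frames)   # (batch, dim)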

    Now you're speaking my language: visual language identification

    The goal of this work is to train models that can identify a spoken language just by interpreting the speaker's lip movements. Our contributions are the following: (i) we show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) we compare different designs in sequence modelling and utterance-level aggregation in order to determine the best architecture for this task; (iii) we investigate the factors that contribute discriminative cues and show that our model indeed solves the problem by finding temporal patterns in mouth movements and not by exploiting spurious correlations. We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.
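    At a high level, the task above reduces to sequence modelling over lip features followed by utterance-level aggregation and a 14-way classifier. The sketch below uses a bidirectional GRU and mean pooling purely as an illustrative stand-in for the designs compared in the paper.

    # Sketch of visual language identification: temporal model over lip
    # features, utterance-level aggregation, 14-way classifier. Dims are toy.
    import torch
    import torch.nn as nn

    class VisualLanguageID(nn.Module):
        def __init__(self, feat_dim=512, hidden=256, n_languages=14):
            super().__init__()
            self.temporal = nn.GRU(feat_dim, hidden, batch_first=True,
                                   bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, n_languages)

        def forward(self, lip_feats):                 # (batch, time, feat_dim)
            seq, _ = self.temporal(lip_feats)
            utterance = seq.mean(dim=1)               # utterance-level aggregation
            return self.classifier(utterance)         # logits over 14 languages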

    ASR is all you need: cross-modal distillation for lip reading

    Deep lip reading: a comparison of models and an online application

    The goal of this paper is to develop state-of-the-art models for lip reading – visual speech recognition. We develop three architectures and compare their accuracy and training times: (i) a recurrent model using LSTMs; (ii) a fully convolutional model; and (iii) the recently proposed transformer model. The recurrent and fully convolutional models are trained with a Connectionist Temporal Classification loss and use an explicit language model for decoding, while the transformer is a sequence-to-sequence model. Our best performing model improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent. As a further contribution, we investigate the fully convolutional model when used for online (real-time) lip reading of continuous speech, and show that it achieves high performance with low latency.
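    The two training regimes compared above can be sketched with PyTorch's built-in losses: CTC for the recurrent and fully convolutional models, and per-token cross-entropy for the sequence-to-sequence transformer. Shapes, lengths and vocabulary size below are illustrative only.

    # Sketch of the two objectives: CTC over frame-wise predictions versus
    # cross-entropy over decoder outputs. All tensors here are random dummies.
    import torch
    import torch.nn.functional as F

    B, T_in, T_out, V = 4, 75, 20, 40        # batch, input frames, target len, vocab

    # CTC: frame-wise log-probs with a blank token at index 0
    log_probs = torch.randn(T_in, B, V).log_softmax(-1)     # (time, batch, vocab)
    targets = torch.randint(1, V, (B, T_out))
    input_lengths = torch.full((B,), T_in, dtype=torch.long)
    target_lengths = torch.full((B,), T_out, dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

    # Seq2seq: the decoder predicts the next token, trained with cross-entropy
    decoder_logits = torch.randn(B, T_out, V)
    ce = F.cross_entropy(decoder_logits.reshape(-1, V), targets.reshape(-1))

    print(float(ctc), float(ce))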