
    The Conversation: Deep Audio-Visual Speech Enhancement

    Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples. Comment: To appear in Interspeech 2018. We provide supplementary material with interactive demonstrations on http://www.robots.ox.ac.uk/~vgg/demo/theconversatio
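    A minimal sketch of the idea summarised in this abstract: a network that takes noisy-mixture spectrogram features plus lip-region (visual) embeddings and predicts both a magnitude estimate and a phase estimate for the target speaker. This is not the authors' implementation; the module names, dimensions, additive fusion, and mask/residual output parameterisation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AVEnhancementSketch(nn.Module):
    """Illustrative audio-visual enhancement model (not the paper's architecture)."""
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # encode the noisy magnitude spectrogram frame by frame
        self.audio_rnn = nn.GRU(n_freq, hidden, batch_first=True, bidirectional=True)
        # project visual (lip-region) embeddings into the same temporal space
        self.visual_proj = nn.Linear(visual_dim, 2 * hidden)
        # two output heads: a magnitude mask and a phase correction
        self.mag_head = nn.Linear(2 * hidden, n_freq)
        self.phase_head = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_mag, noisy_phase, visual_feats):
        # noisy_mag, noisy_phase: (batch, frames, n_freq)
        # visual_feats: (batch, frames, visual_dim), assumed upsampled to the audio frame rate
        a, _ = self.audio_rnn(noisy_mag)
        fused = a + self.visual_proj(visual_feats)        # simple additive fusion (assumption)
        mask = torch.sigmoid(self.mag_head(fused))        # bounded magnitude mask
        phase_residual = torch.tanh(self.phase_head(fused)) * torch.pi
        enhanced_mag = mask * noisy_mag                   # predicted target magnitude
        enhanced_phase = noisy_phase + phase_residual     # phase predicted as a correction
        return enhanced_mag, enhanced_phase

if __name__ == "__main__":
    # random tensors stand in for STFT magnitude/phase and lip embeddings
    model = AVEnhancementSketch()
    mag = torch.rand(2, 100, 257)
    phase = (torch.rand(2, 100, 257) - 0.5) * 2 * torch.pi
    lips = torch.rand(2, 100, 512)
    out_mag, out_phase = model(mag, phase, lips)
    print(out_mag.shape, out_phase.shape)
```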

    A Novel Real-Time, Lightweight Chaotic-Encryption Scheme for Next-Generation Audio-Visual Hearing Aids

    Next-generation audio-visual (AV) hearing aids stand as a major enabler to realize more intelligible audio. However, high data rate, low latency, low computational complexity, and privacy are some of the major bottlenecks to the successful deployment of such advanced hearing aids. To address these challenges, we propose an integration of 5G Cloud-Radio Access Network (C-RAN), Internet of Things (IoT), and strong privacy algorithms to fully benefit from the possibilities these technologies have to offer. Existing audio-only hearing aids are known to perform poorly in situations with overwhelming background noise. Current devices make the signal more audible but remain deficient in restoring intelligibility. Thus, there is a need for hearing aids that can selectively amplify the attended talker or filter out acoustic clutter. The proposed 5G IoT-enabled AV hearing-aid framework transmits encrypted, compressed AV information and receives encrypted, enhanced reconstructed speech in real time to counter cybersecurity threats such as location-privacy leakage and eavesdropping. For the security implementation, a real-time lightweight AV encryption scheme is proposed, based on a piecewise linear chaotic map (PWLSM), a Chebyshev map, and secure hash and S-Box algorithms. For speech enhancement, the securely received AV (including lip-reading) information in the cloud is used to filter noisy audio using both deep learning and analytical acoustic modelling. To offload computational complexity and meet real-time constraints, the framework runs deep learning and big-data optimization processes in the background on the cloud. The effectiveness and security of the proposed 5G IoT-enabled AV hearing-aid framework are extensively evaluated using widely known security metrics. Our newly reported, deep learning-driven lip-reading approach for speech enhancement is evaluated under four different dynamic real-world scenarios (cafe, street, public transport, pedestrian area) using the benchmark Grid and ChiME3 corpora. A comparative critical analysis in terms of both speech enhancement and AV encryption demonstrates the potential of the envisioned technology to deliver high-quality speech reconstruction and secure mobile AV hearing-aid communication. We believe the proposed 5G IoT-enabled AV hearing-aid framework is an effective and feasible solution and represents a step change in the development of next-generation multimodal digital hearing aids. Ongoing and future work includes more extensive evaluation, comparison with benchmark lightweight encryption algorithms, and hardware prototype implementation.
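    To make the encryption idea concrete, here is a minimal, illustrative sketch of chaotic keystream encryption in the spirit of the scheme described above: a piecewise linear chaotic map and a Chebyshev map generate a keystream that is XOR-ed with the payload. The map parameters, key schedule, and the hash and S-Box stages of the actual proposal are not reproduced; everything below is an assumption for demonstration only and is not a vetted cipher.

```python
import math

def pwlcm(x, p):
    """Piecewise linear chaotic map on (0, 1) with control parameter 0 < p < 0.5."""
    if x < p:
        return x / p
    if x <= 0.5:
        return (x - p) / (0.5 - p)
    return pwlcm(1.0 - x, p)  # symmetric upper half

def chebyshev(x, k=4.0):
    """Chebyshev chaotic map on [-1, 1]."""
    return math.cos(k * math.acos(max(-1.0, min(1.0, x))))

def keystream(n_bytes, x0=0.37, p=0.29, y0=0.61):
    """Generate n_bytes of keystream by mixing both chaotic trajectories."""
    x, y = x0, y0
    out = bytearray()
    for _ in range(n_bytes):
        x = pwlcm(x, p)
        y = chebyshev(y)
        # combine both trajectories into one byte (simple mixing, assumption)
        out.append(int((x + (y + 1.0) / 2.0) * 1e6) % 256)
    return bytes(out)

def xor_encrypt(data, key_bytes):
    """Stream-cipher style XOR; applying it again with the same keystream decrypts."""
    return bytes(b ^ k for b, k in zip(data, key_bytes))

if __name__ == "__main__":
    frame = b"compressed AV frame payload"   # hypothetical payload
    ks = keystream(len(frame))
    ciphertext = xor_encrypt(frame, ks)
    assert xor_encrypt(ciphertext, ks) == frame
    print(ciphertext.hex())
```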

    Leveraging audio-visual speech effectively via deep learning

    The rising popularity of neural networks, combined with the recent proliferation of online audio-visual media, has led to a revolution in the way machines encode, recognize, and generate acoustic and visual speech. Despite the ubiquity of naturally paired audio-visual data, only a limited number of works have applied recent advances in deep learning to leverage the duality between audio and video within this domain. This thesis considers the use of neural networks to learn from large unlabelled datasets of audio-visual speech to enable new practical applications. We begin by training a visual speech encoder that predicts latent features extracted from the corresponding audio on a large unlabelled audio-visual corpus. We apply the trained visual encoder to improve performance on lip reading in real-world scenarios. Following this, we extend the idea of video learning from audio by training a model to synthesize raw speech directly from raw video, without the need for text transcriptions. Remarkably, we find that this framework is capable of reconstructing intelligible audio from videos of new, previously unseen speakers. We also experiment with a separate speech reconstruction framework, which leverages recent advances in sequence modeling and spectrogram inversion to improve the realism of the generated speech. We then apply our research in video-to-speech synthesis to advance the state-of-the-art in audio-visual speech enhancement, by proposing a new vocoder-based model that performs particularly well under extremely noisy scenarios. Lastly, we aim to fully realize the potential of paired audio-visual data by proposing two novel frameworks that leverage acoustic and visual speech to train two encoders that learn from each other simultaneously. We leverage these pre-trained encoders for deepfake detection, speech recognition, and lip reading, and find that they consistently yield improvements over training from scratch.
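    A minimal sketch of the cross-modal pre-training idea described in this abstract: a visual speech encoder is trained on unlabelled audio-visual clips to regress latent features extracted from the paired audio, and the resulting encoder can later be fine-tuned for downstream tasks. The encoders, the stand-in audio feature extractor, and the regression loss below are illustrative assumptions, not the thesis' actual architecture.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Maps a sequence of (flattened) lip-region frames to per-frame embeddings."""
    def __init__(self, frame_dim=112 * 112, hidden=256, out_dim=128):
        super().__init__()
        self.frame_mlp = nn.Sequential(nn.Linear(frame_dim, hidden), nn.ReLU())
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, frames):                 # frames: (batch, T, frame_dim)
        h, _ = self.temporal(self.frame_mlp(frames))
        return self.head(h)                    # (batch, T, out_dim)

class AudioFeatureExtractor(nn.Module):
    """Stand-in for a pretrained acoustic model that produces target latents."""
    def __init__(self, n_mels=80, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(n_mels, out_dim)

    @torch.no_grad()
    def forward(self, mels):                   # mels: (batch, T, n_mels)
        return self.proj(mels)

def train_step(visual_enc, audio_ext, frames, mels, optimiser):
    """One self-supervised step: predict audio latents from video only."""
    target = audio_ext(mels)                   # no gradients into the audio side
    pred = visual_enc(frames)
    loss = nn.functional.mse_loss(pred, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

if __name__ == "__main__":
    venc, aext = VisualEncoder(), AudioFeatureExtractor()
    opt = torch.optim.Adam(venc.parameters(), lr=1e-4)
    frames = torch.rand(2, 50, 112 * 112)      # fake lip crops, flattened per frame
    mels = torch.rand(2, 50, 80)               # fake paired mel-spectrograms
    print(train_step(venc, aext, frames, mels, opt))
```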