47,513 research outputs found

    FaceFilter: Audio-visual speech separation using still images

    Full text link
    The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance in cross-modal biometric task, where audio and visual identity representations are shared in latent space. Learnt identities from facial images enforce the network to isolate matched speakers and extract the voices from mixed speech. It solves the permutation problem caused by swapped channel outputs, frequently occurred in speech separation tasks. The proposed method is far more practical than video-based speech separation since user profile images are readily available on many platforms. Also, unlike speaker-aware separation methods, it is applicable on separation with unseen speakers who have never been enrolled before. We show strong qualitative and quantitative results on challenging real-world examples.Comment: Under submission as a conference paper. Video examples: https://youtu.be/ku9xoLh62

    Using audio and visual information for single channel speaker separation

    Get PDF
    This work proposes a method to exploit both audio and vi- sual speech information to extract a target speaker from a mix- ture of competing speakers. The work begins by taking an ef- fective audio-only method of speaker separation, namely the soft mask method, and modifying its operation to allow visual speech information to improve the separation process. The au- dio input is taken from a single channel and includes the mix- ture of speakers, where as a separate set of visual features are extracted from each speaker. This allows modification of the separation process to include not only the audio speech but also visual speech from each speaker in the mixture. Experimen- tal results are presented that compare the proposed audio-visual speaker separation with audio-only and visual-only methods us- ing both speech quality and speech intelligibility metrics

    Audio-visual speech recognition with background music using single-channel source separation

    Get PDF
    In this paper, we consider audio-visual speech recognition with background music. The proposed algorithm is an integration of audio-visual speech recognition and single channel source separation (SCSS). We apply the proposed algorithm to recognize spoken speech that is mixed with music signals. First, the SCSS algorithm based on nonnegative matrix factorization (NMF) and spectral masks is used to separate the audio speech signal from the background music in magnitude spectral domain. After speech audio is separated from music, regular audio-visual speech recognition (AVSR) is employed using multi-stream hidden Markov models. Employing two approaches together, we try to improve recognition accuracy by both processing the audio signal with SCSS and supporting the recognition task with visual information. Experimental results show that combining audio-visual speech recognition with source separation gives remarkable improvements in the accuracy of the speech recognition system

    The Conversation: Deep Audio-Visual Speech Enhancement

    Full text link
    Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.Comment: To appear in Interspeech 2018. We provide supplementary material with interactive demonstrations on http://www.robots.ox.ac.uk/~vgg/demo/theconversatio

    Reconstructing intelligible audio speech from visual speech features

    Get PDF
    This work describes an investigation into the feasibility of producing intelligible audio speech from only visual speech fea- tures. The proposed method aims to estimate a spectral enve- lope from visual features which is then combined with an arti- ficial excitation signal and used within a model of speech pro- duction to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners, and then compared to the intelligibility of the video signal only and when combined with the reconstructed audio
    corecore