771,435 research outputs found

Audio-visual speech recognition with background music using single-channel source separation

Author: Erdoğan Hakan
Grais Emad Mounir
Topkaya İbrahim Saygın
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012

In this paper, we consider audio-visual speech recognition with background music. The proposed algorithm integrates audio-visual speech recognition with single-channel source separation (SCSS). We apply the proposed algorithm to recognize speech that is mixed with music signals. First, an SCSS algorithm based on nonnegative matrix factorization (NMF) and spectral masks is used to separate the speech signal from the background music in the magnitude spectral domain. After the speech audio is separated from the music, regular audio-visual speech recognition (AVSR) is performed using multi-stream hidden Markov models. By employing the two approaches together, we try to improve recognition accuracy both by processing the audio signal with SCSS and by supporting the recognition task with visual information. Experimental results show that combining audio-visual speech recognition with source separation gives remarkable improvements in the accuracy of the speech recognition system.
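The separation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean multiplicative-update rule, the rank of 4, the mask exponent `p`, and the function names `nmf` and `separate` are all assumptions chosen for brevity, and random matrices stand in for real magnitude spectrograms.

```python
import numpy as np

def nmf(V, rank, n_iter=200, W=None, seed=0, eps=1e-9):
    """Euclidean NMF with multiplicative updates, V ~ W @ H.
    If W is supplied it is held fixed and only H is estimated."""
    rng = np.random.default_rng(seed)
    fixed_W = W is not None
    if not fixed_W:
        W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if not fixed_W:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def separate(V_mix, W_speech, W_music, p=2, eps=1e-9):
    """Decompose the mixture on the stacked, fixed bases and build a
    spectral mask from the two partial reconstructions (p is an assumed exponent)."""
    W = np.hstack([W_speech, W_music])
    _, H = nmf(V_mix, W.shape[1], W=W)
    k = W_speech.shape[1]
    S = W_speech @ H[:k]   # speech part of the reconstruction
    M = W_music @ H[k:]    # music part of the reconstruction
    mask = S**p / (S**p + M**p + eps)
    return mask * V_mix, (1.0 - mask) * V_mix, mask

# Toy data: random nonnegative matrices in place of magnitude spectrograms.
rng = np.random.default_rng(1)
V_speech = rng.random((16, 40))
V_music = rng.random((16, 40))
W_s, _ = nmf(V_speech, rank=4)   # bases trained on speech-only data
W_m, _ = nmf(V_music, rank=4)    # bases trained on music-only data
V_mix = V_speech[:, :10] + V_music[:, :10]
S_hat, M_hat, mask = separate(V_mix, W_s, W_m)
```

Here `S_hat` is the masked speech magnitude estimate; in a full pipeline it would be combined with the mixture phase and inverted to a waveform before recognition, a step this sketch leaves out.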

CiteSeerX

Crossref

University of Surrey

Sabanci University Research Database

Surrey Research Insight

The Conversation: Deep Audio-Visual Speech Enhancement

Author: Afouras Triantafyllos
Chung Joon Son
Zisserman Andrew
Publication date: 01/01/2018

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples. Comment: To appear in Interspeech 2018. We provide supplementary material with interactive demonstrations on http://www.robots.ox.ac.uk/~vgg/demo/theconversatio

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Using audio and visual information for single channel speaker separation

Author: Khan Faheem
Milner Ben
Publication date: 01/01/2015

This work proposes a method to exploit both audio and visual speech information to extract a target speaker from a mixture of competing speakers. The work begins by taking an effective audio-only method of speaker separation, namely the soft mask method, and modifying its operation to allow visual speech information to improve the separation process. The audio input is taken from a single channel and includes the mixture of speakers, whereas a separate set of visual features is extracted from each speaker. This allows modification of the separation process to include not only the audio speech but also visual speech from each speaker in the mixture. Experimental results are presented that compare the proposed audio-visual speaker separation with audio-only and visual-only methods using both speech quality and speech intelligibility metrics.
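For readers unfamiliar with soft masks, the following sketch shows the general idea only. The abstract does not say how the visual features enter the estimate, so the per-speaker power estimates here are a hypothetical input (random numbers in this toy), and the names `soft_masks` and `apply_masks` are illustrative.

```python
import numpy as np

def soft_masks(speaker_estimates, eps=1e-9):
    """speaker_estimates: nonnegative array of shape
    (n_speakers, n_freq, n_frames) holding a power estimate per speaker.
    Returns ratio masks that sum to roughly 1 over speakers in every bin."""
    est = np.asarray(speaker_estimates, dtype=float)
    return est / (est.sum(axis=0, keepdims=True) + eps)

def apply_masks(mixture_spec, masks):
    """Scale each time-frequency bin of the (complex) mixture by each mask."""
    return masks * mixture_spec[np.newaxis]

# Toy data: hypothetical estimates for two speakers and a complex mixture.
rng = np.random.default_rng(0)
estimates = rng.random((2, 8, 5)) + 0.1
mixture = rng.standard_normal((8, 5)) + 1j * rng.standard_normal((8, 5))
masks = soft_masks(estimates)
separated = apply_masks(mixture, masks)
```

Unlike a binary mask, which assigns each bin wholly to one speaker, the ratio form splits the energy of every bin among the speakers in proportion to their estimates.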

University of East Anglia digital repository