
1,774 research outputs found

Deep Multimodal Learning for Audio-Visual Speech Recognition

Authors: Goel Vaibhava; Marcheret Etienne; Mroueh Youssef
Publication date: 22/01/2015

In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean condition on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83%, demonstrating the tremendous value of the visual channel in phone classification even in audio with high signal to noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of 34.03%.
Comment: ICASSP 201
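The three PER figures quoted in the abstract can be put on a common footing by computing absolute and relative error reductions. The following is a minimal sketch of that arithmetic only; the function name `relative_reduction` and the variable names are illustrative assumptions and do not come from the paper.

```python
# PER values (in percent) as quoted in the abstract above.
PER_AUDIO_ONLY = 41.0      # audio network alone
PER_FUSED = 35.83          # fused audio-visual model
PER_FINAL = 34.03          # fused model combined with bilinear networks


def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional error reduction of `improved` over `baseline`."""
    return (baseline - improved) / baseline


# Absolute reductions are differences in percentage points;
# relative reductions are expressed as a fraction of the baseline error.
steps = [
    ("audio -> fused", PER_AUDIO_ONLY, PER_FUSED),
    ("fused -> final", PER_FUSED, PER_FINAL),
    ("audio -> final", PER_AUDIO_ONLY, PER_FINAL),
]

for label, baseline, improved in steps:
    absolute = baseline - improved
    relative = relative_reduction(baseline, improved) * 100
    print(f"{label}: {absolute:.2f} points absolute, {relative:.2f}% relative")
```

Under these figures, the fusion step accounts for most of the gain, and the bilinear combination adds a smaller further reduction on top of it.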

arXiv.org e-Print Archive

Crossref

End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

Authors: Busso Carlos; Tao Fei
Publication date: 12/09/2018

Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whispered speech) and to background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture, in a principled way, the temporal relationships between acoustic and visual features. This study explores this idea by proposing a bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are learned directly from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements of up to 1.2% under practical scenarios over an audio-only VAD baseline implemented with a deep neural network (DNN). The proposed approach achieves a 92.7% F1-score when it is evaluated using the sensors from a portable tablet in a noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high definition camera and a close-talking microphone).
Comment: Submitted to Speech Communicatio
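The abstract reports performance as an F1-score, the harmonic mean of precision and recall. The sketch below shows how that metric is computed for binary speech/non-speech decisions. It assumes frame-level labels (1 = speech, 0 = non-speech), which the abstract does not specify, and the label sequences are made-up toy data, not results from the study.

```python
from typing import Sequence


def f1_score(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """Compute F1 for the positive (speech = 1) class from paired labels."""
    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r == 1 and p == 1)
    false_pos = sum(1 for r, p in pairs if r == 0 and p == 1)
    false_neg = sum(1 for r, p in pairs if r == 1 and p == 0)
    if true_pos == 0:
        # No correct speech detections: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for eight frames (illustration only).
reference = [0, 1, 1, 1, 0, 0, 1, 1]
predicted = [0, 1, 1, 0, 0, 1, 1, 1]

print(f"F1 = {f1_score(reference, predicted):.3f}")
```

Because F1 ignores true negatives, it is not inflated by long stretches of correctly detected silence, which is one reason it is a common choice for detection tasks like SAD.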

arXiv.org e-Print Archive

Changes in the McGurk Effect Across Phonetic Contexts

Authors: Hampson Michelle; Guenther Frank H.; Cohen Michael A.; Nieto-Castanon Alfonso
Publication venue: Boston University Center for Adaptive Systems and Department of Cognitive and Neural Systems
Publication date: 01/01/2002

To investigate the process underlying audiovisual speech perception, the McGurk illusion was examined across a range of phonetic contexts. Two major changes were found. First, the frequency of illusory /g/ fusion percepts increased relative to the frequency of illusory /d/ fusion percepts as vowel context was shifted from /i/ to /a/ to /u/. This trend could not be explained by biases present in perception of the unimodal visual stimuli. However, the change found in the McGurk fusion effect across vowel environments did correspond systematically with changes in second formant frequency patterns across contexts. Second, the order of consonants in illusory combination percepts was found to depend on syllable type. This may be due to differences occurring across syllable contexts in the time courses of inputs from the two modalities, as delaying the auditory track of a vowel-consonant stimulus resulted in a change in the order of consonants perceived. Taken together, these results suggest that the speech perception system either fuses audiovisual inputs into a visually compatible percept with a similar second formant pattern to that of the acoustic stimulus or interleaves the information from different modalities, at a phonemic or subphonemic level, based on their relative arrival times.
National Institutes of Health (R01 DC02852

Boston University Institutional Repository (OpenBU)

Changes in the McGurk Effect across Phonetic Contexts. I. Fusions

Authors: Cohen Michael; Guenther Frank; Hampson Michelle
Publication venue: Boston University Center for Adaptive Systems and Department of Cognitive and Neural Systems
Publication date: 01/11/1999

The McGurk effect has generally been studied within a limited range of phonetic contexts. With the goal of characterizing the McGurk effect through a wider range of contexts, a parametric investigation across three different vowel contexts, /i/, /α/, and /u/, and two different syllable types, consonant-vowel (CV) and vowel-consonant (VC), was conducted. This paper discusses context-dependent changes found specifically in the McGurk fusion phenomenon (Part II addresses changes found in combination percepts). After normalizing for differences in the magnitude of the McGurk effect in different contexts, a large qualitative change in the effect across vowel contexts became apparent. In particular, the frequency of illusory /g/ percepts increased relative to the frequency of illusory /d/ percepts as vowel context was shifted from /i/ to /α/ to /u/. This trend was seen in both syllable sets, and held regardless of whether the visual stimulus used was a /g/ or /d/ articulation. This qualitative change in the McGurk fusion effect across vowel environments corresponded systematically with changes in the typical second formant frequency patterns of the syllables presented. The findings are therefore consistent with sensory-based theories of speech perception which emphasize the importance of second formant patterns as cues in multimodal speech perception.
National Institute on Deafness and other Communication Disorders (R29 02852); Alfred P. Sloan Foundation and National Institute on Deafness and other Communication Disorders (R29 02852

Boston University Institutional Repository (OpenBU)