Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)

Abstract

Although current automatic speech recognition (ASR) systems perform remarkably well on a variety of recognition tasks in clean audio conditions, their accuracy degrades with increasing levels of environmental noise. New approaches are needed to address this lack of robustness to noise. In this paper, we propose a multi-sensor approach to ASR, where visual information, in addition to the standard audio information, is obtained from the speaker's face in a second channel. Audio-visual ASR, where both an audio channel and a visual channel are input to the recognition system, has already been demonstrated to outperform traditional audio-only ASR in noisy conditions [5, 6]. In addition to audio-visual ASR, the visual modality has been investigated as a means of enhancement, where clean audio features are estimated from audio-visual speech when the audio channel is corrupted by noise [3, 4]. However, in [4] for example, the ASR performance of linear audio-visual enhancement (where clean audio features are estimated via linear filtering of the noisy audio-visual features) remains significantly inferior to the performance of audio-visual ASR. In this paper, we introduce a non-linear enhancement technique called Audio-Visual Codebook Dependent Cepstral Normalization (AVCDCN) and consider its use with both audio-only ASR and audio-visual ASR. AVCDCN is inspired by CDCN (codebook dependent cepstral normalization), an audio-only enhancement technique.
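For readers who want a concrete picture of the codebook-dependent estimator that CDCN-style methods build on, below is a minimal NumPy sketch, not the paper's implementation. It assumes a diagonal-covariance Gaussian codebook over audio-visual features, per-class correction vectors learned offline, and the common CDCN-style form x_hat = y - sum_k p(k | f_av) r_k; all names, shapes, and the sign convention are illustrative assumptions.

    import numpy as np

    def gaussian_log_likelihoods(f_av, means, variances):
        # Log-likelihood of one audio-visual feature vector under each of the
        # K diagonal-covariance Gaussian codebook classes (means, variances: K x D).
        diff = f_av - means
        return -0.5 * np.sum(diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=1)

    def avcdcn_enhance(y_audio, f_av, means, variances, priors, corrections):
        # Estimate clean audio cepstra for one frame.
        #   y_audio     : (Da,)   noisy audio cepstral vector
        #   f_av        : (D,)    concatenated noisy audio-visual feature vector
        #   priors      : (K,)    codebook class priors
        #   corrections : (K, Da) per-class correction vectors r_k (assumed learned offline)
        log_post = gaussian_log_likelihoods(f_av, means, variances) + np.log(priors)
        log_post -= log_post.max()              # numerical stability
        post = np.exp(log_post)
        post /= post.sum()                      # class posteriors p(k | f_av)
        # Non-linear enhancement: subtract the posterior-weighted correction.
        # (The subtraction sign is a common CDCN convention, assumed here.)
        return y_audio - post @ corrections

    # Toy usage with random parameters, purely to show the shapes involved.
    rng = np.random.default_rng(0)
    K, D, Da = 8, 30, 24                        # codebook size, AV dim, audio dim
    means = rng.standard_normal((K, D))
    variances = np.ones((K, D))
    priors = np.full(K, 1.0 / K)
    corrections = 0.1 * rng.standard_normal((K, Da))
    x_hat = avcdcn_enhance(rng.standard_normal(Da), rng.standard_normal(D),
                           means, variances, priors, corrections)

Note that the visual information enters only through the class posteriors p(k | f_av), which is what makes the correction non-linear in the observed features; a linear audio-visual filter, by contrast, applies one fixed transform to every frame.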
