Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments
In this paper, we address the problem of enhancing the speech of a speaker of
interest in a cocktail party scenario when visual information of the speaker of
interest is available. Contrary to most previous studies, we do not learn
visual features on the typically small audio-visual datasets, but use an
already available face landmark detector (trained on a separate image dataset).
The landmarks are used by LSTM-based models to generate time-frequency masks
which are applied to the acoustic mixed-speech spectrogram. Results show that:
(i) landmark motion features are very effective features for this task, (ii)
similarly to previous work, reconstruction of the target speaker's spectrogram
mediated by masking is significantly more accurate than direct spectrogram
reconstruction, and (iii) the best masks depend on both motion landmark
features and the input mixed-speech spectrogram. To the best of our knowledge,
our proposed models are the first models trained and evaluated on the
limited-size GRID and TCD-TIMIT datasets that achieve speaker-independent
speech enhancement in a multi-talker setting.
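A minimal numpy sketch of the masking-based pipeline described above. The trained LSTM is replaced by a random linear map plus sigmoid, and all shapes, variable names, and values are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

T, F, D = 50, 257, 136  # frames, frequency bins, landmark-motion feature dims
mix_mag = rng.uniform(0.1, 1.0, size=(T, F))  # mixed-speech magnitude spectrogram
motion = rng.standard_normal((T, D))          # landmark motion features per frame

# Stand-in for the trained LSTM: a random linear map followed by a sigmoid,
# producing a time-frequency mask in [0, 1].
W = 0.1 * rng.standard_normal((D, F))
mask = 1.0 / (1.0 + np.exp(-(motion @ W)))

# Masking-based reconstruction: element-wise product with the mixture,
# instead of predicting the target spectrogram directly.
est_mag = mask * mix_mag
```

Because the mask lies in [0, 1], the estimate can only attenuate the mixture, which is one reason mask-mediated reconstruction tends to be better behaved than direct spectrogram regression.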
Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras
We propose a method to address audio-visual target speaker enhancement in
multi-talker environments using event-driven cameras. State-of-the-art
audio-visual speech separation methods show that the movement of the facial
landmarks related to speech production carries crucial information. However,
all approaches proposed so far work offline, using frame-based video input,
which makes it difficult to process an audio-visual signal with low latency for
online applications. To overcome this limitation, we propose the use of
event-driven cameras and exploit their compression, high temporal resolution,
and low latency for low-cost, low-latency motion feature extraction, moving
towards online embedded audio-visual speech processing. We use the event-driven optical
flow estimation of the facial landmarks as input to a stacked Bidirectional
LSTM trained to predict an Ideal Amplitude Mask that is then used to filter the
noisy audio and obtain the audio signal of the target speaker. The presented
approach performs almost on par with the frame-based approach, with very low
latency and computational cost. Comment: Accepted at ISCAS 202
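The Ideal Amplitude Mask that the BLSTM is trained to predict is commonly defined as the ratio of target to mixture magnitudes, clipped to [0, 1]. A toy sketch under that common definition (the data below is synthetic and the magnitude-additivity step is a simplification):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy magnitude spectrograms of the target speaker and an interferer.
target_mag = rng.uniform(0.0, 1.0, size=(40, 129))
noise_mag = rng.uniform(0.0, 1.0, size=(40, 129))
mix_mag = target_mag + noise_mag  # simplification: magnitudes assumed additive

eps = 1e-8
# Ideal Amplitude Mask: target-to-mixture magnitude ratio, clipped to [0, 1].
iam = np.clip(target_mag / (mix_mag + eps), 0.0, 1.0)

# At inference, the (predicted) mask filters the noisy magnitude spectrogram.
enhanced = iam * mix_mag
```

With the ideal mask, the filtered mixture recovers the target magnitude, which is why the IAM is a natural regression target for the network.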
Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras
In this work, we propose a new method to address audio-visual target speaker extraction in multi-talker environments using event-driven cameras. All audio-visual speech separation approaches use frame-based video to extract visual features. However, these frame-based cameras usually work at 30 frames per second, a limitation that makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras due to their high temporal resolution and low latency. Recent work showed that landmark motion features are very important for achieving good results in audio-visual speech separation. Thus, we use event-driven vision sensors, from which motion can be extracted at a lower latency and computational cost. A stacked Bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is then post-processed to obtain a clean audio signal. The performance of our model is close to that of the frame-based approach.
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose
is to extract either one or more target speech signals, respectively, from a
mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The ceaseless proposal of a large number of
techniques to extract features and fuse multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
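As an illustration of two training-target families such a survey distinguishes, a mask-approximation loss compares the predicted mask with an ideal mask directly, while a signal-approximation loss compares the masked mixture with the clean target. A toy numpy sketch (synthetic data, random "network output"; not the formulation of any specific system):

```python
import numpy as np

rng = np.random.default_rng(2)
T, F = 30, 65
clean = rng.uniform(0.0, 0.5, (T, F))            # clean target magnitude
mix = clean + rng.uniform(0.0, 0.5, (T, F))      # mixture magnitude
pred_mask = rng.uniform(0.0, 1.0, (T, F))        # placeholder network output
ideal_mask = np.clip(clean / (mix + 1e-8), 0.0, 1.0)

# Mask approximation: MSE between predicted and ideal masks.
loss_ma = np.mean((pred_mask - ideal_mask) ** 2)

# Signal approximation: MSE between the masked mixture and the clean target.
loss_sa = np.mean((pred_mask * mix - clean) ** 2)
```

Signal approximation is often preferred because it penalizes mask errors in proportion to their effect on the reconstructed signal rather than treating all time-frequency bins equally.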
Plain-to-clear speech video conversion for enhanced intelligibility
Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine whether visible speech cues in video alone can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels, produced by multiple male and female talkers. Via a frame-by-frame image-warping-based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear-speech videos. We evaluate the generated videos using a robust, state-of-the-art AI Lip Reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles and achieve enhanced intelligibility for AI; (2) this work suggests that universal, talker-independent clear-speech features may be utilized to modify any talker’s visual speech style; (3) we introduce the “displacement factor” as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the high-definition generated videos make them ideal candidates for human-centric intelligibility and perceptual training studies.
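In landmark space, the displacement-factor idea can be sketched as linear scaling of the plain-to-clear displacement. The landmark coordinates below are synthetic and the function name is illustrative; the actual method warps video frames rather than bare points:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy mouth landmarks (x, y) for one frame, in normalized image coordinates.
plain = rng.uniform(0.0, 1.0, size=(20, 2))        # plain-style positions
clear = plain + rng.normal(0.0, 0.05, (20, 2))     # measured clear-speech positions

def warp(plain_pts, clear_pts, alpha):
    """Scale the plain-to-clear displacement by a factor alpha.

    alpha = 0 reproduces plain speech, alpha = 1 reproduces clear speech,
    and alpha > 1 exaggerates clear-speech features beyond the measured style.
    """
    return plain_pts + alpha * (clear_pts - plain_pts)

exaggerated = warp(plain, clear, 1.5)
```

Treating the displacement factor as a continuous knob is what allows the magnitude of the style modification to be varied systematically in the intelligibility tests.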
An Analysis of Speech Enhancement and Recognition Losses in Limited Resources Multi-talker Single Channel Audio-Visual ASR
In this paper, we analyzed how audio-visual speech enhancement can help perform the ASR task in a cocktail party scenario. To this end, we considered two simple end-to-end LSTM-based models that perform single-channel audio-visual speech enhancement and phone recognition, respectively. Then, we studied how the two models interact and how training them jointly affects the final result. We analyzed different training strategies that reveal some interesting and unexpected behaviors. The experiments show that during optimization of the ASR task the speech enhancement capability of the model significantly decreases, and vice versa. Nevertheless, the joint optimization of the two tasks shows a remarkable drop in the Phone Error Rate (PER) compared to the audio-visual baseline models trained only to perform phone recognition. We analyzed the behaviors of the proposed models by using two limited-size datasets; in particular, we used the mixed-speech versions of GRID and TCD-TIMIT.
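Joint optimization of two tasks is commonly expressed as a weighted combination of their losses. A hedged sketch of this idea (the convex-combination form and the `lam` knob are illustrative assumptions, not the paper's exact formulation):

```python
def joint_loss(enh_loss, asr_loss, lam=0.5):
    """Convex combination of enhancement and recognition losses.

    lam = 1.0 -> pure speech enhancement objective;
    lam = 0.0 -> pure phone recognition objective.
    (lam is an illustrative knob, not a value taken from the paper.)
    """
    return lam * enh_loss + (1.0 - lam) * asr_loss
```

Sweeping `lam` between the two extremes is one simple way to probe the trade-off the experiments observe, where optimizing one task degrades the other.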