Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement
Recently, audio-visual speech enhancement has been tackled in an
unsupervised setting based on variational auto-encoders (VAEs), where during
training only clean data is used to train a generative model for speech, which
at test time is combined with a noise model, e.g. nonnegative matrix
factorization (NMF), whose parameters are learned without supervision.
Consequently, the proposed model is agnostic to the noise type. When visual
data are clean, audio-visual VAE-based architectures usually outperform their
audio-only counterparts. The opposite happens when the visual data are corrupted
by clutter, e.g. the speaker not facing the camera. In this paper, we propose
to find the optimal combination of these two architectures through time. More
precisely, we introduce the use of a latent sequential variable with Markovian
dependencies to switch between different VAE architectures through time in an
unsupervised manner, leading to the switching variational auto-encoder (SwVAE). We
propose a variational factorization to approximate the computationally
intractable posterior distribution. We also derive the corresponding
variational expectation-maximization algorithm to estimate the parameters of
the model and enhance the speech signal. Our experiments demonstrate the
promising performance of SwVAE.
Comment: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
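As a rough illustration of the switching idea described above, the sketch below writes the model as a Markov chain over two per-frame decoders; the notation (m_t, z_t, s_ft, tau, sigma) is assumed for this example and is not taken from the paper.

```latex
% Assumed notation for this illustration (not the paper's): m_t in {1, 2}
% selects the audio-only (1) or audio-visual (2) decoder at frame t,
% z_t is the VAE latent, and s_{ft} a clean-speech STFT coefficient.
\begin{align}
  p(m_t = k \mid m_{t-1} = l) &= \tau_{lk}, \qquad k, l \in \{1, 2\},\\
  z_t &\sim \mathcal{N}(0, I),\\
  s_{ft} \mid z_t,\, m_t = k &\sim \mathcal{N}_c\bigl(0, \sigma^2_{k,f}(z_t)\bigr),
\end{align}
% where \sigma^2_{k,f}(\cdot) is the speech variance produced by decoder k, and
% the intractable posterior over (m_{1:T}, z_{1:T}) is approximated with the
% factorized variational distribution mentioned in the abstract.
```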
Deep Variational Generative Models for Audio-visual Speech Separation
In this paper, we are interested in audio-visual speech separation given a
single-channel audio recording as well as visual information (lips movements)
associated with each speaker. We propose an unsupervised technique based on
audio-visual generative modeling of clean speech. More specifically, during
training, a latent variable generative model is learned from clean speech
spectrograms using a variational auto-encoder (VAE). To better utilize the
visual information, the posteriors of the latent variables are inferred from
mixed speech (instead of clean speech) as well as the visual data. The visual
modality also serves as a prior for latent variables, through a visual network.
At test time, the learned generative model (both for speaker-independent and
speaker-dependent scenarios) is combined with an unsupervised non-negative
matrix factorization (NMF) variance model for background noise. All the latent
variables and noise parameters are then estimated by a Monte Carlo
expectation-maximization algorithm. Our experiments show that the proposed
unsupervised VAE-based method yields better separation performance than
NMF-based approaches as well as a supervised deep learning-based technique.
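The following PyTorch sketch illustrates the kind of architecture the abstract describes: a visual network that parameterizes a Gaussian prior on the latents, an encoder conditioned on the mixture spectrogram and visual features, and a decoder that outputs a clean-speech variance. All module names, layer sizes, and the toy usage are illustrative assumptions, not the authors' implementation; at test time the predicted variance would be combined with an NMF noise variance as described above.

```python
import torch
import torch.nn as nn

class AudioVisualVAE(nn.Module):
    """Illustrative sketch (not the authors' code): a VAE whose encoder sees the
    mixture power spectrogram plus a visual embedding, and whose decoder
    outputs a clean-speech variance for each frequency bin."""

    def __init__(self, n_freq=513, n_visual=128, n_latent=32, n_hidden=256):
        super().__init__()
        # Visual network: turns per-frame lip features into a Gaussian prior on z.
        self.visual_prior = nn.Sequential(nn.Linear(n_visual, n_hidden), nn.Tanh())
        self.prior_mean = nn.Linear(n_hidden, n_latent)
        self.prior_logvar = nn.Linear(n_hidden, n_latent)
        # Encoder: approximate posterior q(z | mixture, visual).
        self.encoder = nn.Sequential(nn.Linear(n_freq + n_visual, n_hidden), nn.Tanh())
        self.post_mean = nn.Linear(n_hidden, n_latent)
        self.post_logvar = nn.Linear(n_hidden, n_latent)
        # Decoder: speech variance sigma^2_f(z) for each frequency bin.
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.Tanh(), nn.Linear(n_hidden, n_freq)
        )

    def forward(self, mix_pow, visual):
        h_v = self.visual_prior(visual)
        prior = (self.prior_mean(h_v), self.prior_logvar(h_v))
        h = self.encoder(torch.cat([mix_pow, visual], dim=-1))
        mean, logvar = self.post_mean(h), self.post_logvar(h)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)  # reparameterization
        speech_var = torch.exp(self.decoder(z))  # exponential keeps the variance positive
        return speech_var, (mean, logvar), prior

# Toy usage on a batch of 4 frames; in the framework described above, speech_var
# would enter a Wiener-type filter together with an NMF noise variance (W @ H).
model = AudioVisualVAE()
mix_pow = torch.rand(4, 513)   # mixture power spectrogram frames
visual = torch.rand(4, 128)    # per-frame visual (lip) embeddings
speech_var, posterior, prior = model(mix_pow, visual)
print(speech_var.shape)        # torch.Size([4, 513])
```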
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks whose purpose
is to extract, respectively, a single target speech signal or several target
speech signals from a mixture of sounds generated by several sources.
Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The continuous stream of newly proposed
techniques for extracting features and fusing multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
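To make one of the surveyed fusion elements concrete, here is a minimal sketch of concatenation-based audio-visual fusion feeding a time-frequency mask estimator; the dimensions, layers, and names are illustrative assumptions, not a specific system from the survey.

```python
import torch
import torch.nn as nn

class ConcatFusionMaskNet(nn.Module):
    """Illustrative sketch of concatenation-based audio-visual fusion:
    per-frame audio and visual embeddings are concatenated and mapped
    to a mask applied to the noisy magnitude spectrogram."""

    def __init__(self, n_freq=257, n_visual=64, n_hidden=256):
        super().__init__()
        self.audio_embed = nn.Linear(n_freq, n_hidden)
        self.visual_embed = nn.Linear(n_visual, n_hidden)
        self.fusion = nn.Sequential(
            nn.Linear(2 * n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_freq), nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, noisy_mag, visual_feat):
        a = torch.relu(self.audio_embed(noisy_mag))
        v = torch.relu(self.visual_embed(visual_feat))
        mask = self.fusion(torch.cat([a, v], dim=-1))
        return mask * noisy_mag  # enhanced magnitude estimate

# Toy usage on 10 frames of assumed feature sizes.
net = ConcatFusionMaskNet()
enhanced = net(torch.rand(10, 257), torch.rand(10, 64))
print(enhanced.shape)  # torch.Size([10, 257])
```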
A variance modeling framework based on variational autoencoders for speech enhancement
In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization (NMF). More precisely, we use a variational autoencoder as a speaker-independent supervised generative speech model, highlighting the conceptual similarities that this approach shares with its NMF-based counterpart. In order to be free of generalization issues regarding the noisy recording environments, we follow the approach of having a supervised model only for the target speech signal, the noise model being based on unsupervised NMF. We develop a Monte Carlo expectation-maximization algorithm for inferring the latent variables in the variational autoencoder and estimating the unsupervised model parameters. Experiments show that the proposed method outperforms a semi-supervised NMF baseline and a state-of-the-art fully supervised deep learning approach.
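As a hedged illustration of the variance-modeling framework described above, the sketch below combines a VAE speech variance with an NMF noise variance in a Wiener-type estimator; the notation is assumed for this example rather than taken from the paper.

```latex
% Assumed notation for this illustration (not the paper's): x, s, n are the
% mixture, speech and noise STFT coefficients; z_t is the VAE latent at
% frame t; W and H are the nonnegative NMF factors of the noise variance.
\begin{align}
  x_{ft} &= s_{ft} + n_{ft},\\
  s_{ft} \mid z_t &\sim \mathcal{N}_c\bigl(0, \sigma^2_f(z_t)\bigr),
  \qquad n_{ft} \sim \mathcal{N}_c\bigl(0, (WH)_{ft}\bigr),\\
  \hat{s}_{ft} &= \mathbb{E}_{z_t \mid x}\!\left[
      \frac{\sigma^2_f(z_t)}{\sigma^2_f(z_t) + (WH)_{ft}}\right] x_{ft},
\end{align}
% with the posterior expectation over z_t approximated by the Monte Carlo
% samples drawn inside the expectation-maximization loop described above.
```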