A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders
Recent studies have explored the use of deep generative models of speech spectra based on variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps, in which the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the aforementioned iterative methods using sampling, while decreasing the computational cost by a factor of 36 to reach a given performance.
Comment: Submitted to INTERSPEECH 201
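A minimal sketch of the general recipe shared by these VAE-based enhancement methods, assuming a pre-trained encoder/decoder pair and a fixed noise power spectrogram; the names `encoder`, `decoder`, and `noise_psd` are hypothetical placeholders, not the authors' code:

```python
import torch

def enhance(noisy_stft, encoder, decoder, noise_psd, n_iter=10):
    """Iteratively re-estimate the clean-speech variance with a pre-trained
    VAE, then apply a Wiener-like gain (hypothetical sketch, not the paper's
    exact algorithm). Each STFT frame is treated as one data point."""
    x_pow = noisy_stft.abs() ** 2        # noisy power spectrogram, shape (F, T)
    s_var = x_pow.clone()                # initial clean-speech variance estimate
    for _ in range(n_iter):
        # Variational step: the pre-learned encoder approximates the posterior
        # over the latent variables, exactly as it was trained to do.
        z_mean, _ = encoder(s_var.T)     # (T, latent_dim) mean of q(z | s)
        # The decoder maps the latent mean back to a speech variance model.
        s_var = decoder(z_mean).T        # (F, T)
    # Wiener filter built from the final speech/noise variance estimates.
    gain = s_var / (s_var + noise_psd)
    return gain * noisy_stft
```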
Notes on the use of variational autoencoders for speech and audio spectrogram modeling
Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have recently been used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very little theoretical support is given to justify the chosen data representation and decoder likelihood function, or the corresponding cost function used for training the VAE. Yet a nice theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides the latter with insights into the choice and interpretability of data representation and model parameterization.
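Concretely, the framework in question models each short-time Fourier transform coefficient as zero-mean circularly symmetric complex Gaussian, s_ft ~ N_c(0, v_ft), with the variance v_ft produced by the VAE decoder. Up to constants, the negative log-likelihood |s_ft|^2 / v_ft + log v_ft is the Itakura-Saito divergence between the observed power spectrogram and the modeled variance, the same cost used in IS-NMF. A minimal sketch of that decoder loss (hypothetical names, not the paper's code):

```python
import torch

def complex_gaussian_nll(power_spec, var, eps=1e-8):
    """Negative log-likelihood of a zero-mean complex Gaussian spectrogram
    model. Up to terms that do not depend on `var`, this equals the
    Itakura-Saito divergence d_IS(x, v) = x/v - log(x/v) - 1 summed over
    all time-frequency bins."""
    return (power_spec / (var + eps) + torch.log(var + eps)).sum()
```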
Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization
In this paper we address speaker-independent multichannel speech enhancement
in unknown noisy environments. Our work is based on a well-established
multichannel local Gaussian modeling framework. We propose to use a neural
network for modeling the speech spectro-temporal content. The parameters of
this supervised model are learned using the framework of variational
autoencoders. The noisy recording environment is assumed to be unknown, so the
noise spectro-temporal modeling remains unsupervised and is based on
non-negative matrix factorization (NMF). We develop a Monte Carlo
expectation-maximization algorithm and we experimentally show that the proposed
approach outperforms its NMF-based counterpart, where speech is modeled using
supervised NMF.
Comment: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae
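The unsupervised noise part of such models is classically fitted with multiplicative Itakura-Saito NMF updates inside the M-step of the EM loop; a minimal sketch under that assumption, with hypothetical variable names (the Monte Carlo E-step, which samples the VAE latent variables, is omitted):

```python
import numpy as np

def is_nmf_update(V, W, H, eps=1e-8):
    """One pair of multiplicative updates decreasing the Itakura-Saito
    divergence D_IS(V || W @ H); V is an (F, T) noise power spectrogram."""
    WH = W @ H + eps
    W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + eps)
    WH = W @ H + eps
    H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + eps)
    return W, H

# Usage: initialize nonnegative factors and iterate to convergence.
rng = np.random.default_rng(0)
F, T, K = 257, 100, 8
V = rng.random((F, T)) + 1e-3
W = rng.random((F, K)) + 1e-3
H = rng.random((K, T)) + 1e-3
for _ in range(100):
    W, H = is_nmf_update(V, W, H)
```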
Weighted variance variational autoencoder for speech enhancement
We address speech enhancement based on variational autoencoders, which
involves learning a speech prior distribution in the time-frequency (TF)
domain. A zero-mean complex-valued Gaussian distribution is usually assumed for
the generative model, where the speech information is encoded in the variance
as a function of a latent variable. While this is the commonly used approach,
in this paper we propose a weighted variance generative model, where the
contribution of each TF point in parameter learning is weighted. We impose a
Gamma prior distribution on the weights, which effectively leads to a
Student's t-distribution instead of a Gaussian for speech modeling. We develop
efficient training and speech enhancement algorithms based on the proposed
generative model. Our experimental results on spectrogram modeling and speech
enhancement demonstrate the effectiveness and robustness of the proposed
approach compared to the standard unweighted variance model.
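For intuition, marginalizing a Gamma-distributed precision weight out of a zero-mean Gaussian gives a Student's t marginal. A minimal sketch of the resulting robust spectrogram loss for the simplified real-valued case (hypothetical; the paper works with complex-valued STFT coefficients and derives dedicated training and enhancement algorithms):

```python
import torch

def student_t_nll(x, scale, nu=4.0, eps=1e-8):
    """Negative log-likelihood (up to additive constants) of a zero-mean
    Student's t distribution with `nu` degrees of freedom, obtained by
    integrating a Gamma precision weight out of a Gaussian. The heavier
    tails down-weight outlier time-frequency bins during training."""
    var = scale ** 2 + eps
    return (0.5 * (nu + 1.0) * torch.log1p(x ** 2 / (nu * var))
            + 0.5 * torch.log(var)).sum()
```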
Audio-visual speech enhancement with a deep Kalman filter generative model
Deep latent variable generative models based on variational autoencoders
(VAEs) have shown promising performance for audiovisual speech enhancement
(AVSE). The underlying idea is to learn a VAE-based audiovisual prior
distribution for clean
speech data, and then combine it with a statistical noise model to recover a
speech signal from a noisy audio recording and video (lip images) of the target
speaker. Existing generative models developed for AVSE do not take into account
the sequential nature of speech data, which prevents them from fully
incorporating the power of visual data. In this paper, we present an
audiovisual deep Kalman filter (AV-DKF) generative model which assumes a
first-order Markov chain model for the latent variables and effectively fuses
audiovisual data. Moreover, we develop an efficient inference methodology to
estimate speech signals at test time. We conduct a set of experiments to
compare different variants of generative models for speech enhancement. The
results demonstrate the superiority of the AV-DKF model compared with both its
audio-only version and the non-sequential audio-only and audiovisual VAE-based
models.
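A minimal sketch of the first-order Markov latent dynamics that distinguish a deep Kalman filter from a plain VAE, assuming a simple feed-forward transition network (hypothetical module, not the AV-DKF implementation):

```python
import torch
import torch.nn as nn

class LatentTransition(nn.Module):
    """Parameterizes p(z_t | z_{t-1}) = N(mu(z_{t-1}), diag(sigma^2(z_{t-1})))."""
    def __init__(self, latent_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z_prev):
        h = self.net(z_prev)
        return self.mu(h), self.logvar(h)

# Sampling a latent trajectory from the prior dynamics.
trans = LatentTransition(latent_dim=16)
z = torch.zeros(1, 16)
for t in range(50):
    mu, logvar = trans(z)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
```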
Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement
Recently, audio-visual speech enhancement has been tackled in unsupervised
settings based on variational auto-encoders (VAEs), where during
training only clean data is used to train a generative model for speech, which
at test time is combined with a noise model, e.g. nonnegative matrix
factorization (NMF), whose parameters are learned without supervision.
Consequently, the proposed model is agnostic to the noise type. When visual
data are clean, audio-visual VAE-based architectures usually outperform the
audio-only counterpart. The opposite happens when the visual data are corrupted
by clutter, e.g. the speaker not facing the camera. In this paper, we propose
to find the optimal combination of these two architectures through time. More
precisely, we introduce the use of a latent sequential variable with Markovian
dependencies to switch between different VAE architectures through time in an
unsupervised manner, leading to the switching variational auto-encoder (SwVAE). We
propose a variational factorization to approximate the computationally
intractable posterior distribution. We also derive the corresponding
variational expectation-maximization algorithm to estimate the parameters of
the model and enhance the speech signal. Our experiments demonstrate the
promising performance of SwVAE.
Comment: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
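One standard way to reason about such a Markovian switching variable is an HMM-style forward recursion over the per-architecture likelihoods; a minimal sketch under that assumption (hypothetical, and not the variational factorization actually derived in the paper):

```python
import numpy as np

def forward_switch_posteriors(log_lik, trans, init):
    """Filtered posteriors p(s_t = k | x_{1:t}) of a Markov switching
    variable over K candidate VAE architectures.
    log_lik: (T, K) per-frame log-likelihoods; trans: (K, K) transition
    matrix with rows summing to one; init: (K,) initial distribution."""
    T, K = log_lik.shape
    alpha = np.zeros((T, K))
    a = init * np.exp(log_lik[0])
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = (alpha[t - 1] @ trans) * np.exp(log_lik[t])
        alpha[t] = a / a.sum()
    return alpha  # for clarity; a log-domain version is numerically safer
```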
Deep Variational Generative Models for Audio-visual Speech Separation
In this paper, we are interested in audio-visual speech separation given a
single-channel audio recording as well as visual information (lip movements)
associated with each speaker. We propose an unsupervised technique based on
audio-visual generative modeling of clean speech. More specifically, during
training, a latent variable generative model is learned from clean speech
spectrograms using a variational auto-encoder (VAE). To better utilize the
visual information, the posteriors of the latent variables are inferred from
mixed speech (instead of clean speech) as well as the visual data. The visual
modality also serves as a prior for latent variables, through a visual network.
At test time, the learned generative model (both for speaker-independent and
speaker-dependent scenarios) is combined with an unsupervised non-negative
matrix factorization (NMF) variance model for background noise. All the latent
variables and noise parameters are then estimated by a Monte Carlo
expectation-maximization algorithm. Our experiments show that the proposed
unsupervised VAE-based method yields better separation performance than
NMF-based approaches as well as a supervised deep learning-based technique.
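A minimal sketch of the per-source Wiener filtering implied by this local Gaussian model, where each speaker's variance comes from the (audio-visual) decoder and the noise variance from NMF; names are hypothetical, not the authors' code:

```python
import numpy as np

def separate(mix_stft, speech_vars, noise_var, eps=1e-8):
    """Wiener-filter each speaker out of a single-channel mixture.
    mix_stft:    (F, T) complex mixture STFT.
    speech_vars: list of (F, T) speech variance estimates, one per speaker.
    noise_var:   (F, T) NMF noise variance (e.g. W @ H)."""
    total = sum(speech_vars) + noise_var + eps
    return [v / total * mix_stft for v in speech_vars]
```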