93 research outputs found
A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders
Recent studies have explored the use of deep generative models of speech spectra based on variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps, in which the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the aforementioned sampling-based iterative methods, while decreasing the computational cost by a factor of 36 to reach a given performance. Comment: Submitted to INTERSPEECH 201
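To make the idea concrete, here is a minimal NumPy sketch, under assumed interfaces, of the fixed-point loop such a method implies: the pre-learned encoder approximates the posterior over the latents given the current speech estimate, the decoder maps them back to a speech variance, and a Wiener-like step re-estimates the clean speech power. encode and decode below are hypothetical stand-ins for a trained VAE, not the paper's model.

import numpy as np

rng = np.random.default_rng(0)
F, T, D = 257, 100, 16                      # frequency bins, frames, latent dim
W_enc = rng.standard_normal((D, F)) * 0.01  # hypothetical encoder weights
W_dec = rng.standard_normal((F, D)) * 0.01  # hypothetical decoder weights

def encode(s_pow):
    # Stand-in for the pre-learned VAE encoder: mean of q(z | s).
    return W_enc @ np.log1p(s_pow)          # (D, T)

def decode(z):
    # Stand-in for the VAE decoder: nonnegative speech variance.
    return np.exp(W_dec @ z)                # (F, T)

x_pow = rng.gamma(1.0, 1.0, size=(F, T))    # |noisy STFT|^2
v_n = np.full((F, T), 0.5)                  # unsupervised noise variance (e.g. NMF)
s_pow = x_pow.copy()                        # initialize clean speech with the mixture

for _ in range(10):                         # variational iterations
    z = encode(s_pow)                       # encoder approximates the posterior
    v_s = decode(z)                         # generative speech variance
    g = v_s / (v_s + v_n)                   # Wiener gain under the Gaussian model
    s_pow = g ** 2 * x_pow + g * v_n        # posterior mean of the speech power

Replacing the sampling step of earlier methods with this single encoder pass per iteration is what removes the per-step Gibbs or gradient cost.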
Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation
This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) has been used to directly solve the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Although this unsupervised technique, called neural FCA, is in principle free from the domain mismatch problem, it is computationally demanding due to the full-rankness of the spatial model, which is the price paid for robustness against relatively short reverberations. To reduce the model complexity without sacrificing performance, we propose neural FastFCA, based on a jointly-diagonalizable yet full-rank spatial model. Our neural separation model introduced for AVI alternates between neural network blocks and single steps of an efficient iterative algorithm called iterative source steering. This alternating architecture enables the separation model to quickly separate the mixture spectrogram by leveraging both the deep neural network and the multichannel optimization algorithm. The training objective with AVI is derived to maximize the marginalized likelihood of the observed mixtures. An experiment using mixture signals of two to four sound sources shows that neural FastFCA outperforms conventional BSS methods and reduces the computational time to about 2% of that of neural FCA. Comment: 5 pages, 2 figures, accepted to EUSIPCO 202
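The alternation of network blocks and iterative source steering (ISS) steps can be sketched as below: a minimal single-frequency NumPy sketch under assumed interfaces, not the authors' implementation. neural_block is a hypothetical stand-in for the DNN; the ISS update follows the standard rank-1 demixing form.

import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 400                                # sources = channels, frames
X = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
W = np.eye(N, dtype=complex)                 # demixing matrix for one frequency bin

def neural_block(Y):
    # Stand-in for the DNN block: smoothed source power estimates.
    p = np.abs(Y) ** 2
    k = np.ones(9) / 9.0
    return np.stack([np.convolve(row, k, mode="same") for row in p]) + 1e-6

def iss_step(W, X, R):
    # One sweep of iterative source steering (rank-1 demixing updates).
    for n in range(N):
        Y = W @ X
        num = (Y * np.conj(Y[n]) / R).sum(axis=1)
        den = (np.abs(Y[n]) ** 2 / R).sum(axis=1)
        v = num / den
        v[n] = 1.0 - 1.0 / np.sqrt(den[n] / T)
        W = W - np.outer(v, W[n])
    return W

for _ in range(4):                           # alternating architecture
    R = neural_block(W @ X)                  # neural network block
    W = iss_step(W, X, R)                    # multichannel optimization step
Y = W @ X                                    # separated source estimates

Each ISS sweep costs only rank-1 updates per source, which is the kind of saving that lets such a model approach a small fraction of the full-rank FCA runtime.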
A Deep Generative Model of Speech Complex Spectrograms
This paper proposes an approach to the joint modeling of the short-time
Fourier transform magnitude and phase spectrograms with a deep generative
model. We assume that the magnitude follows a Gaussian distribution and the
phase follows a von Mises distribution. To improve the consistency of the phase
values in the time-frequency domain, we also apply the von Mises distribution
to the phase derivatives, i.e., the group delay and the instantaneous
frequency. Based on these assumptions, we explore and compare several
combinations of loss functions for training our models. Built upon the
variational autoencoder framework, our model consists of three convolutional
neural networks acting as an encoder, a magnitude decoder, and a phase decoder.
In addition to the latent variables, we propose to also condition the phase
estimation on the estimated magnitude. Evaluated on a time-domain speech reconstruction task, our models can generate speech with high perceptual quality and high intelligibility.
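A minimal sketch of how the described loss terms could be combined, with toy shapes and stand-in decoder outputs (none of the names below come from the paper): a Gaussian negative log-likelihood scores the magnitude, while von Mises negative log-likelihoods score the phase and its wrapped time/frequency derivatives, i.e. the instantaneous frequency and the group delay.

import numpy as np

def gaussian_nll(x, mu, var):
    # Negative log-likelihood of a Gaussian on the magnitude.
    return 0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).mean()

def von_mises_nll(theta, mu, kappa):
    # Negative log-likelihood of a von Mises distribution on an angle.
    return (np.log(2 * np.pi * np.i0(kappa)) - kappa * np.cos(theta - mu)).mean()

def wrap(d):
    # Wrap angle differences to (-pi, pi].
    return np.angle(np.exp(1j * d))

rng = np.random.default_rng(0)
F, T = 257, 100
mag = rng.rayleigh(1.0, (F, T))               # target magnitude
phase = rng.uniform(-np.pi, np.pi, (F, T))    # target phase
mag_hat = mag + 0.1                           # toy decoder outputs
phase_hat = phase + 0.1

inst_freq = wrap(np.diff(phase, axis=1))      # instantaneous frequency
group_delay = wrap(-np.diff(phase, axis=0))   # group delay
loss = (gaussian_nll(mag, mag_hat, 1.0)
        + von_mises_nll(phase, phase_hat, 1.0)
        + von_mises_nll(inst_freq, wrap(np.diff(phase_hat, axis=1)), 1.0)
        + von_mises_nll(group_delay, wrap(-np.diff(phase_hat, axis=0)), 1.0))

The derivative terms are what enforce time-frequency consistency of the phase: matching the raw phase alone says nothing about how it evolves across adjacent frames and bins.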
Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization
In this paper we address speaker-independent multichannel speech enhancement
in unknown noisy environments. Our work is based on a well-established
multichannel local Gaussian modeling framework. We propose to use a neural
network for modeling the speech spectro-temporal content. The parameters of
this supervised model are learned using the framework of variational
autoencoders. The noisy recording environment is assumed to be unknown, so the
noise spectro-temporal modeling remains unsupervised and is based on
non-negative matrix factorization (NMF). We develop a Monte Carlo
expectation-maximization algorithm and we experimentally show that the proposed
approach outperforms its NMF-based counterpart, where speech is modeled using
supervised NMF. Comment: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae
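As a rough illustration only (a single-channel simplification with hypothetical stand-ins, not the authors' multichannel implementation), one Monte Carlo EM round in this kind of model could look as follows: the speech variance comes from a VAE decoder averaged over latent samples, the noise variance is the NMF product W @ H, and the M-step applies Itakura-Saito multiplicative updates to the NMF factors.

import numpy as np

rng = np.random.default_rng(0)
F, T, K, D = 257, 100, 8, 16
x_pow = rng.gamma(1.0, 1.0, size=(F, T))       # |noisy STFT|^2 (single channel)
W = rng.random((F, K)) + 0.1                   # NMF noise dictionary
H = rng.random((K, T)) + 0.1                   # NMF activations
W_dec = rng.standard_normal((F, D)) * 0.01     # hypothetical VAE decoder weights

def decode(z):
    # Stand-in for the trained decoder: nonnegative speech variance.
    return np.exp(W_dec @ z)

for _ in range(20):                            # MCEM iterations
    # E-step (Monte Carlo): average the speech variance over latent samples;
    # a crude stand-in for the latent sampling an MCEM algorithm would use.
    v_s = np.mean([decode(rng.standard_normal((D, T))) for _ in range(5)], axis=0)
    v_n = W @ H
    g = v_n / (v_s + v_n)
    n_pow = g ** 2 * x_pow + g * v_s           # posterior mean of the noise power
    # M-step: Itakura-Saito multiplicative updates fitting W @ H to n_pow.
    W *= ((n_pow / (W @ H) ** 2) @ H.T) / ((1.0 / (W @ H)) @ H.T)
    H *= (W.T @ (n_pow / (W @ H) ** 2)) / (W.T @ (1.0 / (W @ H)))

s_pow = (v_s / (v_s + W @ H)) * x_pow          # Wiener-style speech estimate

Keeping the noise side as unsupervised NMF is what makes the scheme semi-supervised: only the speech model needs training data, while W and H adapt to whatever environment the recording comes from.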