6 research outputs found
A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders
Recent studies have explored the use of deep generative models of speech
spectra based on variational autoencoders (VAEs), combined with unsupervised
noise models, to perform speech enhancement. These studies developed iterative
algorithms involving either Gibbs sampling or gradient descent at each step,
making them computationally expensive. This paper proposes a variational
inference method to iteratively estimate the power spectrogram of the clean
speech. Our main contribution is the analytical derivation of the variational
steps in which the encoder of the pre-learned VAE can be used to estimate the
variational approximation of the true posterior distribution, using the very
same assumption made to train VAEs. Experiments show that the proposed method
produces results on par with the aforementioned iterative methods using
sampling, while decreasing the computational cost by a factor of 36 to reach a
given performance. Comment: Submitted to INTERSPEECH 201
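As a rough illustration of the variational step described above, the sketch below reuses a VAE encoder to parameterize the approximate posterior of the latent variables, draws a latent sample via the reparameterization trick, and builds a Wiener-like gain from the decoded speech variance and an unsupervised noise variance. The toy `encode`/`decode` functions are hypothetical stand-ins for the paper's trained networks, and all dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Toy stand-in for a pre-learned VAE encoder: maps a power-spectrogram
    frame x to the mean and log-variance of the Gaussian approximate
    posterior q(z | x). A real encoder would be a trained neural network."""
    mu = 0.1 * x[:2]                 # hypothetical 2-dim latent space
    logvar = np.full(2, -1.0)
    return mu, logvar

def decode(z, dim):
    """Toy stand-in for the VAE decoder: maps a latent z to a
    per-frequency speech variance (always positive)."""
    return np.exp(0.05 * z.sum()) * np.ones(dim)

x_noisy = rng.random(8) + 0.1         # one noisy power-spectrogram frame
mu, logvar = encode(x_noisy)          # variational step: reuse the encoder
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(2)  # reparameterization
var_speech = decode(z, x_noisy.size)
var_noise = np.full(x_noisy.size, 0.5)              # unsupervised noise model
wiener_gain = var_speech / (var_speech + var_noise)  # posterior-mean filter
x_clean_est = wiener_gain * x_noisy
```

In the actual method this estimate would be refined iteratively; the point here is only that each variational step needs one encoder pass and a closed-form gain, with no Gibbs sampling or gradient descent.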
Speaker Re-identification with Speaker Dependent Speech Enhancement
While the use of deep neural networks has significantly boosted speaker
recognition performance, it is still challenging to separate speakers in poor
acoustic environments. Here speech enhancement methods have traditionally
allowed improved performance. Recent work has shown that adapting speech
enhancement can lead to further gains. This paper introduces a novel approach
that cascades speech enhancement and speaker recognition. In the first step, a
speaker embedding vector is generated, which is used in the second step to
enhance the speech quality and re-identify the speakers. Models are trained in
an integrated framework with joint optimisation. The proposed approach is
evaluated using the VoxCeleb1 dataset, which aims to assess speaker recognition
in real-world situations. In addition, three types of noise at different
signal-to-noise ratios were added for this work. The obtained results show that
the proposed approach using speaker dependent speech enhancement can yield
better speaker recognition and speech enhancement performances than two
baselines in various noise conditions. Comment: Accepted for presentation at Interspeech202
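The two-step cascade can be sketched in a few lines. Everything below is a hypothetical toy, not the paper's architecture: the embedding is a simple temporal average rather than a learned speaker embedding, and the speaker-dependent mask is a sigmoid conditioned on that embedding rather than a jointly trained network:

```python
import numpy as np

def speaker_embedding(spectrogram):
    """Toy stand-in for step 1: extract an utterance-level speaker
    embedding. Here just the mean over time frames; a real system would
    use a trained embedding network (e.g. x-vector style)."""
    return spectrogram.mean(axis=1)

def enhance(spectrogram, embedding):
    """Toy stand-in for step 2: speaker-dependent enhancement. A sigmoid
    mask in (0, 1) is conditioned on the embedding; in the paper the two
    modules are trained jointly in one framework."""
    gain = 1.0 / (1.0 + np.exp(-(spectrogram - embedding[:, None])))
    return gain * spectrogram

spec = np.abs(np.random.default_rng(1).standard_normal((16, 40)))  # freq x time
emb = speaker_embedding(spec)            # step 1: embed the noisy utterance
enhanced = enhance(spec, emb)            # step 2: speaker-dependent mask
re_identified = speaker_embedding(enhanced)  # re-identify on enhanced speech
```

The design point the sketch mirrors is the feedback loop: the embedding conditions the enhancement, and re-identification then runs on the enhanced signal.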
Robust Speaker Recognition Using Speech Enhancement And Attention Model
In this paper, a novel architecture for speaker recognition is proposed by
cascading speech enhancement and speaker processing. Its aim is to improve
speaker recognition performance when speech signals are corrupted by noise.
Instead of individually processing speech enhancement and speaker recognition,
the two modules are integrated into one framework by a joint optimisation using
deep neural networks. Furthermore, to increase robustness against noise, a
multi-stage attention mechanism is employed to highlight the speaker related
features learned from context information in time and frequency domain. To
evaluate speaker identification and verification performance of the proposed
approach, we test it on the VoxCeleb1 dataset, one of the most widely used benchmark
datasets. Moreover, the robustness of our proposed approach is also tested on
VoxCeleb1 data when being corrupted by three types of interferences, general
noise, music, and babble, at different signal-to-noise ratio (SNR) levels. The
obtained results show that the proposed approach using speech enhancement and
multi-stage attention models outperforms two strong baselines not using them in
most acoustic conditions in our experiments. Comment: Accepted by Odyssey 202
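A minimal sketch of attention applied separately over the time and frequency axes, stacked in several stages, is shown below. This is an illustrative reading of "multi-stage attention in time and frequency domain", not the paper's actual layers; the weighting scheme (softmax over axis means) is an assumption chosen for brevity:

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tf_attention(feat):
    """One hypothetical attention stage: separate attention weights over
    the time axis and the frequency axis re-weight regions of the
    feature map that are more speaker-relevant."""
    w_t = softmax(feat.mean(axis=0), axis=0)   # attention over time frames
    w_f = softmax(feat.mean(axis=1), axis=0)   # attention over frequencies
    return feat * w_f[:, None] * w_t[None, :]  # broadcast both weightings

feat = np.abs(np.random.default_rng(2).standard_normal((40, 100)))  # freq x time
out = feat
for _ in range(3):        # "multi-stage": stack several attention passes
    out = tf_attention(out)
```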
Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization
In this paper we address speaker-independent multichannel speech enhancement
in unknown noisy environments. Our work is based on a well-established
multichannel local Gaussian modeling framework. We propose to use a neural
network for modeling the speech spectro-temporal content. The parameters of
this supervised model are learned using the framework of variational
autoencoders. The noisy recording environment is supposed to be unknown, so the
noise spectro-temporal modeling remains unsupervised and is based on
non-negative matrix factorization (NMF). We develop a Monte Carlo
expectation-maximization algorithm and we experimentally show that the proposed
approach outperforms its NMF-based counterpart, where speech is modeled using
supervised NMF. Comment: 5 pages, 2 figures, audio examples and code available online at
https://team.inria.fr/perception/icassp-2019-mvae
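The unsupervised NMF noise model referred to above is typically fitted with the standard multiplicative updates for the Itakura-Saito divergence. A minimal NumPy sketch follows; the rank `K = 8`, the sizes, and the random "noise spectrogram" are arbitrary choices for illustration, and in the paper these updates would sit inside the Monte Carlo EM loop rather than run standalone:

```python
import numpy as np

rng = np.random.default_rng(3)
V = rng.random((64, 50)) + 1e-3   # noise power spectrogram to be modeled
K = 8                             # NMF rank (arbitrary assumption)
W = rng.random((64, K)) + 1e-3    # spectral basis
H = rng.random((K, 50)) + 1e-3    # temporal activations

def is_div(V, Vhat):
    """Itakura-Saito divergence D_IS(V || Vhat), summed over all entries."""
    R = V / Vhat
    return np.sum(R - np.log(R) - 1.0)

loss = [is_div(V, W @ H)]
for _ in range(100):              # standard multiplicative IS-NMF updates
    Vhat = W @ H
    W *= ((Vhat**-2 * V) @ H.T) / (Vhat**-1 @ H.T)
    Vhat = W @ H
    H *= (W.T @ (Vhat**-2 * V)) / (W.T @ Vhat**-1)
    loss.append(is_div(V, W @ H))
```

Because the updates are multiplicative, `W` and `H` stay nonnegative by construction, which is what makes NMF a convenient unsupervised model for nonnegative noise spectrograms.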
Notes on the use of variational autoencoders for speech and audio spectrogram modeling
Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have been recently used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very little theoretical support is given to justify the chosen data representation and decoder likelihood function, or the corresponding cost function used for training the VAE. Yet, a nice theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides insights into the choice and interpretability of data representation and model parameterization.
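One concrete instance of the statistical framework this abstract refers to is the link between an exponential decoder likelihood on the power spectrogram (equivalently, a circular complex Gaussian model of the STFT coefficients) and the Itakura-Saito divergence used in IS-NMF: the negative log-likelihood and the divergence differ only by a term that does not depend on the model variances. A short numerical check, with arbitrary illustrative values:

```python
import numpy as np

def neg_log_lik_exponential(p, v):
    """-log of the exponential likelihood p(p | v) = (1/v) exp(-p / v),
    i.e. the distribution of the power p = |x|^2 when x is a circular
    complex Gaussian with variance v (the decoder output)."""
    return np.sum(np.log(v) + p / v)

def itakura_saito(p, v):
    """Itakura-Saito divergence D_IS(p || v), summed over frequencies."""
    return np.sum(p / v - np.log(p / v) - 1.0)

rng = np.random.default_rng(4)
p = rng.random(32) + 0.1   # observed power-spectrogram frame
v = rng.random(32) + 0.1   # model variances from the decoder
# The two costs differ only by a constant independent of v:
const = np.sum(np.log(p)) + p.size
```

Minimizing the VAE's reconstruction term under this likelihood is therefore equivalent to minimizing the IS divergence between the observed power spectrogram and the decoded variances, which is exactly the cost function familiar from IS-NMF.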
Speech enhancement with variational autoencoders and alpha-stable distributions
This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge of the noisy recording environment. In this context, our contribution is to propose a noise model based on alpha-stable distributions, instead of the more conventional Gaussian non-negative matrix factorization approach found in previous studies. We develop a Monte Carlo expectation-maximization algorithm for estimating the model parameters at test time. Experimental results show the superiority of the proposed approach both in terms of perceptual quality and intelligibility of the enhanced speech signal.
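Alpha-stable noise differs from Gaussian noise mainly in its heavy tails (for characteristic exponent alpha < 2), which is what makes it attractive for impulsive real-world noise. Symmetric alpha-stable samples can be drawn with the classical Chambers-Mallows-Stuck transform; the sketch below is a generic sampler, not the paper's inference procedure, and alpha = 1.5 is an arbitrary illustrative value:

```python
import numpy as np

def sample_sas(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable noise
    (skewness beta = 0, unit scale): heavy-tailed for alpha < 2,
    Gaussian in the limit alpha = 2."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # exponential variate
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(5)
noise = sample_sas(1.5, 10000, rng)   # one draw of heavy-tailed noise
```

For alpha < 2 the samples have infinite variance, so occasional very large excursions are expected; a Gaussian NMF noise model would heavily penalize exactly those impulsive events.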