6 research outputs found

    A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders

    Full text link
    Recent studies have explored the use of deep generative models of speech spectra based of variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps in which the en-coder of the pre-learned VAE can be used to estimate the varia-tional approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the afore-mentioned iterative methods using sampling, while decreasing the computational cost by a factor 36 to reach a given performance .Comment: Submitted to INTERSPEECH 201

    Speaker Re-identification with Speaker Dependent Speech Enhancement

    Get PDF
    While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments. Here speech enhancement methods have traditionally allowed improved performance. The recent works have shown that adapting speech enhancement can lead to further gains. This paper introduces a novel approach that cascades speech enhancement and speaker recognition. In the first step, a speaker embedding vector is generated , which is used in the second step to enhance the speech quality and re-identify the speakers. Models are trained in an integrated framework with joint optimisation. The proposed approach is evaluated using the Voxceleb1 dataset, which aims to assess speaker recognition in real world situations. In addition three types of noise at different signal-noise-ratios were added for this work. The obtained results show that the proposed approach using speaker dependent speech enhancement can yield better speaker recognition and speech enhancement performances than two baselines in various noise conditions.Comment: Acceptted for presentation at Interspeech202

    Robust Speaker Recognition Using Speech Enhancement And Attention Model

    Get PDF
    In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks. Furthermore, to increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in time and frequency domain. To evaluate speaker identification and verification performance of the proposed approach, we test it on the dataset of VoxCeleb1, one of mostly used benchmark datasets. Moreover, the robustness of our proposed approach is also tested on VoxCeleb1 data when being corrupted by three types of interferences, general noise, music, and babble, at different signal-to-noise ratio (SNR) levels. The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.Comment: Acceptted by Odyssey 202

    Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization

    Get PDF
    In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech spectro-temporal content. The parameters of this supervised model are learned using the framework of variational autoencoders. The noisy recording environment is supposed to be unknown, so the noise spectro-temporal modeling remains unsupervised and is based on non-negative matrix factorization (NMF). We develop a Monte Carlo expectation-maximization algorithm and we experimentally show that the proposed approach outperforms its NMF-based counterpart, where speech is modeled using supervised NMF.Comment: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae

    Notes on the use of variational autoencoders for speech and audio spectrogram modeling

    Get PDF
    International audienceVariational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have been recently used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very poor theoretical support is given to justify the chosen data representation and decoder likelihood function or the corresponding cost function used for training the VAE. Yet, a nice theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides the latter insights on the choice and interpretability of data representation and model parameterization

    Speech enhancement with variational autoencoders and alpha-stable distributions

    Get PDF
    International audiencehis paper focuses on single-channel semi-supervised speech en-hancement. We learn a speaker-independent deep generative speechmodel using the framework of variational autoencoders. The noisemodel remains unsupervised because we do not assume prior knowl-edge of the noisy recording environment. In this context, our con-tribution is to propose a noise model based on alpha-stable distribu-tions, instead of the more conventional Gaussian non-negative ma-trix factorization approach found in previous studies. We develop aMonte Carlo expectation-maximization algorithm for estimating themodel parameters at test time. Experimental results show the supe-riority of the proposed approach both in terms of perceptual qualityand intelligibility of the enhanced speech signal
    corecore