3 research outputs found

    Variational Bayesian Inference for Source Separation and Robust Feature Extraction

    Get PDF
    International audienceWe consider the task of separating and classifying individual sound sources mixed together. The main challenge is to achieve robust classification despite residual distortion of the separated source signals. A promising paradigm is to estimate the uncertainty about the separated source signals and to propagate it through the subsequent feature extraction and classification stages. We argue that variational Bayesian (VB) inference offers a mathematically rigorous way of deriving uncertainty estimators, which contrasts with state-of-the-art estimators based on heuristics or on maximum likelihood (ML) estimation. We propose a general VB source separation algorithm, which makes it possible to jointly exploit spatial and spectral models of the sources. This algorithm achieves 6% and 5% relative error reduction compared to ML uncertainty estimation on the CHiME noise-robust speaker identification and speech recognition benchmarks, respectively, and it opens the way for more complex VB approximations of uncertainty.Dans cet article, nous considérons le problème de l'extraction des descripteurs de chaque source dans un enregistrement audio multi-sources à l'aide d'un algorithme général de séparation de sources. La difficulté consiste à estimer l'incertitude sur les sources et à la propager aux descripteurs, afin de les estimer de façon robuste en dépit des erreurs de séparation. Les méthodes de l'état de l'art estiment l'incertitude de façon heuristique, tandis que nous proposons d'intégrer sur les paramètres de l'algorithme de séparation de sources. Nous décrivons dans ce but une méthode d'inférence variationnelle bayésienne pour l'estimation de la distribution a posteriori des sources et nous calculons ensuite l'espérance des descripteurs par propagation de l'incertitude selon la méthode d'identification des moments. Nous évaluons la précision des descripteurs en terme d'erreur quadratique moyenne et conduisons des expériences de reconnaissance du locuteur afin d'observer la performance qui en découle pour un problème réel. Dans les deux cas, la méthode proposée donne les meilleurs résultats

    Deep neural networks for monaural source separation

    Get PDF
    PhD ThesisIn monaural source separation (MSS) only one recording is available and the spatial information, generally, cannot be extracted. It is also an undetermined inverse problem. Rcently, the development of the deep neural network (DNN) provides the framework to address this problem. How to select the types of neural network models and training targets is the research question. Moreover, in real room environments, the reverberations from floor, walls, ceiling and furnitures in a room are challenging, which distort the received mixture and degrade the separation performance. In many real-world applications, due to the size of hardware, the number of microphones cannot always be multiple. Hence, deep learning based MSS is the focus of this thesis. The first contribution is on improving the separation performance by enhancing the generalization ability of the deep learning-base MSS methods. According to no free lunch (NFL) theorem, it is impossible to find the neural network model which can estimate the training target perfectly in all cases. From the acquired speech mixture, the information of clean speech signal could be over- or underestimated. Besides, the discriminative criterion objective function can be used to address ambiguous information problem in the training stage of deep learning. Based on this, the adaptive discriminative criterion is proposed and better separation performance is obtained. In addition to this, another alternative method is using the sequentially trained neural network models within different training targets to further estimate iv Abstract v the clean speech signal. By using different training targets, the generalization ability of the neural network models is improved, and thereby better separation performance. The second contribution is addressing MSS problem in reverberant room environments. To achieve this goal, a novel time-frequency (T-F) mask, e.g. dereverberation mask (DM) is proposed to estimate the relationship between the reverberant noisy speech mixture and the dereverberated mixture. Then, a separation mask is exploited to extract the desired clean speech signal from the noisy speech mixture. The DM can be integrated with ideal ratio mask (IRM) to generate ideal enhanced mask (IEM) to address both dereverberation and separation problems. Based on the DM and the IEM, a two-stage approach is proposed with different system structures. In the final contribution, both phase information of clean speech signal and long short-term memory (LSTM) recurrent neural network (RNN) are introduced. A novel complex signal approximation (SA)-based method is proposed with the complex domain of signals. By utilizing the LSTM RNN as the neural network model, the temporal information is better used, and the desired speech signal can be estimated more accurately. Besides, the phase information of clean speech signal is applied to mitigate the negative influence from noisy phase information. The proposed MSS algorithms are evaluated with various challenging datasets such as the TIMIT, IEEE corpora and NOISEX database. The algorithms are assessed with state-of-the-art techniques and performance measures to confirm that the proposed MSS algorithms provide novel solution
    corecore