423 research outputs found

    Probabilistic Modeling Paradigms for Audio Source Separation

    This is the author's final version of the article, first published as E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, M. E. Davies. Probabilistic Modeling Paradigms for Audio Source Separation. In W. Wang (Ed.), Machine Audition: Principles, Algorithms and Systems. Chapter 7, pp. 162-185. IGI Global, 2011. ISBN 978-1-61520-919-4. DOI: 10.4018/978-1-61520-919-4.ch007.
    Most sound scenes result from the superposition of several sources, which can be separately perceived and analyzed by human listeners. Source separation aims to provide machine listeners with similar skills by extracting the sounds of individual sources from a given scene. Existing separation systems operate either by emulating the human auditory system or by inferring the parameters of probabilistic sound models. In this chapter, the authors focus on the latter approach and provide a joint overview of established and recent models, including independent component analysis, local time-frequency models and spectral template-based models. They show that most models are instances of one of two general paradigms: linear modeling or variance modeling. They compare the merits of each paradigm and report objective performance figures. They conclude by discussing promising combinations of probabilistic priors and inference algorithms that could form the basis of future state-of-the-art systems.
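    The variance-modeling paradigm mentioned in the abstract can be illustrated with a minimal sketch: each source is modeled as zero-mean Gaussian in every time-frequency bin with its own variance, so the MMSE estimate of a source is the mixture scaled by the ratio of that source's variance to the total (a Wiener-style soft mask). The function and variable names below are illustrative, not taken from the chapter.

    ```python
    import numpy as np

    def wiener_masks(source_variances):
        """Given per-source time-frequency variance estimates (arrays of the
        same shape), return the soft masks implied by a Gaussian variance
        model: each source's MMSE estimate is mask * mixture."""
        total = np.maximum(sum(source_variances), 1e-12)  # avoid divide-by-zero
        return [v / total for v in source_variances]

    # Toy example: two sources on a 2x3 time-frequency grid.
    v1 = np.array([[4.0, 1.0, 0.5], [2.0, 2.0, 1.0]])
    v2 = np.array([[1.0, 1.0, 0.5], [2.0, 6.0, 3.0]])
    m1, m2 = wiener_masks([v1, v2])
    # In bin (0, 0), source 1 dominates: mask = 4 / (4 + 1) = 0.8.
    ```

    The masks of all sources sum to one in every bin, so the source estimates always add back up to the mixture.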

    Unsupervised Learning Algorithm for Noise Suppression and Speech Enhancement Applications

    Smart and intelligent devices are being integrated more and more into day-to-day life to perform a multitude of tasks. These tasks include, but are not limited to, job automation and smart utility management, with the aim of improving quality of life and making normal day-to-day chores as effortless as possible. These smart devices may or may not be connected to the internet to accomplish their tasks, and human-machine interaction with them may be touch-screen based or driven by voice commands. To understand and act upon received voice commands, these devices need to enhance and distinguish the (clean) speech signal from the recorded noisy signal, which is contaminated by interference and background noise. The enhanced speech signal is then analyzed locally or in the cloud to extract the command. This speech enhancement task can be achieved effectively if the number of recording microphones is large, but incorporating many microphones is only possible in large and expensive devices. With multiple microphones present, the computational complexity of speech enhancement algorithms is high, along with their power consumption. If the device under consideration is small with limited power and computational capabilities, such as a hearing aid or cochlear implant, having multiple microphones is not possible; thus, most of these devices have been developed with a single microphone. As a result, developing a speech enhancement algorithm for such single-microphone devices, while keeping its computational complexity and power consumption low, is a challenging problem. There has been considerable research on this problem with good speech enhancement performance; however, most real-time speech enhancement algorithms lose their effectiveness if the level of noise present in the recorded speech is high.
This dissertation deals with this problem: the objective is to develop a method that enhances performance by reducing the noise level of the input signal. To this end, a pre-processing step is proposed before applying speech enhancement algorithms. This pre-processing performs noise suppression in the transformed domain by generating an approximation of the noisy signal's short-time Fourier transform. The approximated signal, with improved input signal-to-noise ratio, is then used by other speech enhancement algorithms to recover the underlying clean signal. The approximation is performed by the proposed Block-Principal Component Analysis (Block-PCA) algorithm. To illustrate the efficacy of the methodology, a detailed performance analysis under multiple noise types and noise levels is presented, which demonstrates that the inclusion of the pre-processing step considerably improves the performance of speech enhancement algorithms when compared to approaches with no pre-processing step.
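    The dissertation's Block-PCA operates on the noisy STFT; as a simplified stand-in (not the actual Block-PCA algorithm), the sketch below approximates a noisy magnitude spectrogram by a low-rank SVD truncation, which suppresses broadband noise that is spread across all components. All names and the rank-1 toy signal are assumptions for illustration.

    ```python
    import numpy as np

    def lowrank_denoise(noisy_mag, rank):
        """Low-rank approximation of a noisy magnitude spectrogram via the
        SVD: keeping only the strongest components raises the input SNR
        before a conventional speech enhancer is applied."""
        U, s, Vt = np.linalg.svd(noisy_mag, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank]

    # Toy example: a rank-1 "speech" pattern buried in white noise.
    rng = np.random.default_rng(0)
    clean = np.outer(np.hanning(64), np.hanning(80))        # freq x time
    noisy = clean + 0.3 * rng.standard_normal(clean.shape)
    approx = lowrank_denoise(noisy, rank=1)
    err_before = np.linalg.norm(noisy - clean)
    err_after = np.linalg.norm(approx - clean)
    ```

    For this toy signal the truncated reconstruction is much closer to the clean pattern than the raw noisy input, which is the effect the pre-processing step relies on.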

    Speech enhancement algorithm based on super-Gaussian modeling and orthogonal polynomials

    © 2020 Lippincott Williams and Wilkins. All rights reserved. Different types of noise from the surroundings always interfere with speech and produce annoying signals for the human auditory system. To exchange speech information in a noisy environment, speech quality and intelligibility must be maintained, which is a challenging task. In most speech enhancement algorithms, the speech signal is characterized by Gaussian or super-Gaussian models, and noise is characterized by a Gaussian prior. However, these assumptions do not always hold in real-life situations, thereby negatively affecting the estimation and, eventually, the performance of the enhancement algorithm. Accordingly, this paper focuses on deriving an optimum low-distortion estimator with models that fit speech and noise data well. This estimator provides minimum levels of speech distortion and residual noise, with additional improvements in perceptual aspects of speech, via four key steps. First, a recent transform based on an orthogonal polynomial is used to map the observed signal into a transform domain. Second, noise classification based on feature extraction is adopted to find accurate and mutable models for the noise signals. Third, two stages of nonlinear and linear estimators based on the minimum mean square error (MMSE) and new models for speech and noise are derived to estimate the clean speech signal. Finally, the estimated speech signal in the time domain is obtained by applying the inverse of the orthogonal transform. The results show that the average classification accuracy of the proposed approach is 99.43%. In addition, the proposed algorithm significantly outperforms existing speech estimators in terms of quality and intelligibility measures.
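    The Gaussian-prior baseline that this paper improves upon can be sketched as the classic MMSE (Wiener) spectral gain. The paper's own estimator uses orthogonal-polynomial transforms and super-Gaussian models; the snippet below shows only the standard Gaussian case, with the a-priori SNR obtained by plain spectral subtraction for brevity (names are illustrative).

    ```python
    import numpy as np

    def wiener_gain(noisy_power, noise_power):
        """Classic MMSE (Wiener) gain under Gaussian speech/noise priors:
        gain = xi / (xi + 1), where xi is the a-priori SNR, here estimated
        by simple spectral subtraction."""
        snr_prior = np.maximum(noisy_power / np.maximum(noise_power, 1e-12) - 1.0, 0.0)
        return snr_prior / (snr_prior + 1.0)

    # A high-SNR bin passes almost unchanged; a noise-only bin is zeroed.
    gain = wiener_gain(np.array([100.0, 1.0]), np.array([1.0, 1.0]))
    # gain[0] = 99/100 = 0.99, gain[1] = 0.0
    ```

    Applying this gain to the noisy spectrum and inverting the transform yields the enhanced signal; the paper replaces both the Gaussian priors and the Fourier transform in this pipeline.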

    New Strategies for Single-channel Speech Separation


    Enhanced IVA for audio separation in highly reverberant environments

    Blind Audio Source Separation (BASS), inspired by the "cocktail-party problem", has been a leading research application for blind source separation (BSS). This thesis concerns the enhancement of frequency domain convolutive blind source separation (FDCBSS) techniques for audio separation in highly reverberant room environments. Independent component analysis (ICA) is a higher order statistics (HOS) approach commonly used in the BSS framework. When applied to audio FDCBSS, ICA based methods suffer from the permutation problem across the frequency bins of each source. Independent vector analysis (IVA) is a frequency domain BSS algorithm that theoretically solves the permutation problem by using a multivariate source prior, where the sources are considered to be random vectors. The algorithm enforces independence between the multivariate source signals while retaining the dependency between the source signals within each source vector. The source prior adopted to model the nonlinear dependency structure within the source vectors is crucial to the separation performance of the IVA algorithm. The focus of this thesis is on improving the separation performance of the IVA algorithm in the application of BASS. An alternative multivariate Student's t distribution is proposed as the source prior for the batch IVA algorithm. A Student's t probability density function can better model certain frequency domain speech signals due to its tail dependency property. The nonlinear score function for the IVA is then derived from the proposed source prior. A novel energy-driven mixed super-Gaussian and Student's t source prior is proposed for the IVA and FastIVA algorithms. The Student's t distribution in the mixed source prior can model the high amplitude data points, whereas the super-Gaussian distribution can model the lower amplitude information in the speech signals.
The ratio of both distributions can be adjusted according to the energy of the observed mixtures to adapt to different types of speech signals. A particular multivariate generalized Gaussian distribution is adopted as the source prior for the online IVA algorithm. The nonlinear score function derived from this proposed source prior contains fourth order relationships between different frequency bins, which provides a more informative and stronger dependency structure and thereby improves the separation performance. An adaptive learning scheme is developed to improve the performance of the online IVA algorithm. The scheme adjusts the learning rate as a function of proximity to the target solutions. It is also accompanied by a novel switched source prior technique, which takes the best performance properties of the super-Gaussian source prior and the generalized Gaussian source prior as the algorithm converges. The methods and techniques proposed in this thesis are evaluated with real speech source signals in different simulated and real reverberant acoustic environments. A variety of measures are used within the evaluation criteria of the various algorithms. The experimental results demonstrate the improved performance of the proposed methods and their robustness in a wide range of situations.
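    The multivariate source priors discussed above enter the IVA updates only through their score functions (nonlinearities). As a sketch, the snippet below shows two commonly cited forms: the spherical super-Gaussian prior p(y) ∝ exp(−‖y‖), whose score couples all frequency bins of a source and thereby avoids the permutation problem, and one standard parameterization of a multivariate Student's t score. These are textbook forms, not necessarily the exact priors derived in the thesis; names and the choice ν = 4 are assumptions.

    ```python
    import numpy as np

    def score_super_gaussian(y):
        """Score for the spherical super-Gaussian prior p(y) ~ exp(-||y||):
        each frequency bin is normalized by the norm over all bins, so the
        bins of one source are treated jointly rather than independently."""
        return y / np.maximum(np.linalg.norm(y, axis=0, keepdims=True), 1e-12)

    def score_student_t(y, nu=4.0):
        """Score for a multivariate Student's t prior (one standard form):
        heavier tails down-weight high-amplitude outlier frames."""
        k = y.shape[0]  # number of frequency bins
        return (nu + k) * y / (nu + np.sum(np.abs(y) ** 2, axis=0, keepdims=True))

    # Toy check on a (bins x frames) source estimate.
    y = np.array([[3.0, 0.1], [4.0, 0.2]])
    phi = score_super_gaussian(y)    # each column has unit norm
    phi_t = score_student_t(y)       # bounded even for large-amplitude frames
    ```

    Swapping one score function for another is all that changes between the batch, FastIVA and online variants' source-prior choices; the unmixing update itself stays the same.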

    Incorporating prior information in nonnegative matrix factorization for audio source separation

    In this work, we propose solutions to the problem of audio source separation from a single recording. The audio source signals can be speech, music or any other audio signals. We assume that training data for the individual source signals present in the mixed signal are available. The training data are used to build a representative model for each source; in most cases, these models are sets of basis vectors in the magnitude or power spectral domain. The proposed algorithms depend on decomposing the spectrogram of the mixed signal with the trained basis models for all sources observed in the mixed signal. Nonnegative matrix factorization (NMF) is used to train the basis models for the source signals. NMF is then used to decompose the mixed signal spectrogram as a weighted linear combination of the trained basis vectors for each observed source. After decomposing the mixed signal, spectral masks are built and used to reconstruct the source signals. In this thesis, we improve the performance of NMF for source separation by incorporating more constraints and prior information related to the source signals into the NMF decomposition. The NMF decomposition weights are encouraged to satisfy prior information related to the nature of the source signals. The priors are modeled using Gaussian mixture models or hidden Markov models; they represent valid weight combination sequences that the basis vectors can receive for a certain type of source signal. The prior models are incorporated into the NMF cost function using either log-likelihood or minimum mean squared error (MMSE) estimation. We also incorporate prior information as a post-processing step: we impose a smoothness prior on the NMF solutions through post-smoothing, and introduce post-enhancement using MMSE estimation to obtain better separation of the source signals.
In this thesis, we also improve the NMF training of the basis models. When enough training data are not available, we introduce two different adaptation methods so that the trained bases better fit the sources in the mixed signal. We also improve the training procedures by learning more discriminative dictionaries for the source signals. In addition, to consider a larger context in the models, we concatenate neighboring spectra together and train basis sets from them instead of from single frames, which makes it possible to directly model the relation between consecutive spectral frames. Experimental results show that the proposed approaches improve the performance of using NMF in source separation applications.
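    The baseline pipeline described in the abstract (train bases per source with NMF, fix them on the mixture, learn only the weights, then build soft spectral masks) can be sketched as follows. This is the generic supervised-NMF scheme with Euclidean multiplicative updates, without the thesis's priors or adaptation methods; all names are illustrative.

    ```python
    import numpy as np

    def nmf(V, rank, iters=200, rng=None):
        """Multiplicative-update NMF (Euclidean cost): V ~ W @ H.
        Used in the training stage to learn a basis W per source."""
        rng = rng or np.random.default_rng(0)
        W = rng.random((V.shape[0], rank)) + 0.1
        H = rng.random((rank, V.shape[1])) + 0.1
        for _ in range(iters):
            H *= (W.T @ V) / np.maximum(W.T @ W @ H, 1e-12)
            W *= (V @ H.T) / np.maximum(W @ H @ H.T, 1e-12)
        return W, H

    def separate(mix_mag, W1, W2, iters=200):
        """Supervised separation: concatenate the trained bases, update only
        the weights H on the mixture, then build soft spectral masks."""
        W = np.hstack([W1, W2])
        H = np.random.default_rng(1).random((W.shape[1], mix_mag.shape[1])) + 0.1
        for _ in range(iters):
            H *= (W.T @ mix_mag) / np.maximum(W.T @ W @ H, 1e-12)
        V1 = W1 @ H[: W1.shape[1]]             # per-source reconstructions
        V2 = W2 @ H[W1.shape[1]:]
        total = np.maximum(V1 + V2, 1e-12)
        return mix_mag * V1 / total, mix_mag * V2 / total  # masked estimates

    # Toy "training" result: each source has one characteristic basis vector.
    w1 = np.array([[1.0], [0.0], [0.0]])   # source 1 lives in bin 0
    w2 = np.array([[0.0], [0.0], [1.0]])   # source 2 lives in bin 2
    mix = np.array([[2.0, 1.0], [0.0, 0.0], [1.0, 3.0]])
    s1, s2 = separate(mix, w1, w2)
    ```

    Because the estimates are masked versions of the mixture, they always sum back to the mixture spectrogram, which is what makes the mask-based reconstruction step well behaved.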

    Super-resolution: A comprehensive survey
