
    Alpha-Stable Multichannel Audio Source Separation

    In this paper, we focus on modeling multichannel audio signals in the short-time Fourier transform domain for the purpose of source separation. We propose a probabilistic model based on a class of heavy-tailed distributions, in which the observed mixtures and the latent sources are jointly modeled using a certain class of multivariate alpha-stable distributions. As opposed to conventional Gaussian models, where the observations are constrained to lie within a few standard deviations of the mean, the proposed heavy-tailed model allows us to account for spurious data or important uncertainties in the model. We develop a Monte Carlo Expectation-Maximization algorithm for inference in the proposed model. We show that our approach leads to significant improvements in audio source separation under corrupted mixtures and in spatial audio object coding.
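
    As a rough illustration of why heavy tails matter here (not the paper's algorithm, which is a Monte Carlo EM over multivariate alpha-stable variables), the following sketch uses scipy's `levy_stable` distribution to compare how often Gaussian (alpha = 2) and alpha-stable (alpha = 1.5) samples land far from the bulk:

```python
# Illustrative only: tail behaviour of Gaussian vs. alpha-stable samples,
# mirroring the motivation that Gaussian models confine observations to a
# few standard deviations while alpha-stable models tolerate impulsive data.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n = 100_000

gaussian = rng.standard_normal(n)                      # the alpha = 2 special case
stable = levy_stable.rvs(alpha=1.5, beta=0.0, size=n,  # symmetric, heavy-tailed
                         random_state=rng)

for name, x in [("gaussian", gaussian), ("alpha=1.5", stable)]:
    # Fraction of samples more than 5 scale units from the median:
    frac = np.mean(np.abs(x - np.median(x)) > 5)
    print(f"{name:>10}: P(|x| > 5) ~ {frac:.2e}")
```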

    A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders

    Recent studies have explored the use of deep generative models of speech spectra based on variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps, in which the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the aforementioned iterative methods using sampling, while decreasing the computational cost by a factor of 36 to reach a given performance. Comment: Submitted to INTERSPEECH 2019.
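
    A minimal numpy sketch of the general VAE-plus-noise-model enhancement loop such studies share. The names `encode` and `decode` are placeholders for a pre-trained VAE; the paper's actual contribution is replacing the sampling or gradient step below with closed-form variational updates, which this sketch does not reproduce:

```python
# Skeleton of VAE-based speech enhancement in the STFT domain (illustrative).
# Assumes a pre-trained VAE: encode(power_spec) -> latent code z, and
# decode(z) -> clean-speech power spectrogram. Noise is a rank-K NMF model.
import numpy as np

def enhance(X, encode, decode, W_noise, H_noise, n_iter=50):
    """X: complex STFT of the noisy mixture, shape (F, T)."""
    V_noise = W_noise @ H_noise            # noise power spectrogram, (F, T)
    S_hat = X                              # initialise speech with the mixture
    for _ in range(n_iter):
        # Re-encode the current speech estimate; the paper derives this as an
        # analytical variational step using the pre-learned VAE encoder.
        z = encode(np.abs(S_hat) ** 2)
        V_speech = decode(z)               # speech power spectrogram, (F, T)
        # Wiener-like masking given the two variance models:
        mask = V_speech / (V_speech + V_noise)
        S_hat = mask * X
    return S_hat
```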

    Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization

    In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech spectro-temporal content. The parameters of this supervised model are learned using the framework of variational autoencoders. The noisy recording environment is assumed to be unknown, so the noise spectro-temporal modeling remains unsupervised and is based on non-negative matrix factorization (NMF). We develop a Monte Carlo expectation-maximization algorithm and experimentally show that the proposed approach outperforms its NMF-based counterpart, where speech is modeled using supervised NMF. Comment: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae
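
    The unsupervised noise side of such models typically relies on Itakura-Saito NMF. For reference, a minimal sketch of the classic multiplicative updates for V ≈ WH under the IS divergence (the paper embeds NMF updates inside a Monte Carlo EM; these are the standard standalone rules, not the paper's exact M-step):

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-10, seed=0):
    """Itakura-Saito NMF of a power spectrogram V (F, T) into W (F, K) @ H (K, T)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        Vh = W @ H + eps
        # Multiplicative updates minimising the IS divergence (Fevotte et al., 2009):
        H *= (W.T @ (V * Vh**-2)) / (W.T @ Vh**-1)
        Vh = W @ H + eps
        W *= ((V * Vh**-2) @ H.T) / (Vh**-1 @ H.T)
    return W, H
```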

    Data-driven multivariate and multiscale methods for brain computer interface

    This thesis focuses on the development of data-driven multivariate and multiscale methods for brain computer interface (BCI) systems. The electroencephalogram (EEG), the most convenient means of measuring neurophysiological activity due to its noninvasive nature, is mainly considered. The nonlinearity and nonstationarity inherent in EEG, together with its multichannel recording nature, require a new set of data-driven multivariate techniques to estimate features more accurately for enhanced BCI operation. A further long-term goal is to enable an alternative EEG recording strategy for long-term and portable monitoring.

    Empirical mode decomposition (EMD) and local mean decomposition (LMD), fully data-driven adaptive tools, are considered to decompose the nonlinear and nonstationary EEG signal into a set of components that are highly localised in time and frequency. It is shown that the complex and multivariate extensions of EMD, which can exploit common oscillatory modes within multivariate (multichannel) data, can be used to accurately estimate and compare amplitude and phase information among multiple sources, a key step in feature extraction for BCI systems. A complex extension of local mean decomposition is also introduced and its operation is illustrated on two-channel neuronal spike streams. Common spatial pattern (CSP), a standard feature extraction technique for BCI applications, is also extended to the complex domain using augmented complex statistics. Depending on the circularity/noncircularity of a complex signal, one of the complex CSP algorithms can be chosen to produce the best classification performance between two different EEG classes.

    Using these complex and multivariate algorithms, two cognitive brain studies are investigated towards a more natural and intuitive design of advanced BCI systems. Firstly, a Yarbus-style auditory selective attention experiment is introduced to measure user attention to one sound source among a mixture of sound stimuli, aimed at improving the usefulness of hearing instruments such as hearing aids. Secondly, emotion experiments elicited by taste and taste recall are examined to determine the pleasantness or unpleasantness of a food for the implementation of affective computing. The separation between the two emotional responses is examined using real- and complex-valued common spatial pattern methods. Finally, we introduce a novel approach to brain monitoring based on EEG recordings from within the ear canal, embedded on a custom-made hearing aid earplug. The new platform promises the possibility of both short- and long-term continuous use for standard brain monitoring and interfacing applications.
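
    For orientation, the core of standard real-valued CSP, which the thesis extends to the complex domain with augmented statistics, reduces to a generalized eigenvalue problem on the two class covariance matrices. A minimal sketch (baseline CSP only, not the thesis's complex-domain extension):

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X1, X2, n_pairs=3):
    """X1, X2: EEG trials per class, shape (trials, channels, samples).
    Returns 2*n_pairs spatial filters, shape (channels, 2*n_pairs)."""
    def avg_cov(X):
        covs = [x @ x.T / np.trace(x @ x.T) for x in X]  # normalised covariances
        return np.mean(covs, axis=0)
    C1, C2 = avg_cov(X1), avg_cov(X2)
    # Generalised eigenproblem C1 w = lambda (C1 + C2) w: eigenvalues near 1
    # favour class-1 variance, eigenvalues near 0 favour class-2 variance.
    vals, vecs = eigh(C1, C1 + C2)
    order = np.argsort(vals)
    picks = np.r_[order[:n_pairs], order[-n_pairs:]]
    return vecs[:, picks]
```

    Projecting each trial through these filters and taking log-variances of the filtered signals gives the discriminative features normally fed to a classifier.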

    Scalable Source Localization with Multichannel Alpha-Stable Distributions

    In this paper, we focus on the problem of sound source localization and propose a technique that exploits the known and arbitrary geometry of the microphone array. While most probabilistic techniques presented in the past rely on Gaussian models, we go further in this direction and detail a method for source localization based on the recently proposed alpha-stable harmonizable processes. These include Cauchy and Gaussian as special cases, and their remarkable feature is to allow simple modeling of impulsive, real-world sounds with few parameters. The approach we present builds on the classical convolutive mixing model and has the particularity of requiring only a single pass through the data, of working in the underdetermined case of more sources than microphones, and of allowing massively parallelizable implementations operating in the time-frequency domain. We show that the method yields promising performance for acoustic imaging in realistic simulations.
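
    As context only, a generic single-pass, grid-based localization scan with steering vectors built from the known array geometry; this is a plain delay-and-sum scoring sketch with hypothetical names, not the paper's alpha-stable estimator:

```python
import numpy as np

def localize(X, mic_pos, freqs, c=343.0, n_az=72):
    """X: multichannel STFT, shape (mics, F, T); mic_pos: (mics, 3) in metres;
    freqs: (F,) bin frequencies in Hz. Returns an azimuth grid and its scores."""
    az = np.linspace(0, 2 * np.pi, n_az, endpoint=False)
    directions = np.stack([np.cos(az), np.sin(az), np.zeros(n_az)], axis=1)
    scores = np.zeros(n_az)
    for i, d in enumerate(directions):
        delays = mic_pos @ d / c                            # per-mic delays (s)
        # Steering vectors per frequency bin, shape (mics, F):
        A = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
        # Coherent sum over mics, magnitude pooled over all TF bins:
        scores[i] = np.sum(np.abs(np.einsum("mf,mft->ft", A.conj(), X)))
    return az, scores
```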

    Notes on the use of variational autoencoders for speech and audio spectrogram modeling

    Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have recently been used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very little theoretical support is given to justify the chosen data representation and decoder likelihood function, or the corresponding cost function used for training the VAE. Yet a sound theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides insights into the choice and interpretability of the data representation and model parameterization.
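
    The key identity from that NMF literature, carried over to VAEs, is that a zero-mean circularly-symmetric complex Gaussian likelihood on an STFT coefficient equals, up to terms independent of the model, the Itakura-Saito divergence between the power spectrogram and the model variance. A sketch of the statement in one time-frequency bin, writing the decoder-predicted variance as σ²(z) (notation assumed here, not taken from the abstract):

```latex
% One TF bin: x = STFT coefficient, \sigma^2(z) = decoder-predicted variance.
% With d_{IS}(a \| b) = a/b - \log(a/b) - 1:
\begin{aligned}
-\log \mathcal{N}_c\bigl(x;\, 0,\, \sigma^2(z)\bigr)
  &= \log\bigl(\pi\,\sigma^2(z)\bigr) + \frac{|x|^2}{\sigma^2(z)} \\
  &= d_{\mathrm{IS}}\bigl(|x|^2 \,\big\|\, \sigma^2(z)\bigr) + \log|x|^2 + 1 + \log\pi .
\end{aligned}
```

    Since the extra terms do not depend on the model, maximizing this decoder likelihood is equivalent to minimizing the IS divergence between the observed power spectrogram and the decoder output, exactly as in IS-NMF.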

    Projection-based demixing of spatial audio

    We propose a method to unmix multichannel audio signals into their different constitutive spatial objects. To achieve this, we characterize an audio object through both a spatial and a spectro-temporal model. The particularity of the spatial model we pick is that it neither assumes an object has only one underlying point source, nor attempts to model the complex room acoustics. Instead, it focuses on a listener perspective, and takes each object as the superposition of many contributions with different incoming directions and inter-channel delays. Our spectro-temporal probabilistic model is based on the recently proposed α-harmonisable processes, which are adequate for signals with large dynamics, such as audio. The main originality of this work is then to provide a new way to estimate and exploit the inter-channel dependences of an object for the purpose of demixing. In the Gaussian (α = 2) case, previous research focused on covariance structures. This approach is no longer valid for α < 2, where covariances are not defined. Instead, we show how simple linear combinations of the mixture channels can be used to learn the model parameters, and the method we propose consists in pooling the estimates based on many projections to correctly account for the original multichannel audio. Intuitively, each such downmix of the mixture provides a new perspective where some objects are cancelled or enhanced. Finally, we also explain how to recover the different spatial audio objects once all parameters have been computed. Performance of the method is illustrated on the separation of stereophonic music signals. Index Terms: source separation, probabilistic models, non-negative matrix factorization, musical source separation.
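
    A minimal sketch of the projection idea for a stereo mixture; the subsequent model fitting on the |·|^α spectrograms of these downmixes is not shown. Each angle below gives a linear combination in which objects panned orthogonally to it are attenuated and objects aligned with it are boosted:

```python
import numpy as np

def project_stereo(X, n_angles=8):
    """X: stereo STFT, shape (2, F, T). Returns (n_angles, F, T): one
    single-channel downmix per projection direction theta."""
    thetas = np.linspace(0, np.pi, n_angles, endpoint=False)
    # Row [cos(theta), sin(theta)] cancels objects whose left/right panning
    # is orthogonal to the direction and enhances those aligned with it.
    P = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # (n_angles, 2)
    return np.einsum("pc,cft->pft", P, X)
```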

    Underdetermined convolutive source separation using two dimensional non-negative factorization techniques

    In this thesis, underdetermined audio source separation is considered, that is, estimating the original audio sources from the observed mixture when the number of audio sources is greater than the number of channels. The separation is carried out using two approaches: blind audio source separation and informed audio source separation. The blind approach depends on the mixture signal only, and assumes that separation is accomplished without any prior information (or with as little as possible) about the sources. The informed approach uses an exemplar, in addition to the mixture signal, to emulate the target speech signal to be separated. Both approaches are based on two-dimensional factorization techniques that decompose the signal into two tensors convolved in both the temporal and spectral directions, and both are applied to convolutive and highly reverberant convolutive mixtures, which are more realistic than instantaneous mixtures.

    In this work, a novel algorithm based on nonnegative matrix factor two-dimensional deconvolution (NMF2D) with adaptive sparsity is proposed to separate audio sources mixed in an underdetermined convolutive mixture. Additionally, a novel Gamma Exponential Process is proposed for estimating the convolutive parameters and the number of components of the NMF2D/NTF2D, and for initializing the NMF2D parameters. The effects of different window lengths are also investigated to determine the best-fit model for the characteristics of the audio signal. Furthermore, a novel algorithm, namely the fusion of K models of full-rank weighted nonnegative tensor factor two-dimensional deconvolution (K-wNTF2D), is proposed. The K-wNTF2D is developed for its ability to model both the spectral and temporal changes, and the spatial covariance matrix that addresses the high-reverberation problem. A variable sparsity derived from the Gibbs distribution is optimized under the Itakura-Saito divergence and adapted into the K-wNTF2D model. The tensors of this algorithm are initialized by a novel initialization method, namely SVD two-dimensional deconvolution (SVD2D).

    Finally, two novel informed source separation algorithms, namely a semi-exemplar-based algorithm and an exemplar-based algorithm, are proposed. These algorithms are based on the NMF2D model and the proposed two-dimensional nonnegative matrix partial co-factorization (2DNMPCF) model. The idea of incorporating the exemplar is to inform the proposed separation algorithms about the target signal to be separated, by initializing their parameters and guiding the separation. Adaptive sparsity is derived for both of the proposed algorithms. A multistage version of the proposed exemplar-based algorithm is also proposed to further enhance the separation performance. Results show that the proposed separation algorithms are very promising, more flexible, and offer an alternative to conventional methods.
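
    For orientation, the NMF2D model at the heart of the thesis approximates a spectrogram by templates shifted in both frequency (pitch shift φ) and time (convolution lag τ). A minimal sketch of the reconstruction step only; the multiplicative updates and the proposed sparsity and initialization schemes are not reproduced:

```python
import numpy as np

def nmf2d_reconstruct(W, H):
    """NMF2D model: V ~ sum over (tau, phi) of frequency-shifted templates
    times time-shifted activations. W: (F, K, n_tau) spectral templates;
    H: (K, T, n_phi) activations. Returns the (F, T) model spectrogram."""
    F, K, n_tau = W.shape
    _, T, n_phi = H.shape
    V = np.zeros((F, T))
    for tau in range(n_tau):
        for phi in range(n_phi):
            Wf = np.zeros((F, K))
            Wf[phi:, :] = W[:F - phi or F, :, tau]   # shift templates down phi bins
            Ht = np.zeros((K, T))
            Ht[:, tau:] = H[:, :T - tau or T, phi]   # shift activations right tau frames
            V += Wf @ Ht
    return V
```

    The frequency shift lets one log-frequency template cover several pitches, and the time shift lets one activation pattern cover a whole temporal motif, which is what makes the model suited to the convolutive mixtures considered here.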

    Generalized Wiener filtering with fractional power spectrograms

    In recent years, many studies have focused on the single-sensor separation of independent waveforms using so-called soft-masking strategies, where the short-time Fourier transform of the mixture is multiplied element-wise by a ratio of spectrogram models. When the signals are wide-sense stationary, this strategy is theoretically justified as optimal Wiener filtering: the power spectrograms of the sources are supposed to add up to yield the power spectrogram of the mixture. However, experience shows that using fractional spectrograms instead, such as the amplitude, yields good performance in practice, because they experimentally better fit the additivity assumption. To the best of our knowledge, no probabilistic interpretation of this filtering procedure was available to date. In this paper, we show that assuming the additivity of fractional spectrograms for the purpose of building soft-masks can be understood as separating locally stationary alpha-stable harmonizable processes (alpha-harmonizable in short), thus justifying the procedure theoretically.
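
    A minimal sketch of the soft-masking procedure the paper analyses, assuming per-source fractional (alpha-power) spectrogram models are already available, e.g. from NMF:

```python
import numpy as np

def fractional_masking(X, V, eps=1e-12):
    """X: mixture STFT, shape (F, T); V: per-source alpha-power spectrogram
    models, shape (J, F, T), assumed additive across sources. Returns the J
    masked source STFT estimates, shape (J, F, T)."""
    masks = V / (V.sum(axis=0, keepdims=True) + eps)  # soft-masks in [0, 1]
    return masks * X                                   # broadcasts over sources
```

    With power-spectrogram models (alpha = 2) this is the classical Wiener filter; with amplitude models (alpha = 1) it is the popular heuristic the paper justifies via alpha-harmonizable processes.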

    PROJET - Spatial Audio Separation Using Projections

    We propose a projection-based method for the unmixing of multichannel audio signals into their different constituent spatial objects. Here, spatial objects are modelled using a unified framework which handles both point sources and diffuse sources. We then propose a novel methodology to estimate and take advantage of the spatial dependencies of an object. Where previous research has processed the original multichannel mixtures directly, focusing principally on inter-channel covariance structures, here we instead process projections of the multichannel signal onto many different spatial directions. These linear combinations are observations in which some spatial objects are cancelled or enhanced. We then propose an algorithm which takes these projections as the observations, discarding dependencies between them. Since each projection contains global information regarding all channels of the original multichannel mixture, this provides an effective means of learning the parameters of the original audio while avoiding joint processing of all the channels. We further show how to recover the separated spatial objects and demonstrate the technique on stereophonic music signals.
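
    Once an object has been estimated in every projection, one natural recovery step is a least-squares inversion of the projection matrix back to the original channels. A minimal sketch of that step under this assumption, taking the per-projection estimates `Y_proj` as given:

```python
import numpy as np

def recover_object(Y_proj, P):
    """Y_proj: one object's estimates in each projection, shape (n_proj, F, T);
    P: projection matrix, shape (n_proj, n_channels). Returns the object's
    multichannel image, shape (n_channels, F, T), by least squares."""
    pinv = np.linalg.pinv(P)                      # (n_channels, n_proj)
    return np.einsum("cp,pft->cft", pinv, Y_proj)
```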