    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user's signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short-Time Fourier Transform (STFT) domain. One of these bases models the spectral energy of the acoustic echo signal and is formed from the incoming far-end user's speech, while the other models the spectral energy of the near-end speaker and is trained on speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes, at similar computational cost. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC.

    Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the echo estimate produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk. Using a standard evaluation technique, the proposed algorithm is shown to have detection performance comparable to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room-change insensitivity of speaker extraction, generating minimal false double-talk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum.

    Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
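    As a concrete illustration of the decomposition described above, the following is a minimal NumPy sketch of semi-supervised NMF separation with a fixed echo basis and a fixed pre-trained speaker basis, updating only the activations with multiplicative (KL-divergence) rules; the function name, shapes and iteration count are illustrative assumptions, not details taken from the thesis.

        import numpy as np

        def extract_speaker(V, W_echo, W_spk, n_iter=100, eps=1e-12):
            # V: (F, T) magnitude STFT of the near-end microphone signal.
            # W_echo, W_spk: fixed nonnegative bases; only H is updated.
            W = np.hstack([W_echo, W_spk])               # union of the two bases
            H = np.random.rand(W.shape[1], V.shape[1])
            for _ in range(n_iter):
                WH = W @ H + eps
                H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
            K = W_echo.shape[1]
            V_echo = W_echo @ H[:K]                      # echo spectral estimate
            V_spk = W_spk @ H[K:]                        # speaker spectral estimate
            mask = V_spk / (V_echo + V_spk + eps)        # Wiener-style mask
            return mask * V, V_echo                      # speaker estimate, echo estimate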

    Comparison of single channel blind dereverberation methods for speech signals

    Reverberation is an effect caused by echoes from objects as an audio wave travels from the source to a listener. This channel effect can be modeled by a finite impulse response filter, which, in the case of speech recorded in a room, is called the room impulse response (RIR). Reverberation, especially with a long filter, severely degrades recorded speech signals and can significantly affect applications such as Automatic Speech Recognition (ASR) and hands-free teleconferencing; it can reduce ASR performance even in a system trained on a database of reverberated speech. If the reverberation environment is known, the echoes can be removed using simple methods. In most cases, however, it is unknown and the processing must be done blind, without knowledge of the reverberation environment; in the literature this is called the blind dereverberation problem. Although several methods have been proposed to solve it, the difficulty of knowing neither the signal nor the filter means the echoes are hard to remove completely from speech signals. This thesis compares several existing methods, namely Laplacian-based weighted prediction error (L-WPE), Gaussian weighted prediction error (G-WPE), NMF-based temporal spectral modeling (NMF+N-CTF) and delayed linear prediction (DLP), and proposes a new method that we call sparsity penalized weighted least squares (SPWLS). In our experiments we obtained the best results with L-WPE followed by G-WPE, whereas the new SPWLS method, initialized with G-WPE, obtained slightly better signal-to-noise ratio and perceptual quality values when the room impulse responses are long.
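    For reference, delayed linear prediction, the simplest of the compared methods, can be sketched in a few lines of NumPy: for each frequency bin of the STFT, late reverberation is predicted from delayed past frames by least squares and subtracted. This is a plain (unweighted) variant written only for illustration; the WPE-family methods compared in the thesis reweight this regression by an estimated desired-signal variance, and the delay, order and names here are illustrative assumptions.

        import numpy as np

        def dlp_dereverb(X, delay=3, order=20):
            # X: (n_freq, n_frames) complex STFT of the reverberant speech.
            F, T = X.shape
            D = np.empty_like(X)
            for f in range(F):
                x = X[f]
                A = np.zeros((T, order), dtype=complex)   # delayed past frames
                for k in range(order):
                    lag = delay + k
                    A[lag:, k] = x[:T - lag]
                g, *_ = np.linalg.lstsq(A, x, rcond=None) # LS prediction filter
                D[f] = x - A @ g                          # subtract predicted reverb
            return D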

    Speech Dereverberation Based on Multi-Channel Linear Prediction

    Room reverberation can severely degrade the auditory quality and intelligibility of speech signals received by distant microphones in an enclosed environment. In recent years, various dereverberation algorithms have been developed to tackle this problem, such as beamforming and inverse filtering of the room transfer function. However, such methods rely heavily on precise estimation of either the direction of arrival (DOA) or the room acoustic characteristics, so their performance is limited. A more promising category of dereverberation algorithms is based on multi-channel linear prediction (MCLP). This idea was first proposed in the time domain, where the speech signal is highly correlated over short periods. To ensure good suppression of the reverberation, the prediction filter must be longer than the reverberation time; as a result, the complexity of this algorithm is often unacceptable because of the large covariance matrix calculations involved. To overcome this disadvantage, this thesis focuses on MCLP dereverberation methods performed in the short-time Fourier transform (STFT) domain.

    Recently, the weighted prediction error (WPE) algorithm has been developed and widely applied to speech dereverberation. In the WPE algorithm, MCLP is used in the STFT domain to estimate the late reverberation components from previous frames of the reverberant speech, and the enhanced speech is obtained by subtracting this late reverberation from the reverberant speech. Each STFT coefficient is assumed to be independent and Gaussian distributed, and a maximum likelihood (ML) problem is formulated in each frequency bin to calculate the predictor coefficients.

    In this thesis, the original WPE algorithm is improved in two respects. First, two more expressive statistical models, the generalized Gaussian distribution (GGD) and the Laplacian distribution, are employed instead of the classic Gaussian distribution; both are shown to fit the histogram of clean speech better. Second, we improve the estimation of the variances of the STFT coefficients of the desired signal. In the original WPE algorithm the variances are estimated in each frequency bin independently, without considering cross-frequency correlation; we therefore integrate nonnegative matrix factorization (NMF) into the WPE algorithm to refine the variance estimates and hence obtain better dereverberation performance.

    Another category of MCLP-based dereverberation algorithms proposed in the literature exploits the sparsity of the STFT coefficients of the desired signal when calculating the predictor coefficients. We also investigate an efficient algorithm based on maximizing the group sparsity of the desired signal using mixed norms. Inspired by the idea of the sparse linear predictor (SLP), we propose adding a sparsity constraint on the predictor coefficients to further improve dereverberation performance, with a weighting parameter that trades off the sparsity of the desired signal against that of the predictor coefficients.

    Computer simulations of the proposed dereverberation algorithms show that they can significantly improve the quality of reverberant speech under different reverberation times, and subjective evaluation gives a more intuitive demonstration of the enhanced speech intelligibility. Performance comparisons also show that our algorithms outperform some state-of-the-art dereverberation techniques.
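    The core of the WPE iteration described above is compact enough to sketch. The following NumPy fragment processes one frequency bin under the Gaussian model: the desired-signal variance is re-estimated from the current output and used to reweight the multichannel linear prediction. All names, the delay, the filter order and the iteration count are illustrative assumptions rather than the thesis's exact configuration.

        import numpy as np

        def wpe_bin(X, delay=3, order=10, iters=3, eps=1e-8):
            # X: (n_ch, n_frames) complex STFT of one frequency bin;
            # channel 0 is dereverberated by multichannel linear prediction.
            M, T = X.shape
            A = np.zeros((T, M * order), dtype=complex)   # stacked delayed frames
            for m in range(M):
                for k in range(order):
                    lag = delay + k
                    A[lag:, m * order + k] = X[m, :T - lag]
            d = X[0].copy()
            for _ in range(iters):
                lam = np.maximum(np.abs(d) ** 2, eps)     # desired-signal variance
                Aw = A / lam[:, None]                     # variance-weighted rows
                g = np.linalg.solve(Aw.conj().T @ A + eps * np.eye(M * order),
                                    Aw.conj().T @ X[0])   # weighted normal equations
                d = X[0] - A @ g                          # updated desired signal
            return d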

    Speech Modeling and Robust Estimation for Diagnosis of Parkinson’s Disease

    Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

    This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) has been used to directly solve the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Although this unsupervised technique, called neural FCA, is in principle free from the domain mismatch problem, the full rankness of the spatial model, which buys robustness against relatively short reverberation, makes it computationally demanding. To reduce the model complexity without sacrificing performance, we propose neural FastFCA, based on a jointly-diagonalizable yet full-rank spatial model. Our separation model alternates neural network blocks with single steps of an efficient iterative algorithm called iterative source steering (ISS). This alternating architecture enables the model to quickly separate the mixture spectrogram by leveraging both the deep neural network and the multichannel optimization algorithm. The training objective with AVI is derived to maximize the marginalized likelihood of the observed mixtures. Experiments using mixture signals of two to four sound sources show that neural FastFCA outperforms conventional BSS methods and reduces the computational time to about 2% of that of neural FCA.
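    For readers unfamiliar with iterative source steering, the rank-1 update it applies in each frequency bin can be sketched as follows. This follows our reading of the published ISS update for determined separation and is an illustration, not code from the paper; shapes and names are assumptions.

        import numpy as np

        def iss_step(W, Y, R, eps=1e-8):
            # W: (K, M) complex demixing matrix; Y = W @ X: (K, T) source
            # estimates; R: (K, T) source power estimates from the spatial model.
            K, T = Y.shape
            for s in range(K):
                ys = Y[s]
                num = (Y * ys.conj()[None, :] / R).sum(axis=1)
                den = (np.abs(ys) ** 2 / R).sum(axis=1) + eps
                v = num / den                             # steering vector
                v[s] = 1.0 - 1.0 / np.sqrt(den[s] / T)    # self-update term
                W -= np.outer(v, W[s])                    # rank-1 demixing update
                Y -= np.outer(v, ys)                      # keep estimates in sync
            return W, Y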

    End-to-end non-negative auto-encoders: a deep neural alternative to non-negative audio modeling

    Over the last decade, non-negative matrix factorization (NMF) has emerged as one of the most popular approaches to modeling audio signals. NMF allows us to factorize the magnitude spectrogram to learn representative spectral bases that can be used for a wide range of applications. With recent advances in deep learning, neural networks (NNs) have surpassed NMF in terms of performance; however, these NNs are trained discriminatively and, compared to NMF, lack several key characteristics such as re-usability and robustness. In this dissertation, we develop and investigate end-to-end non-negative autoencoders (NAEs) as an updated deep-learning-based alternative framework for non-negative audio modeling. We show that end-to-end NAEs combine the modeling advantages of non-negative matrix factorization with the generalizability of neural networks, while delivering significant improvements in performance. To this end, we first interpret NMF as an NAE and show that the two approaches are equivalent semantically and in terms of source separation performance. We then exploit the availability of sophisticated neural network architectures to propose several extensions to NAEs, and demonstrate that these modeling improvements significantly boost their performance.

    In audio processing applications, the short-time Fourier transform (STFT) is used as a universal first step, and algorithms and neural networks are designed to operate on the resulting magnitude spectrograms. We interpret the sequence of steps involved in computing the STFT as additional neural network layers, which enables us to propose end-to-end processing pipelines that operate directly on raw waveforms. In the context of source separation, we show that end-to-end processing gives a significant improvement in performance over existing spectrogram-based methods. Furthermore, to train these end-to-end models, we investigate cost functions derived from objective evaluation metrics as measured on waveforms, and present subjective listening test results that reveal insights into the performance of these cost functions for end-to-end source separation. Combining the adaptive front-end layers with NAEs, we propose end-to-end NAEs and show how they can be used for end-to-end generative source separation. Our experiments indicate that these models deliver separation performance comparable to that of discriminative NNs, while retaining the modularity of NMF and the modeling flexibility of neural networks. Finally, we present an approach to training these end-to-end NAEs using mixtures only, without access to clean training examples.
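    To make the NMF-NAE correspondence concrete, here is a minimal PyTorch sketch of a non-negative autoencoder in the spirit described above: ReLUs keep the latent code and the reconstruction non-negative, so the decoder plays the role of the NMF basis and the encoder output plays the role of the activations. The layer sizes, loss and training loop are illustrative assumptions, not the dissertation's exact architecture.

        import torch
        import torch.nn as nn

        class NAE(nn.Module):
            def __init__(self, n_freq=513, n_latent=32):
                super().__init__()
                # Non-negativity imposed on code and output via ReLU.
                self.encoder = nn.Sequential(nn.Linear(n_freq, n_latent), nn.ReLU())
                self.decoder = nn.Sequential(nn.Linear(n_latent, n_freq), nn.ReLU())

            def forward(self, mag_frames):      # (batch, n_freq) magnitudes
                h = self.encoder(mag_frames)    # non-negative code, like NMF's H
                return self.decoder(h)          # non-negative reconstruction, like W @ H

        model = NAE()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        V = torch.rand(64, 513)                 # placeholder magnitude frames
        for _ in range(100):
            opt.zero_grad()
            loss = torch.mean((model(V) - V) ** 2)
            loss.backward()
            opt.step()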

    Binary Sparse Coding of Convolutive Mixtures for Sound Localization and Separation via Spatialization

    We propose a sparse coding approach to the problem of source-sensor localization and speech reconstruction. The approach relies on designing a dictionary of spatialized signals by projecting the microphone array recordings onto array manifolds characterized for different locations in a reverberant enclosure using the image model. Sparse representation over this dictionary enables identifying the subspace of the actual recordings and its correspondence to the source and sensor locations. The speech signal is reconstructed by inverse filtering the acoustic channels associated with the array manifolds. We provide a rigorous analysis of the optimality of speech reconstruction by elucidating the links between inverse filtering and source separation followed by deconvolution. This procedure is evaluated for localization, reconstruction and recognition of simultaneous speech sources using real data recordings. The results demonstrate the effectiveness of the proposed approach, which compares favorably against beamforming and independent component analysis techniques.
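    The paper's binary sparse coding formulation is specific to it, but the role of the spatialized dictionary is easy to illustrate with a generic greedy sparse recovery such as orthogonal matching pursuit: the indices of the selected atoms identify candidate source and sensor locations, since each dictionary column corresponds to a position hypothesized via the image model. The following NumPy sketch is that generic stand-in, not the paper's algorithm.

        import numpy as np

        def omp(D, y, n_atoms):
            # D: dictionary whose columns are spatialized signals for a grid
            # of candidate positions; y: stacked microphone array recording.
            residual = y.copy()
            support = []
            x = np.zeros(D.shape[1], dtype=D.dtype)
            for _ in range(n_atoms):
                idx = int(np.argmax(np.abs(D.conj().T @ residual)))
                support.append(idx)                       # best-matching atom
                coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
                residual = y - D[:, support] @ coef       # re-fit on support
            x[support] = coef
            return x, support                             # sparse code, locations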

    SkipConvGAN: Monaural Speech Dereverberation using Generative Adversarial Networks via Complex Time-Frequency Masking

    With advances in deep learning, the performance of speech enhancement systems in the presence of background noise has improved significantly. However, improving robustness against reverberation is still a work in progress, as reverberation tends to destroy formant structure through smearing effects in time and frequency. A wide range of deep-learning-based systems either enhance the magnitude response and reuse the distorted phase, or enhance the complex spectrogram using a complex time-frequency mask. Though these approaches have demonstrated satisfactory performance, they do not directly address the formant structure lost to reverberation. We believe that retrieving the formant structure can help improve the efficiency of existing systems. In this study, we propose SkipConvGAN, an extension of our prior work SkipConvNet. The proposed system's generator network estimates an efficient complex time-frequency mask, while the discriminator network aids in driving the generator to restore the lost formant structure. We evaluate the performance of our proposed system on simulated and real recordings of reverberant speech from the single-channel task of the REVERB challenge corpus. The proposed system shows a consistent improvement across multiple room configurations over other deep-learning-based generative adversarial frameworks.
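    The distinction between complex and magnitude-only masking that motivates the paper can be shown in a couple of lines; this is a generic illustration of the two masking styles, not SkipConvGAN's network itself.

        import numpy as np

        def apply_complex_mask(X, M_real, M_imag):
            # Complex masking modifies both magnitude and phase of the STFT X.
            return (M_real + 1j * M_imag) * X

        def apply_magnitude_mask(X, M_mag):
            # Magnitude masking scales |X| but reuses the distorted phase.
            return M_mag * np.abs(X) * np.exp(1j * np.angle(X))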