    DNN-Assisted Speech Enhancement Approaches Incorporating Phase Information

    Speech enhancement is a widely adopted technique that removes interference from corrupted speech to improve speech quality and intelligibility. Speech enhancement methods can be implemented in either the time domain or the time-frequency (T-F) domain. Among the various proposed methods, the T-F domain methods, which synthesize the enhanced speech from the estimated magnitude spectrogram and the noisy phase spectrogram, have gained the most popularity over the past few decades. However, these techniques tend to ignore the importance of phase processing. To overcome this problem, this thesis aims to jointly enhance the magnitude and phase spectra by means of recent deep neural networks (DNNs). More specifically, three major contributions are presented in this thesis. First, we present new schemes based on the basic Kalman filter (KF) to remove background noise from noisy speech in the time domain, where the KF acts as a joint estimator for both the magnitude and phase spectra of speech. A DNN-augmented basic KF is first proposed, in which a DNN is applied to estimate key parameters of the KF, namely the linear prediction coefficients (LPCs). By training the DNN on a large database and exploiting its powerful learning ability, the proposed algorithm estimates LPCs from noisy speech more accurately and robustly, leading to improved performance compared with traditional KF-based approaches to speech enhancement. We further present a high-frequency (HF) component restoration algorithm to attenuate the degradation in the HF regions of the Kalman-filtered speech, in which DNN-based bandwidth extension is applied to estimate the magnitude of the HF component from its low-frequency (LF) counterpart. By incorporating the restoration algorithm, the enhanced speech suffers less distortion in the HF component.
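The key KF parameters that the DNN is trained to estimate are the LPCs of the speech's autoregressive model. As a point of reference, a minimal sketch of the classical LPC estimator that the DNN replaces (autocorrelation method with the Levinson-Durbin recursion) might look like this; the function and variable names are illustrative, not from the thesis:

```python
import numpy as np

def lpc(frame, order):
    """Classical LPC estimation for one speech frame: biased autocorrelation
    followed by the Levinson-Durbin recursion. Returns [1, a1, ..., ap] such
    that s[n] ~= -sum_i a_i * s[n-i], plus the final prediction error power."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)]) / n
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction residual
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]   # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

On clean speech this recursion is accurate, but on noisy speech the autocorrelation is biased by the noise, which is precisely the weakness that motivates replacing the estimator with a trained DNN.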
Moreover, we propose a hybrid speech enhancement system that exploits a DNN for speech reconstruction and Kalman filtering for further denoising. Two separate networks are adopted to estimate the magnitude spectrogram and the LPCs of the clean speech, respectively. The estimated clean magnitude spectrogram is combined with the phase of the noisy speech to reconstruct the estimated clean speech. A KF with the estimated parameters is then used to remove the residual noise in the reconstructed speech. The proposed hybrid system takes advantage of both DNN-based reconstruction and traditional Kalman filtering, and works reliably in both matched and mismatched acoustic environments. Next, we incorporate the DNN-based parameter estimation scheme into two advanced KFs: the subband KF and the colored-noise KF. The DNN-augmented subband KF method decomposes the noisy speech into several subbands and applies Kalman filtering to each subband signal, with the KF parameters estimated by the trained DNN. The final enhanced speech is then obtained by synthesizing the enhanced subband signals. In the DNN-augmented colored-noise KF system, both the clean speech and the noise are modelled as autoregressive (AR) processes, whose parameters comprise the LPCs and the driving-noise variances. The LPCs are obtained by training a multi-objective DNN, while the driving-noise variances are obtained by solving an optimization problem that minimizes the difference between the modelled and observed AR spectra of the noisy speech. The colored-noise KF with DNN-estimated parameters is then applied to the noisy speech for denoising. A post-subtraction technique is adopted to further remove the residual noise in the Kalman-filtered speech.
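To make the role of these estimated parameters concrete, the following is a minimal time-domain Kalman filter for speech denoising under an AR speech model, with the LPCs and noise variances passed in directly (in the thesis they would come from the DNN and the spectrum-matching optimization). This is an illustrative sketch of the standard formulation, not the thesis's exact system:

```python
import numpy as np

def kalman_denoise(y, lpc_a, q_var, r_var):
    """Denoise y[n] = s[n] + v[n], where s follows the AR model implied by
    lpc_a = [1, a1, ..., ap] (i.e. s[n] = -sum_i a_i * s[n-i] + w[n]).
    q_var: driving-noise variance of w; r_var: observation-noise variance of v."""
    lpc_a = np.asarray(lpc_a, dtype=float)
    p = len(lpc_a) - 1
    # Companion-form state transition for state x[n] = [s[n-p+1], ..., s[n]]
    F = np.zeros((p, p))
    if p > 1:
        F[:-1, 1:] = np.eye(p - 1)
    F[-1, :] = -lpc_a[:0:-1]
    H = np.zeros((1, p)); H[0, -1] = 1.0      # we observe the newest sample
    Q = np.zeros((p, p)); Q[-1, -1] = q_var   # process noise drives s[n] only
    x, P = np.zeros(p), np.eye(p)
    out = np.empty(len(y))
    for n, yn in enumerate(y):
        # Predict one step ahead with the AR model
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with the noisy observation
        s_innov = float(H @ P @ H.T) + r_var
        k_gain = (P @ H.T).ravel() / s_innov
        x = x + k_gain * float(yn - H @ x)
        P = P - np.outer(k_gain, H @ P)
        out[n] = x[-1]
    return out
```

The filter's output quality hinges entirely on the accuracy of `lpc_a`, `q_var`, and `r_var`, which is why the thesis devotes the DNN to estimating them from the noisy signal.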
Extensive computer simulations show that the two proposed advanced KF systems achieve significant performance gains over conventional KF-based algorithms as well as recent DNN-based methods under both seen and unseen noise conditions. Finally, we focus on T-F domain speech enhancement with masking techniques, which aim to retain the speech-dominant components of the noisy speech while suppressing the noise-dominant parts. We first derive a new type of mask, namely the constrained ratio mask (CRM), to better control the trade-off between speech distortion and residual noise in the enhanced speech. The CRM is estimated by a trained DNN from the input noisy feature set and is applied to the noisy magnitude spectrogram for denoising. We further extend the CRM to complex spectrogram estimation, where the enhanced magnitude spectrogram is obtained with the CRM, while the estimated phase spectrogram is reconstructed from the noisy phase spectrogram and the phase derivatives. Performance evaluation reveals that the proposed CRM outperforms several traditional masks in terms of objective metrics. Moreover, the enhanced speech resulting from the CRM-based complex spectrogram estimation has better speech quality than that obtained without phase reconstruction.
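The CRM itself is defined in the thesis; as a generic illustration of the mechanism it builds on, namely ratio masking with an explicit constraint on the mask values, an oracle sketch might look like the following. Note the differences from the thesis: the exact CRM definition and constraint are the thesis's own, and there the mask is predicted by a DNN from noisy features rather than computed from the clean signal:

```python
import numpy as np

def constrained_ratio_mask(clean_mag, noise_mag, floor=0.0, ceiling=1.0, eps=1e-12):
    # Oracle ratio mask |S| / (|S| + |N|), clipped to [floor, ceiling].
    # Raising `floor` leaves more residual noise but less speech distortion;
    # lowering `ceiling` trades the other way -- the balance the CRM controls.
    mask = clean_mag / (clean_mag + noise_mag + eps)
    return np.clip(mask, floor, ceiling)

def apply_mask(noisy_stft, mask):
    # Denoise by scaling the noisy magnitude while reusing the noisy phase.
    return mask * np.abs(noisy_stft) * np.exp(1j * np.angle(noisy_stft))
```

Reusing the noisy phase in `apply_mask` is exactly the limitation the thesis's complex spectrogram extension addresses by additionally reconstructing the phase from its derivatives.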

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user's signal from a mixture that also contains acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform (STFT) domain. One of these bases represents the spectral energy of the acoustic echo signal and is formed from the incoming far-end user's speech, while the other represents the spectral energy of the near-end speaker and is trained on speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes, at similar computational cost. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the echo estimate produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk.
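The NMF decomposition at the heart of the speaker extraction can be sketched as follows: the microphone magnitude spectrogram is decomposed on the fixed union of an echo basis and a speaker basis, only the activations are updated (KL-divergence multiplicative rules here; the thesis's cost function and reconstruction may differ), and the speaker component is recovered with a Wiener-like mask. All names are illustrative:

```python
import numpy as np

def separate_speaker(V, W_echo, W_spk, n_iter=100, eps=1e-12):
    """Extract the near-end speaker's magnitude spectrogram from the mic
    magnitude spectrogram V (freq x time) by NMF onto the union of a fixed
    echo basis (formed from far-end speech) and a fixed pre-trained speaker
    basis. Only the activations H are updated."""
    W = np.hstack([W_echo, W_spk])                     # fixed union of bases
    H = np.random.rand(W.shape[1], V.shape[1]) + eps   # random nonneg activations
    for _ in range(n_iter):
        # KL-divergence multiplicative update for H with W held fixed
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    k = W_echo.shape[1]
    V_echo = W_echo @ H[:k]
    V_spk = W_spk @ H[k:]
    return V * V_spk / (V_spk + V_echo + eps)          # Wiener-like reconstruction
```

Because the bases are fixed and only the activations are fitted per block, the decomposition adapts instantly to new mixtures, which is consistent with the immediate echo mitigation and room-change robustness reported above.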
Using a standard evaluation technique, the proposed algorithm is shown to have detection performance comparable to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room-change insensitivity of speaker extraction, generating minimal false double-talk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
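The decision-variable idea behind the proposed DTD can be illustrated with a plain normalized correlation between the microphone block and the echo estimate (here standing in for the estimate produced by NMF-based speaker extraction). The thesis's exact normalization differs, and the threshold below is an arbitrary illustrative value:

```python
import numpy as np

def dtd_decide(mic_block, echo_est_block, threshold=0.8, eps=1e-12):
    """Block-based double-talk decision. When a block contains echo only,
    the echo estimate explains most of the mic signal and the normalized
    correlation is near 1; near-end speech pushes the statistic down, and
    dropping below the fixed threshold declares double-talk."""
    xi = np.dot(mic_block, echo_est_block) / (
        np.linalg.norm(mic_block) * np.linalg.norm(echo_est_block) + eps)
    return xi, bool(xi < threshold)
```

Because the statistic depends on the echo estimate rather than on the adaptive filter's convergence state, a room change that temporarily ruins the AEC filter need not trigger false double-talk indications, matching the behaviour described above.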

    Speech Modeling and Robust Estimation for Diagnosis of Parkinson's Disease


    Data-driven Speech Intelligibility Enhancement and Prediction for Hearing Aids

    Hearing impairment is a widespread problem around the world. It is estimated that one in six people lives with some degree of hearing loss. Moderate and severe hearing impairment has been recognised as one of the major causes of disability, and is associated with a decline in quality of life, mental illness, and dementia. However, investigation shows that only 10-20% of older people with significant hearing impairment wear hearing aids. One of the main factors behind this low uptake is that current devices struggle to help hearing aid users understand speech in noisy environments. To compensate for the elevated hearing thresholds and the degraded source separation caused by the impaired auditory system, amplification and denoising have been the major focuses of current hearing aid studies aiming to improve the intelligibility of speech in noise. It is also important to derive a metric that can fairly predict speech intelligibility, to better guide the development of hearing aid techniques. This thesis aims to enhance the speech intelligibility of hearing impaired listeners. Motivated by the success of data-driven approaches in many speech processing applications, this work proposes the differentiable hearing aid speech processing (DHASP) framework to optimise both the amplification and denoising modules within a hearing aid processor. This is accomplished by setting an intelligibility-based optimisation objective and taking advantage of large-scale speech databases to train the hearing aid processor to maximise intelligibility for the listeners. The first set of experiments is conducted on clean and noisy speech databases, and the results from objective evaluation suggest that the amplification fittings optimised within the DHASP framework outperform a widely used and well-recognised fitting. The second set of experiments is conducted on a large-scale database with simulated domestic noisy scenes.
The results from both objective and subjective evaluations show that the DHASP-optimised hearing aid processor, incorporating a deep neural network-based denoising module, achieves competitive performance in terms of intelligibility enhancement. A precise intelligibility predictor can provide reliable evaluation results, saving the cost of expensive and time-consuming subjective evaluation. Inspired by findings that automatic speech recognition (ASR) models show recognition results similar to those of humans in some experiments, this work exploits ASR models for intelligibility prediction. An intrusive approach using ASR hidden representations and a non-intrusive approach using ASR uncertainty are proposed and explained in the third and fourth experimental chapters. Experiments are conducted on two databases: one with monaural speech in speech-spectrum-shaped noise presented to normal hearing listeners, and the other with processed binaural speech in domestic noise presented to hearing impaired listeners. Results suggest that both the intrusive and non-intrusive approaches achieve top performance and outperform a number of widely used intelligibility prediction approaches. In conclusion, this thesis covers both the enhancement and the prediction of speech intelligibility for hearing aids. The hearing aid processor optimised within the proposed DHASP framework can significantly improve the intelligibility of speech in noise for hearing impaired listeners. It is also shown that the proposed ASR-based intelligibility prediction approaches achieve state-of-the-art performance compared with a number of widely used intelligibility predictors.
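The optimisation idea in DHASP can be caricatured with a toy example: choose per-band hearing aid gains by gradient ascent on a STOI-like envelope-correlation proxy, under a crude audibility model with elevated per-band hearing thresholds. Everything below, including the audibility model, the proxy, and the finite-difference gradients, is an illustrative stand-in for the differentiable processor and objective actually used in the thesis:

```python
import numpy as np

def audible_env(env_db, gains_db, thresh_db):
    # Toy audibility model: per-band envelopes (in dB) are amplified by the
    # fitting gains; anything below the listener's elevated threshold is lost.
    return np.maximum(env_db + gains_db[:, None] - thresh_db[:, None], 0.0)

def intel_proxy(clean_env, aided_env, eps=1e-8):
    # STOI-like proxy: mean per-band correlation of clean vs. aided envelopes.
    c = clean_env - clean_env.mean(axis=1, keepdims=True)
    a = aided_env - aided_env.mean(axis=1, keepdims=True)
    num = (c * a).sum(axis=1)
    den = np.sqrt((c ** 2).sum(axis=1) * (a ** 2).sum(axis=1)) + eps
    return float((num / den).mean())

def fit_gains(clean_env, noisy_env, thresh_db, steps=150, lr=2.0, dg=0.1):
    # Gradient ascent on the proxy with finite-difference gradients, standing
    # in for backpropagation through a differentiable hearing aid processor.
    g = np.zeros(clean_env.shape[0])
    for _ in range(steps):
        grad = np.empty_like(g)
        for b in range(len(g)):
            gp, gm = g.copy(), g.copy()
            gp[b] += dg
            gm[b] -= dg
            grad[b] = (intel_proxy(clean_env, audible_env(noisy_env, gp, thresh_db))
                       - intel_proxy(clean_env, audible_env(noisy_env, gm, thresh_db))) / (2 * dg)
        g += lr * grad
    return g
```

The point of the sketch is the training signal: the gains are tuned to maximise a listener-facing intelligibility objective directly, rather than to match a prescription, which is the core shift the DHASP framework makes.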