
    On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

    This paper introduces a new method for multi-channel time domain speech separation in reverberant environments. A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings, with no need for conventional spatial feature extraction. To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied to further improve the separation performance. A spatialized version of the wsj0-2mix dataset has been simulated to evaluate the proposed system. Both source separation and speech recognition performance of the separated signals have been evaluated objectively. Experiments show that the proposed fully-convolutional network improves the source separation metric and the word error rate (WER) by more than 13% and 50% relative, respectively, over a reference system with conventional features. Applying dereverberation as pre-processing to the proposed system can further reduce the WER by 29% relative using an acoustic model trained on clean and reverberated data. Comment: Presented at IEEE ICASSP 202
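The abstract above reports its gains as relative changes. As a reading aid, the sketch below shows how a relative WER reduction is conventionally computed. The function name and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error metric such as WER.

    Returns (baseline - improved) / baseline, so 0.29 means "29% relative".
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical WER values (in percent), not taken from the paper.
baseline_wer = 20.0
improved_wer = 14.2
reduction = relative_reduction(baseline_wer, improved_wer)
print(f"relative WER reduction: {reduction:.0%}")
```

A relative reduction is measured against the baseline value, so in this hypothetical case the absolute drop of 5.8 points from a 20.0% baseline is the same thing as a 29% relative reduction.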

    Using deep learning methods for supervised speech enhancement in noisy and reverberant environments

    In real world environments, the speech signals received by our ears are usually a combination of different sounds that include not only the target speech, but also acoustic interference like music, background noise, and competing speakers. This interference has a negative effect on speech perception and degrades the performance of speech processing applications such as automatic speech recognition (ASR), speaker identification, and hearing aid devices. One way to solve this problem is using source separation algorithms to separate the desired speech from the interfering sounds. Many source separation algorithms have been proposed to improve the performance of ASR systems and hearing aid devices, but it is still challenging for these systems to work efficiently in noisy and reverberant environments. On the other hand, humans have a remarkable ability to separate desired sounds and listen to a specific talker among noise and other talkers. Inspired by the capabilities of the human auditory system, a popular method known as auditory scene analysis (ASA) was proposed to separate different sources in a two-stage process of segmentation and grouping. The main goal of source separation in ASA is to estimate time-frequency masks that optimally match and separate noise signals from a mixture of speech and noise. In this work, multiple algorithms are proposed to improve upon source separation in noisy and reverberant acoustic environments. First, a simple and novel algorithm is proposed to increase the discriminability between two sound sources by scaling (magnifying) the head-related transfer function of the interfering source. Experimental results from applications of this algorithm show a significant increase in the quality of the recovered target speech. Second, a time-frequency masking-based source separation algorithm is proposed that can separate a male speaker from a female speaker in reverberant conditions by using the spatial cues of the source signals.
Furthermore, the proposed algorithm has the ability to preserve the location of the sources after separation. Three major aims are proposed for supervised speech separation based on deep neural networks to estimate either the time-frequency masks or the clean speech spectrum. Firstly, a novel monaural acoustic feature set based on a gammatone filterbank is presented to be used as the input of the deep neural network (DNN) based speech separation model, which shows significant improvement in objective speech intelligibility and speech quality in different testing conditions. Secondly, a complementary binaural feature set is proposed to increase the ability of source separation in adverse environments with non-stationary background noise and high reverberation using 2-channel recordings. Experimental results show that the combination of spatial features with this complementary feature set significantly improves the speech intelligibility and speech quality in noisy and reverberant conditions. Thirdly, a novel dilated convolutional neural network is proposed to improve the generalization of the monaural supervised speech enhancement model to different untrained speakers, unseen noises and simulated rooms. This model increases the speech intelligibility and speech quality of the recovered speech significantly, while being computationally more efficient and requiring less memory in comparison to other models. In addition, the proposed model is modified with recurrent layers and dilated causal convolution layers for real-time processing. This model is causal, which makes it suitable for implementation in hearing aid devices and ASR systems, while having fewer trainable parameters and using only information about previous time frames in output prediction.
The main goal of the proposed algorithms is to increase the intelligibility and the quality of the recovered speech from noisy and reverberant environments, which has the potential to improve both speech processing applications and signal processing strategies for hearing aid and cochlear implant technology.
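The abstract above mentions dilated causal convolution layers that use only previous time frames. The sketch below illustrates that general idea on a 1-D sequence in plain Python. It is not the thesis's model: the function names, the zero-padding of the past, and the example kernel are assumptions made for illustration.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """Causal dilated 1-D convolution.

    y[t] = sum_k weights[k] * x[t - k * dilation], with x[t] = 0 for t < 0.
    Only the current and earlier samples contribute to y[t].
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # samples before the start are treated as zero
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by a stack of dilated layers (stride 1)."""
    return 1 + (kernel_size - 1) * sum(dilations)


# An impulse at t = 0 reappears at t = 0, 2, 4, scaled by each kernel tap.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(causal_dilated_conv1d(impulse, [1.0, 2.0, 3.0], dilation=2))

# Doubling the dilation per layer grows the receptive field quickly.
print(receptive_field(kernel_size=3, dilations=[1, 2, 4, 8]))
```

Because each output sample depends only on the current and earlier inputs, a layer of this kind can run frame by frame, which is the property that makes causal models usable in real-time settings such as hearing aids.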

    Deep neural networks for monaural source separation

    PhD Thesis. In monaural source separation (MSS) only one recording is available and the spatial information, generally, cannot be extracted. It is also an underdetermined inverse problem. Recently, the development of the deep neural network (DNN) has provided a framework to address this problem. How to select the types of neural network models and training targets is the research question. Moreover, in real room environments, the reverberations from the floor, walls, ceiling and furniture in a room are challenging, as they distort the received mixture and degrade the separation performance. In many real-world applications, due to the size of the hardware, multiple microphones are not always available. Hence, deep learning based MSS is the focus of this thesis. The first contribution is on improving the separation performance by enhancing the generalization ability of the deep learning-based MSS methods. According to the no free lunch (NFL) theorem, it is impossible to find a neural network model that can estimate the training target perfectly in all cases. From the acquired speech mixture, the information of the clean speech signal could be over- or underestimated. Besides, the discriminative criterion objective function can be used to address the ambiguous information problem in the training stage of deep learning. Based on this, the adaptive discriminative criterion is proposed and better separation performance is obtained. In addition, an alternative method uses sequentially trained neural network models with different training targets to further estimate the clean speech signal. By using different training targets, the generalization ability of the neural network models is improved, and better separation performance is thereby obtained. The second contribution is addressing the MSS problem in reverberant room environments. To achieve this goal, a novel time-frequency (T-F) mask, e.g.
the dereverberation mask (DM), is proposed to estimate the relationship between the reverberant noisy speech mixture and the dereverberated mixture. Then, a separation mask is exploited to extract the desired clean speech signal from the noisy speech mixture. The DM can be integrated with the ideal ratio mask (IRM) to generate the ideal enhanced mask (IEM) to address both dereverberation and separation problems. Based on the DM and the IEM, a two-stage approach is proposed with different system structures. In the final contribution, both the phase information of the clean speech signal and a long short-term memory (LSTM) recurrent neural network (RNN) are introduced. A novel complex signal approximation (SA)-based method is proposed with the complex domain of signals. By utilizing the LSTM RNN as the neural network model, the temporal information is better used, and the desired speech signal can be estimated more accurately. Besides, the phase information of the clean speech signal is applied to mitigate the negative influence of the noisy phase information. The proposed MSS algorithms are evaluated with various challenging datasets such as the TIMIT and IEEE corpora and the NOISEX database. The algorithms are assessed with state-of-the-art techniques and performance measures to confirm that the proposed MSS algorithms provide novel solution
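The abstract above builds on the ideal ratio mask (IRM). The sketch below uses one commonly cited IRM form, the square root of speech energy over speech-plus-noise energy in each time-frequency bin. The thesis's exact IRM, DM and IEM definitions are not given here, so the formula, the function names, and the small `eps` stabilizer are all assumptions for illustration only.

```python
import math


def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-12):
    """One common IRM form: sqrt(S^2 / (S^2 + N^2)) per time-frequency bin.

    speech_mag and noise_mag are equal-shaped lists of rows (frames) holding
    non-negative magnitudes. eps (an assumption) avoids division by zero.
    """
    mask = []
    for s_row, n_row in zip(speech_mag, noise_mag):
        mask.append([
            math.sqrt((s * s) / (s * s + n * n + eps))
            for s, n in zip(s_row, n_row)
        ])
    return mask


def apply_mask(mask, mixture_mag):
    """Scale each mixture bin by its mask value."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_mag)
    ]


# Toy magnitudes for a single frame with two frequency bins.
speech = [[3.0, 0.0]]
noise = [[4.0, 1.0]]
mask = ideal_ratio_mask(speech, noise)
print(mask)
print(apply_mask(mask, [[5.0, 1.0]]))
```

Mask values lie between 0 and 1: bins dominated by speech are passed almost unchanged, while bins dominated by noise are attenuated. In a supervised system a network would be trained to predict such a mask from the mixture alone, since the clean speech and noise are unavailable at test time.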