Deep neural networks for monaural source separation
PhD Thesis
In monaural source separation (MSS) only one recording is available and the
spatial information, generally, cannot be extracted. It is also an underdetermined inverse problem. Recently, the development of the deep neural network
(DNN) provides the framework to address this problem. How to select the
types of neural network models and training targets is the research question.
Moreover, in real room environments, reverberations from the floor, walls,
ceiling and furniture are challenging because they distort the received
mixture and degrade the separation performance. In many real-world applications, hardware size constraints mean that multiple microphones cannot
always be used. Hence, deep learning based MSS is the focus of this
thesis.
The first contribution improves the separation performance by enhancing the generalization ability of deep learning-based MSS methods.
According to the no free lunch (NFL) theorem, it is impossible to find a neural
network model which can estimate the training target perfectly in all cases.
From the acquired speech mixture, the information of clean speech signal
could be over- or underestimated. Besides, the discriminative criterion objective function can be used to address the ambiguous information problem in
the training stage of deep learning. Based on this, an adaptive discriminative criterion is proposed and better separation performance is obtained. In
addition, an alternative method uses sequentially trained
neural network models with different training targets to further estimate
the clean speech signal. By using different training targets, the generalization ability of the neural network models is improved, and better
separation performance is thereby obtained.
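
As a rough illustration of the discriminative criterion mentioned above, the sketch below uses the commonly cited fixed-weight form: a fit term for each source minus a weighted cross term that penalizes similarity to the other source. This is an assumption made for illustration only. The abstract does not give the thesis's adaptive formulation, and the function name, the `gamma` value, and the toy vectors are hypothetical.

```python
import numpy as np

def discriminative_loss(est1, est2, tgt1, tgt2, gamma=0.05):
    """Fixed-weight discriminative objective (assumed form, not the thesis's
    adaptive variant): reward closeness to the matching target and
    penalize closeness to the interfering target."""
    fit = np.sum((est1 - tgt1) ** 2) + np.sum((est2 - tgt2) ** 2)
    cross = np.sum((est1 - tgt2) ** 2) + np.sum((est2 - tgt1) ** 2)
    return float(fit - gamma * cross)

# Toy spectral frames for two sources (hypothetical values).
tgt1 = np.array([1.0, 0.0, 2.0])
tgt2 = np.array([0.0, 3.0, 1.0])

# Perfect estimates versus estimates assigned to the wrong source.
loss_correct = discriminative_loss(tgt1, tgt2, tgt1, tgt2)
loss_swapped = discriminative_loss(tgt2, tgt1, tgt1, tgt2)
```

With these toy values the correctly assigned estimates yield a lower loss than the swapped ones, which is the behaviour the cross term is meant to encourage.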
The second contribution addresses the MSS problem in reverberant room
environments. To achieve this goal, a novel time-frequency (T-F) mask, namely the
dereverberation mask (DM), is proposed to estimate the relationship between
the reverberant noisy speech mixture and the dereverberated mixture. Then,
a separation mask is exploited to extract the desired clean speech signal from
the noisy speech mixture. The DM can be integrated with the ideal ratio mask
(IRM) to generate the ideal enhanced mask (IEM), which addresses both dereverberation and separation problems. Based on the DM and the IEM, a two-stage
approach is proposed with different system structures.
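
The relationship between the three masks can be sketched on magnitude spectrograms as follows. The IRM uses its standard definition. The DM and IEM forms shown here (a magnitude ratio and an element-wise product) are assumptions made for illustration, since the abstract does not state the exact definitions. All function names and toy values are hypothetical.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero in silent T-F units

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Standard IRM: (|S|^2 / (|S|^2 + |N|^2)) ** beta per T-F unit."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + EPS)) ** beta

def dereverberation_mask(dereverb_mix_mag, reverb_mix_mag):
    """ASSUMED form: dereverberated mixture over reverberant mixture."""
    return dereverb_mix_mag / (reverb_mix_mag + EPS)

def ideal_enhanced_mask(dm, irm):
    """ASSUMED form: element-wise product of the DM and the IRM."""
    return dm * irm

# Toy 2x2 magnitude spectrograms (hypothetical values).
speech_mag = np.array([[3.0, 0.0], [1.0, 2.0]])
noise_mag = np.array([[4.0, 1.0], [1.0, 0.0]])
dereverb_mix = speech_mag + noise_mag  # toy additive mixture
reverb_mix = 1.5 * dereverb_mix        # toy stand-in for reverberation

irm = ideal_ratio_mask(speech_mag, noise_mag)
dm = dereverberation_mask(dereverb_mix, reverb_mix)
iem = ideal_enhanced_mask(dm, irm)

# A single mask applied to the reverberant mixture matches the
# two-step result (dereverberate, then separate) under these assumptions.
enhanced = iem * reverb_mix
```

Under these assumed definitions, applying the IEM once is equivalent to applying the DM and then the IRM, which is why a combined mask can target dereverberation and separation together.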
In the final contribution, both the phase information of the clean speech signal
and the long short-term memory (LSTM) recurrent neural network (RNN) are
introduced. A novel complex signal approximation (SA)-based method is
proposed that operates in the complex domain. By utilizing the LSTM RNN
as the neural network model, the temporal information is better used, and
the desired speech signal can be estimated more accurately. In addition, the
phase information of the clean speech signal is applied to mitigate the negative
influence of the noisy phase information.
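
To illustrate why working in the complex domain helps with noisy phase, the sketch below evaluates a signal approximation loss on complex spectra. It assumes the generic SA form, in which the loss is measured on the masked mixture rather than on the mask itself. The thesis's exact objective and its LSTM model are not reproduced here, and the names and random toy data are hypothetical.

```python
import numpy as np

def complex_sa_loss(mask, mixture_stft, clean_stft):
    """Mean squared error between the masked mixture and the clean
    spectrum, computed on complex values (assumed generic SA form)."""
    estimate = mask * mixture_stft
    return float(np.mean(np.abs(estimate - clean_stft) ** 2))

# Toy complex spectra standing in for STFT frames (hypothetical data).
rng = np.random.default_rng(0)
shape = (4, 5)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = clean + noise

# A complex mask can correct both magnitude and phase.
complex_mask = clean / mixture
# A real magnitude mask corrects magnitude only and keeps the noisy phase.
magnitude_mask = np.abs(clean) / np.abs(mixture)

loss_complex = complex_sa_loss(complex_mask, mixture, clean)
loss_magnitude = complex_sa_loss(magnitude_mask, mixture, clean)
```

Even with an ideal magnitude mask the loss stays above zero because the estimate inherits the mixture phase, whereas the ideal complex mask recovers the clean spectrum up to rounding error.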
The proposed MSS algorithms are evaluated with various challenging
datasets such as the TIMIT and IEEE corpora and the NOISEX database. The
algorithms are compared with state-of-the-art techniques using performance
measures to confirm that the proposed MSS algorithms provide novel solution