Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training
In this paper we propose to use utterance-level Permutation Invariant
Training (uPIT) for speaker independent multi-talker speech separation and
denoising, simultaneously. Specifically, we train deep bi-directional Long
Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for
single-channel speaker independent multi-talker speech separation in multiple
noisy conditions, including both synthetic and real-life noise signals. We
focus our experiments on generalizability and noise robustness of models that
rely on various types of a priori knowledge, e.g., in terms of noise type and
number of simultaneous speakers. We show that deep bi-directional LSTM RNNs
trained using uPIT in noisy environments can improve the Signal-to-Distortion
Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility
(ESTOI) measure, on the speaker independent multi-talker speech separation and
denoising task, for various noise types and Signal-to-Noise Ratios (SNRs).
Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI
improvements, when evaluated using known noise types, and that a single model
is capable of handling multiple noise types with only a slight decrease in
performance. Furthermore, we show that a single LSTM RNN can handle both
two-speaker and three-speaker noisy mixtures, without a priori knowledge about
the exact number of speakers. Finally, we show that LSTM RNNs trained using
uPIT generalize well to noise types not seen during training.
Comment: To appear in MLSP 201
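The abstract names uPIT without spelling out the training criterion. As a minimal sketch of the general idea, assuming a plain mean squared error on magnitude spectra (a simplification; the paper's actual targets and masks are not given here), the loss can be computed for every assignment of network outputs to target speakers over the whole utterance, keeping the smallest. The function name `upit_mse` and the array layout `(speakers, frames, bins)` are illustrative assumptions, not the authors' implementation.

```python
from itertools import permutations

import numpy as np


def upit_mse(estimates, targets):
    """Utterance-level permutation invariant MSE (illustrative sketch).

    estimates, targets: arrays of shape (speakers, frames, bins).
    Returns the smallest utterance-level MSE over all output-to-speaker
    assignments, together with the permutation that achieves it.
    """
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    num_speakers = estimates.shape[0]

    best_loss, best_perm = np.inf, None
    for perm in permutations(range(num_speakers)):
        # Output i is compared with target perm[i] for the entire utterance,
        # so the assignment cannot change from frame to frame.
        loss = np.mean((estimates - targets[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.random((2, 5, 3))
    # Outputs that contain the right speakers in swapped order.
    estimates = targets[::-1]
    print(upit_mse(estimates, targets))
```

Because the search covers all S! permutations, this brute-force form is only practical for a small number of speakers, such as the two- and three-speaker mixtures mentioned above.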
Single-Microphone Speech Enhancement and Separation Using Deep Learning
The cocktail party problem comprises the challenging task of understanding a
speech signal in a complex acoustic environment, where multiple speakers and
background noise signals simultaneously interfere with the speech signal of
interest. A signal processing algorithm that can effectively increase the
speech intelligibility and quality of speech signals in such complicated
acoustic situations is highly desirable, especially for applications involving
mobile communication devices and hearing assistive devices. Due to the
re-emergence of machine learning techniques, today known as deep learning, the
challenges involved with such algorithms might be overcome. In this PhD thesis,
we study and develop deep learning-based techniques for two sub-disciplines of
the cocktail party problem: single-microphone speech enhancement and
single-microphone multi-talker speech separation. Specifically, we conduct
an in-depth empirical analysis of the generalization capability of modern deep
learning-based single-microphone speech enhancement algorithms. We show that
performance of such algorithms is closely linked to the training data, and good
generalizability can be achieved with carefully designed training data.
Furthermore, we propose uPIT, a deep learning-based algorithm for
single-microphone speech separation and we report state-of-the-art results on a
speaker-independent multi-talker speech separation task. Additionally, we show
that uPIT works well for joint speech separation and enhancement without
explicit prior knowledge about the noise type or number of speakers. Finally,
we show that deep learning-based speech enhancement algorithms designed to
minimize the classical short-time spectral amplitude mean squared error lead
to enhanced speech signals which are essentially optimal in terms of STOI, a
state-of-the-art speech intelligibility estimator.
Comment: PhD Thesis. 233 page
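For readers unfamiliar with the short-time spectral amplitude mean squared error mentioned in the abstract, the following is a minimal sketch of how such a criterion might be computed between a clean and an enhanced waveform. The helper names `stft_magnitude` and `stsa_mse`, the Hann window, and the frame length and hop size are all assumptions chosen for illustration; the thesis's actual analysis settings are not stated here.

```python
import numpy as np


def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude of a simple Hann-windowed short-time Fourier transform."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    # rfft keeps only the non-negative frequency bins of each real frame.
    return np.abs(np.fft.rfft(frames, axis=1))


def stsa_mse(clean, enhanced, frame_len=256, hop=128):
    """Mean squared error between short-time spectral amplitudes."""
    clean_mag = stft_magnitude(clean, frame_len, hop)
    enhanced_mag = stft_magnitude(enhanced, frame_len, hop)
    return float(np.mean((clean_mag - enhanced_mag) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(1024)
    noisy = clean + 0.1 * rng.standard_normal(1024)
    print(stsa_mse(clean, noisy))
```

Since only magnitudes are compared, the criterion ignores phase: a sign-flipped copy of the clean signal yields an error of essentially zero.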