50 research outputs found
Toward the pre-cocktail party problem with TasTas
Deep neural networks with a dual-path bi-directional long short-term memory
(BiLSTM) block have proved to be very effective in sequence modeling,
especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual} and TasTas
\cite{shi2020speech}. In this paper, we propose two improvements to TasTas
\cite{shi2020speech} for end-to-end monaural speech separation in
pre-cocktail party problems: 1) generating new training data
from the original training batch in real time, and 2) training each module in
TasTas separately. The new approach, also called TasTas, takes the mixed
utterance of five speakers and maps it to five separated utterances, where each
utterance contains only one speaker's voice. For the objective, we train the
network by directly optimizing the utterance-level scale-invariant
signal-to-distortion ratio (SI-SDR) in a permutation invariant training (PIT)
style. Our experiments on the public WSJ0-5mix corpus result in an 11.14 dB
SDR improvement, which shows that our proposed networks can improve
performance on the speaker separation task. We have open-sourced our
re-implementation of DPRNN-TasNet at
https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation,
and our TasTas is built on this implementation of DPRNN-TasNet, so the
results in this paper should be easy to reproduce.
Comment: arXiv admin note: substantial text overlap with arXiv:2001.08998,
arXiv:1902.04891, arXiv:1902.00651, arXiv:2008.0314
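The abstract above trains by directly optimizing utterance-level SI-SDR in a PIT style. The following is a minimal numpy sketch of that objective, not taken from the paper's code: `si_sdr` and `pit_si_sdr` are illustrative names, and a real training loop would use a differentiable (e.g. PyTorch) version and factorial-free assignment for many speakers.

```python
import itertools
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR (in dB) between one estimated and one reference utterance."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to obtain the scaled target component.
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

def pit_si_sdr(estimates, references):
    """Permutation invariant training score: best mean SI-SDR over all
    speaker-to-output assignments. Returns (score, best_permutation)."""
    n = len(references)
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n)):
        score = np.mean([si_sdr(estimates[p], references[i])
                         for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

Because the metric is scale-invariant, a rescaled copy of a reference scores near-perfectly, and PIT resolves the label ambiguity of which output corresponds to which speaker; the negated best score would serve as the training loss.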
Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss
Deep neural networks with a dual-path bi-directional long short-term memory
(BiLSTM) block have proved to be very effective in sequence modeling,
especially in speech separation. This work investigates how to extend dual-path
BiLSTM into a new state-of-the-art approach, called TasTas, for
multi-talker monaural speech separation (a.k.a. the cocktail party problem). TasTas
introduces two simple but effective improvements to boost the performance of
dual-path BiLSTM based networks: an iterative multi-stage refinement scheme,
and a speaker-identity-consistency loss between the separated speech and the
original speech that corrects imperfectly separated
speech. TasTas takes the mixed utterance of two speakers and
maps it to two separated utterances, where each utterance contains only one
speaker's voice. Our experiments on the well-known WSJ0-2mix benchmark corpus
result in a 20.55 dB SDR improvement, a 20.35 dB SI-SDR improvement, a PESQ of
3.69, and an ESTOI of 94.86\%, which shows that our proposed networks can yield
large performance improvements on the speaker separation task. We have
open-sourced our re-implementation of DPRNN-TasNet at
https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation,
and our TasTas is built on this implementation of DPRNN-TasNet, so the results
in this paper should be easy to reproduce.
Comment: To appear in Interspeech 2020. arXiv admin note: substantial text
overlap with arXiv:2001.08998, arXiv:1902.04891, arXiv:1902.0065
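The speaker-identity-consistency idea above can be sketched as an auxiliary penalty: one minus the cosine similarity between speaker embeddings of each separated utterance and its clean reference, added to the separation loss. This is an illustration only; `toy_speaker_embedding` below is a crude stand-in (a normalized magnitude spectrum) for the pretrained speaker-identity network that TasTas would actually use, and `weight` is an assumed hyperparameter.

```python
import numpy as np

def toy_speaker_embedding(wav, eps=1e-8):
    """Stand-in for a pretrained speaker encoder: the L2-normalized
    magnitude spectrum of the waveform. TasTas would use a real
    speaker-identity network here; this toy version only illustrates shape."""
    mag = np.abs(np.fft.rfft(wav))
    return mag / (np.linalg.norm(mag) + eps)

def identity_consistency_loss(separated, clean, weight=0.1):
    """Auxiliary penalty encouraging each separated utterance to keep its
    speaker's identity: mean over speakers of (1 - cosine similarity)
    between embeddings of separated and clean speech, scaled by `weight`."""
    sims = [np.dot(toy_speaker_embedding(s), toy_speaker_embedding(c))
            for s, c in zip(separated, clean)]
    return weight * float(np.mean([1.0 - s for s in sims]))
```

With identical signals the penalty vanishes, while a separated output drifting toward the wrong speaker's spectrum is pushed back; in training this term would be added to the negated PIT SI-SDR loss.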
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural network (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision, where they have achieved
great success in tasks such as machine translation and image generation. Owing
to this success, these data-driven techniques have also been applied in the
audio domain. More specifically, DNN models have been applied to speech
enhancement to achieve denoising, dereverberation, and multi-speaker
separation in monaural speech enhancement. In this paper, we review some of
the dominant DNN techniques employed to achieve speech separation. The review
covers the whole speech enhancement pipeline: feature extraction, how
DNN-based tools model both the global and local features of speech, and model
training (supervised and unsupervised). We also review the use of pre-trained
speech enhancement models to boost the enhancement process. The review is
geared towards covering the dominant trends in the application of DNNs to the
enhancement of speech obtained from a single speaker.
Comment: conferenc