Toward the pre-cocktail party problem with TasTas
Deep neural networks with dual-path bi-directional long short-term memory
(BiLSTM) blocks have proved very effective in sequence modeling,
especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual} and TasTas
\cite{shi2020speech}. In this paper, we propose two improvements to TasTas
\cite{shi2020speech} for end-to-end monaural speech separation in
pre-cocktail party problems: 1) generating new training data
from the original training batch in real time, and 2) training each module in
TasTas separately. The new approach is called TasTas; it takes the mixed
utterance of five speakers and maps it to five separated utterances, each
containing only one speaker's voice. As the objective, we train the
network by directly optimizing the utterance-level scale-invariant
signal-to-distortion ratio (SI-SDR) in a permutation invariant training (PIT)
style.
style. Our experiments on the public WSJ0-5mix corpus yield an 11.14 dB
SDR improvement, which shows that the proposed network improves performance
on the speaker separation task. We have open-sourced our
re-implementation of DPRNN-TasNet at
https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation,
and our TasTas is built on this implementation, so we
believe the results in this paper can be reproduced with ease.
Comment: arXiv admin note: substantial text overlap with arXiv:2001.08998,
arXiv:1902.04891, arXiv:1902.00651, arXiv:2008.0314
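
The SI-SDR objective with permutation invariant training mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `si_sdr` and `pit_si_sdr`, the `eps` constant, and the use of plain Python lists are assumptions made for illustration. The score is computed for every assignment of estimated to reference speakers and the best assignment is kept; a training loss would typically be the negative of that score.

```python
import itertools
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length signals."""
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target; the residual counts as distortion.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]
    residual = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def pit_si_sdr(estimates, targets):
    """Return (best mean SI-SDR, best permutation) over all speaker orderings."""
    best_score, best_perm = -math.inf, None
    n = len(targets)
    for perm in itertools.permutations(range(n)):
        # perm[i] is the reference speaker assigned to estimate i.
        score = sum(si_sdr(estimates[i], targets[p]) for i, p in enumerate(perm)) / n
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

With five speakers the exhaustive search visits 5! = 120 permutations per utterance, which is still cheap compared with a forward pass of the network.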
An empirical study of Conv-TasNet
Conv-TasNet is a recently proposed waveform-based deep neural network that
achieves state-of-the-art performance in speech source separation. Its
architecture consists of a learnable encoder/decoder and a separator that
operates on top of this learned space. Various improvements have been proposed
to Conv-TasNet. However, they mostly focus on the separator, leaving its
encoder/decoder as a (shallow) linear operator. In this paper, we conduct an
empirical study of Conv-TasNet and propose an enhancement to the
encoder/decoder that is based on a (deep) non-linear variant of it. In
addition, we experiment with the larger and more diverse LibriTTS dataset and
investigate the generalization capabilities of the studied models when trained
on a much larger dataset. We propose cross-dataset evaluation that includes
assessing separations from the WSJ0-2mix, LibriTTS and VCTK databases. Our
results show that enhancements to the encoder/decoder can improve average
SI-SNR performance by more than 1 dB. Furthermore, we offer insights into the
generalization capabilities of Conv-TasNet and the potential value of
improvements to the encoder/decoder.
Comment: In proceedings of ICASSP202
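
The encoder/separator/decoder structure described in this abstract can be sketched as follows. This is an illustrative toy, not the Conv-TasNet code: the names `encode`, `apply_mask` and `decode` are hypothetical, the basis filters would be learned in the real model (and the decoder has its own filters), and the masks that a separator network would predict are supplied by hand here. The sketch only shows the shallow linear case: frames are projected onto a basis, each source gets a mask in that space, and the decoder maps the masked codes back to a waveform by overlap-add.

```python
def encode(waveform, basis, stride):
    """Linear encoder: project each frame of the waveform onto every basis filter."""
    length = len(basis[0])
    starts = range(0, len(waveform) - length + 1, stride)
    frames = [waveform[s:s + length] for s in starts]
    return [[sum(w * x for w, x in zip(filt, frame)) for filt in basis]
            for frame in frames]


def apply_mask(codes, mask):
    """Separator output stand-in: element-wise mask for one source."""
    return [[c * m for c, m in zip(code, row)] for code, row in zip(codes, mask)]


def decode(codes, basis, stride, n_samples):
    """Linear decoder: weighted sum of filters per frame, then overlap-add."""
    length = len(basis[0])
    out = [0.0] * n_samples
    for k, code in enumerate(codes):
        for coeff, filt in zip(code, basis):
            for j in range(length):
                out[k * stride + j] += coeff * filt[j]
    return out
```

A deep non-linear encoder/decoder of the kind the abstract proposes would replace the single projection in `encode` and `decode` with a stack of layers and non-linearities, while the masking step stays the same.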