Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss
Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective for sequence modeling, especially in speech separation. This work investigates how to extend the dual-path BiLSTM into a new state-of-the-art approach, called TasTas, for multi-talker monaural speech separation (a.k.a. the cocktail party problem). TasTas introduces two simple but effective improvements to boost the performance of dual-path BiLSTM-based networks: an iterative multi-stage refinement scheme, and a speaker identity consistency loss between the separated speech and the original speech that corrects imperfectly separated speech. TasTas takes a mixed utterance of two speakers and maps it to two separated utterances, each containing only one speaker's voice. Our experiments on the notable WSJ0-2mix benchmark corpus yield a 20.55 dB SDR improvement, a 20.35 dB SI-SDR improvement, a PESQ of 3.69, and an ESTOI of 94.86%, showing that the proposed networks bring a substantial performance improvement on the speaker separation task. We have open-sourced our re-implementation of DPRNN-TasNet at https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation; our TasTas is built on this implementation of DPRNN-TasNet, so we believe the results in this paper can be reproduced with ease.
Comment: To appear in Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:2001.08998, arXiv:1902.04891, arXiv:1902.0065
Toward the pre-cocktail party problem with TasTas
Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective for sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual} and TasTas \cite{shi2020speech}. In this paper, we propose two improvements to TasTas \cite{shi2020speech} for end-to-end monaural speech separation in pre-cocktail-party problems: 1) generating new training data from the original training batches in real time, and 2) training each module in TasTas separately. The resulting approach, still called TasTas, takes a mixed utterance of five speakers and maps it to five separated utterances, each containing only one speaker's voice. As the objective, we train the network by directly optimizing the utterance-level scale-invariant signal-to-distortion ratio (SI-SDR) in a permutation-invariant training (PIT) style. Our experiments on the public WSJ0-5mix corpus yield an 11.14 dB SDR improvement, showing that the proposed networks bring a performance improvement on the speaker separation task. We have open-sourced our re-implementation of DPRNN-TasNet at https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation; our TasTas is built on this implementation of DPRNN-TasNet, so we believe the results in this paper can be reproduced with ease.
Comment: arXiv admin note: substantial text overlap with arXiv:2001.08998, arXiv:1902.04891, arXiv:1902.00651, arXiv:2008.0314
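The utterance-level PIT objective described above can be sketched as follows; `pit_si_sdr_loss` is an illustrative name, and the exhaustive permutation search is the textbook PIT formulation rather than the authors' exact implementation:

```python
from itertools import permutations

import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SDR in dB (zero-mean, then project onto reference)."""
    est, ref = est - est.mean(), ref - ref.mean()
    target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def pit_si_sdr_loss(estimates, references):
    """Negative mean SI-SDR, minimized over all speaker permutations.

    Each permutation assigns estimate i to reference perm[i]; the best
    assignment defines the utterance-level PIT loss for this mixture.
    """
    best_score, best_perm = -np.inf, None
    for perm in permutations(range(len(references))):
        score = np.mean([si_sdr(estimates[i], references[p])
                         for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return -best_score, best_perm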
Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation
One of the leading single-channel speech separation (SS) models is based on TasNet with a dual-path segmentation technique, where the size of each segment remains unchanged throughout all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass shape, namely Sandglasset, which advances the state-of-the-art (SOTA) SS performance at a significantly smaller model size and computational cost. Moving forward through the blocks of Sandglasset, the temporal granularity of the features gradually becomes coarser until the halfway point of the network, and then successively turns finer again towards the raw signal level. We also find that residual connections between features of the same granularity are critical for preserving information after passing through the bottleneck layer. Experiments show that our Sandglasset, with only 2.3M parameters, achieves the best results on two benchmark SS datasets -- WSJ0-2mix and WSJ0-3mix -- improving the SI-SNRi scores by an absolute 0.8 dB and 2.4 dB, respectively, over the prior SOTA results.
Comment: Accepted in ICASSP 202
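The dual-path segmentation that Sandglasset contrasts itself with chops an encoded sequence into fixed-size, 50%-overlapping segments stacked into a matrix for alternating intra-segment and inter-segment processing. A minimal sketch of that chunking step (the segment length and zero-padding scheme here are illustrative assumptions, not Sandglasset's exact configuration):

```python
import numpy as np

def segment(x: np.ndarray, k: int) -> np.ndarray:
    """Split a 1-D sequence into 50%-overlapping segments of length k.

    Returns a (num_segments, k) matrix; the input is zero-padded at the
    end so that the last segment is full length. Dual-path models apply
    one sequence model along each row (intra-segment) and another along
    each column (inter-segment).
    """
    hop = k // 2
    n_seg = int(np.ceil(max(len(x) - k, 0) / hop)) + 1
    pad = (n_seg - 1) * hop + k - len(x)
    x = np.pad(x, (0, pad))
    return np.stack([x[i * hop : i * hop + k] for i in range(n_seg)])
```

Sandglasset's departure from this scheme is that the effective granularity k changes from block to block instead of staying fixed.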
Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR
Most approaches to multi-talker overlapped speech separation and recognition
assume that the number of simultaneously active speakers is given, but in
realistic situations, it is typically unknown. To cope with this, we extend an
iterative speech extraction system with mechanisms to count the number of
sources and combine it with a single-talker speech recognizer to form the first
end-to-end multi-talker automatic speech recognition system for an unknown
number of active speakers. Our experiments show very promising performance in
counting accuracy, source separation and speech recognition on simulated clean
mixtures from WSJ0-2mix and WSJ0-3mix. Among other results, we set a new state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database.
Comment: 5 pages, INTERSPEECH 202
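The counting mechanism described above can be illustrated with a simple loop: extract one source at a time and stop when little signal energy remains. This is only a sketch of the idea; `extract_one` stands in for a hypothetical trained extraction model, and the energy-threshold stopping rule is an assumption, not the paper's actual criterion:

```python
import numpy as np

def count_and_extract(mixture, extract_one, threshold_db=-20.0, max_speakers=5):
    """Iterative source extraction with an implicit source count.

    Repeatedly calls `extract_one` (a hypothetical one-source extraction
    model) on the residual and subtracts the estimate. Extraction stops
    once the residual energy falls below `threshold_db` relative to the
    mixture, so the number of extracted sources is the speaker count.
    """
    mix_power = np.mean(mixture ** 2)
    residual = mixture.copy()
    sources = []
    for _ in range(max_speakers):
        est = extract_one(residual)
        residual = residual - est
        sources.append(est)
        residual_db = 10 * np.log10(np.mean(residual ** 2) / mix_power + 1e-12)
        if residual_db < threshold_db:
            break
    return sources
```

Because the loop terminates on the residual rather than on a fixed count, the same system can handle mixtures with more speakers than any single training example, which is the generalization behavior reported above.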