Interrupted and cascaded permutation invariant training for speech separation
Permutation Invariant Training (PIT) has long been a stepping stone method
for training speech separation models to handle the label ambiguity problem.
Because PIT selects the minimum-cost label assignment dynamically, very few
studies have treated the separation problem as optimizing both the model
parameters and the label assignments; most have focused on searching for good
model architectures and parameters. In this paper, we instead investigate, for
a given model architecture, various flexible label assignment strategies for
training the model, rather than using PIT directly. Surprisingly, we find that
a significant performance boost over PIT is possible if the model is
trained with fixed label assignments and a good set of labels is chosen. With
fixed label training cascaded between two sections of PIT, we achieved the
state-of-the-art performance on WSJ0-2mix without changing the model
architecture at all.
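The difference between PIT and fixed label assignment can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the helper names (`mse`, `assignment_loss`, `pit_loss`), the use of mean squared error as the cost, and the toy signals are all assumptions made for the example.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def assignment_loss(estimates, targets, perm):
    """Average cost when estimate i is paired with targets[perm[i]]."""
    total = sum(mse(est, targets[p]) for est, p in zip(estimates, perm))
    return total / len(estimates)


def pit_loss(estimates, targets):
    """PIT: score every permutation and keep the cheapest one."""
    best = min(
        permutations(range(len(targets))),
        key=lambda p: assignment_loss(estimates, targets, p),
    )
    return assignment_loss(estimates, targets, best), best


# Toy two-source example (hypothetical values).
estimates = [[0.9, 1.1, 1.0], [0.1, -0.1, 0.0]]
targets = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

loss, perm = pit_loss(estimates, targets)                   # assignment chosen dynamically
fixed = assignment_loss(estimates, targets, (0, 1))         # assignment fixed in advance
print(perm, loss, fixed)
```

Under PIT the permutation is re-selected every time the loss is computed, whereas fixed label training keeps one stored permutation per utterance; how a "good set of labels" is chosen is not specified in the abstract and is not modeled here.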
Stabilizing Label Assignment for Speech Separation by Self-supervised Pre-training
Speech separation is well developed thanks to the very successful
permutation invariant training (PIT) approach, but the frequent label
assignment switching that occurs during PIT training remains a problem when
faster convergence and better achievable performance are desired. In this paper,
we propose to perform self-supervised pre-training to stabilize the label
assignment in training the speech separation model. Experiments over several
types of self-supervised approaches, several typical speech separation models
and two different datasets showed that very good improvements are achievable if
a proper self-supervised approach is chosen.
Comment: Interspeech 202
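One simple way to quantify label assignment switching is to record the permutation PIT selects for each utterance at every epoch and count how often it changes. The sketch below is an assumption for illustration; the abstract does not state how the authors measure switching, and `switch_rate`, the utterance ids, and the recorded permutations are hypothetical.

```python
def switch_rate(history):
    """Fraction of consecutive-epoch pairs in which the chosen permutation changed.

    `history` maps an utterance id to the list of permutations (tuples)
    selected for that utterance, one entry per epoch.
    """
    changes = 0
    pairs = 0
    for perms in history.values():
        for prev, cur in zip(perms, perms[1:]):
            pairs += 1
            changes += prev != cur
    return changes / pairs if pairs else 0.0


# Hypothetical record of PIT choices over four epochs.
history = {
    "utt_a": [(0, 1), (1, 0), (1, 0), (1, 0)],
    "utt_b": [(0, 1), (0, 1), (1, 0), (0, 1)],
}
rate = switch_rate(history)
print(rate)
```

A rate that falls quickly toward zero during training would indicate stable assignments; a rate that stays high indicates the switching behavior the abstract describes.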
Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism
In this paper, we present a novel multi-channel speech extraction system to
simultaneously extract multiple clean individual sources from a mixture in
noisy and reverberant environments. The proposed method is built on an improved
multi-channel time-domain speech separation network which employs speaker
embeddings to identify and extract multiple targets without label permutation
ambiguity. To convey speaker information to the extraction model
efficiently, we propose a new speaker conditioning mechanism that adds a
speaker branch for receiving external speaker embeddings.
Experiments on 2-channel WHAMR! data show that the proposed system improves
source separation performance by 9% relative over a strong multi-channel
baseline and increases speech recognition accuracy by more than 16%
relative over the same baseline.
Comment: Accepted for ICASSP 202