3 research outputs found

Interrupted and cascaded permutation invariant training for speech separation

Authors: Lee Hung-yi, Lee Lin-shan, Mao Yao-Wen, Wu Szu-Lin, Yang Gene-Ping
Publication date: 28/10/2019

Permutation Invariant Training (PIT) has long been a stepping-stone method for handling the label ambiguity problem when training speech separation models. Because PIT dynamically selects the minimum-cost label assignment, very few studies have treated the separation problem as one of optimizing both the model parameters and the label assignments; most have focused instead on searching for good model architectures and parameters. In this paper, we instead investigate, for a given model architecture, various flexible label assignment strategies for training the model, rather than directly using PIT. Surprisingly, we find that a significant performance boost over PIT is possible if the model is trained with fixed label assignments and a good set of labels is chosen. With fixed-label training cascaded between two sections of PIT, we achieve state-of-the-art performance on WSJ0-2mix without changing the model architecture at all.

arXiv.org e-Print Archive
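
As a rough illustration of the contrast drawn in the abstract above, the following sketch compares a PIT-style loss, which picks the minimum-cost assignment of model outputs to reference speakers, with a loss computed under a fixed assignment. It is a minimal sketch and not the authors' implementation: the function names (`mse`, `pit_loss`, `fixed_assignment_loss`), the use of mean squared error as the cost, and the toy signals are all assumptions made for illustration.

```python
import itertools
import numpy as np

def mse(est, ref):
    """Mean squared error between one estimated and one reference signal."""
    return float(np.mean((est - ref) ** 2))

def fixed_assignment_loss(estimates, references, perm):
    """Average cost when output i is always paired with reference perm[i]."""
    n = len(references)
    return sum(mse(estimates[i], references[p]) for i, p in enumerate(perm)) / n

def pit_loss(estimates, references):
    """Return (minimum loss, best permutation) over all label assignments."""
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        loss = fixed_assignment_loss(estimates, references, perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy two-speaker example (hypothetical data): the model emits the
# speakers in the opposite order from the reference labels.
refs = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
ests = np.array([[-0.9, -1.1, -1.0], [1.0, 0.9, 1.1]])

loss, perm = pit_loss(ests, refs)
identity_loss = fixed_assignment_loss(ests, refs, (0, 1))
print("PIT assignment:", perm, "loss:", loss)
print("Loss under fixed identity assignment:", identity_loss)
```

In this toy case the PIT search selects the swapped assignment, while training against the fixed identity assignment would incur a much larger loss. Enumerating all permutations costs factorial time in the number of speakers, which is acceptable for the two-speaker setting sketched here.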

Stabilizing Label Assignment for Speech Separation by Self-supervised Pre-training

Authors: Chen Yi-Chen, Chuang Shun-Po, Huang Sung-Feng, Lee Hung-yi, Liu Da-Rong, Yang Gene-Ping
Publication date: 08/06/2021

Speech separation is well developed, thanks to the very successful permutation invariant training (PIT) approach, although the frequent label assignment switching that occurs during PIT training remains a problem when better convergence speed and achievable performance are desired. In this paper, we propose self-supervised pre-training to stabilize the label assignment when training the speech separation model. Experiments over several types of self-supervised approaches, several typical speech separation models, and two different datasets showed that very good improvements are achievable if a proper self-supervised approach is chosen.
Comment: Interspeech 202

arXiv.org e-Print Archive
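
One way to make the "label assignment switching" mentioned in the abstract above concrete is to record, for each training utterance, the assignment PIT selected in one epoch and compare it with the assignment selected in the next. The sketch below is only an assumed bookkeeping scheme, not the metric used in the paper: the function name `switch_rate`, the dictionary layout, and the utterance IDs are hypothetical.

```python
def switch_rate(prev_assignments, curr_assignments):
    """Fraction of utterances whose selected permutation changed between epochs.

    Both arguments map an utterance ID to the permutation (a tuple) that
    PIT selected for that utterance in a given epoch.
    """
    shared = prev_assignments.keys() & curr_assignments.keys()
    if not shared:
        return 0.0
    switched = sum(1 for utt in shared
                   if prev_assignments[utt] != curr_assignments[utt])
    return switched / len(shared)

# Hypothetical per-utterance assignments from two consecutive epochs.
epoch_1 = {"utt1": (0, 1), "utt2": (1, 0), "utt3": (0, 1), "utt4": (0, 1)}
epoch_2 = {"utt1": (0, 1), "utt2": (0, 1), "utt3": (0, 1), "utt4": (1, 0)}

rate = switch_rate(epoch_1, epoch_2)
print("Switch rate:", rate)
```

A rate that falls toward zero as training proceeds would indicate that the assignments are stabilizing; a persistently high rate would correspond to the instability the abstract describes.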

Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism

Authors: Barker Jon, Doddipatla Rama, Zhang Jisi, Zorila Catalin
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 07/02/2021

In this paper, we present a novel multi-channel speech extraction system that simultaneously extracts multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network that employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To convey speaker information to the extraction model efficiently, we propose a new speaker conditioning mechanism with an additional speaker branch that receives external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline, and increases speech recognition accuracy by more than 16% relative over the same baseline.
Comment: Accepted for ICASSP 202

arXiv.org e-Print Archive