33 research outputs found
Toward the pre-cocktail party problem with TasTas
Deep neural networks with dual-path bi-directional long short-term memory
(BiLSTM) blocks have proved very effective in sequence modeling, especially
in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual} and TasTas
\cite{shi2020speech}. In this paper, we propose two improvements to TasTas
\cite{shi2020speech} for end-to-end monaural speech separation in
pre-cocktail party problems: 1) generating new training data from the
original training batch in real time, and 2) training each module in TasTas
separately. The improved approach, still called TasTas, takes the mixed
utterance of five speakers and maps it to five separated utterances, each
containing only one speaker's voice. As the objective, we train the network
by directly optimizing the utterance-level scale-invariant
signal-to-distortion ratio (SI-SDR) in a permutation invariant training (PIT)
style. Our experiments on the public WSJ0-5mix corpus yield an 11.14 dB SDR
improvement, showing that the proposed networks improve performance on the
speaker separation task. We have open-sourced our re-implementation of
DPRNN-TasNet at
https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation;
our TasTas is built on that implementation, so we believe the results in this
paper can be reproduced with ease.
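To make the training objective concrete, here is a minimal PyTorch sketch (not the authors' code) of utterance-level SI-SDR under PIT: the negative SI-SDR is minimized over all speaker permutations, and the best permutation is selected per utterance. Tensor shapes and function names are illustrative assumptions.

```python
from itertools import permutations

import torch


def si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR for batches of single-speaker utterances [N, T]."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled reference.
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))


def pit_si_sdr_loss(estimates: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Negative SI-SDR under the best speaker permutation.

    estimates, targets: [B, C, T] with C speakers (C = 5 for WSJ0-5mix);
    enumerating all C! permutations is affordable for small C (5! = 120).
    """
    num_spk = estimates.shape[1]
    losses = []
    for perm in permutations(range(num_spk)):
        perm_targets = targets[:, list(perm)]                # reorder speakers
        sdr = si_sdr(estimates.flatten(0, 1), perm_targets.flatten(0, 1))
        losses.append(-sdr.view(-1, num_spk).mean(dim=1))    # [B] per permutation
    # Keep, for each utterance, the permutation with the lowest loss.
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```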
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper we propose the utterance-level Permutation Invariant Training
(uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning
based solution for speaker independent multi-talker speech separation.
Specifically, uPIT extends the recently proposed Permutation Invariant Training
(PIT) technique with an utterance-level cost function, hence eliminating the
need for solving an additional permutation problem during inference, which is
otherwise required by frame-level PIT. We achieve this using Recurrent Neural
Networks (RNNs) that, during training, minimize the utterance-level separation
error, hence forcing separated frames belonging to the same speaker to be
aligned to the same output stream. In practice, this allows RNNs, trained with
uPIT, to separate multi-talker mixed speech without any prior knowledge of
signal duration, number of speakers, speaker identity or gender. We evaluated
uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks
and found that uPIT outperforms techniques based on Non-negative Matrix
Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and
compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network
(DANet). Furthermore, we found that models trained with uPIT generalize well to
unseen speakers and languages. Finally, we found that a single model, trained
with uPIT, can handle both two-speaker and three-speaker speech mixtures.
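What distinguishes uPIT from frame-level PIT is that the output-to-speaker permutation is chosen once per utterance, from a cost accumulated over all frames, so each output stream stays locked to one speaker. A minimal sketch of that idea follows, using an MSE cost on magnitude spectrograms as an illustrative stand-in for the paper's exact training objective; shapes are assumptions.

```python
from itertools import permutations

import torch


def upit_mse_loss(est_mags: torch.Tensor, ref_mags: torch.Tensor) -> torch.Tensor:
    """est_mags, ref_mags: [B, C, F, T] magnitude spectrograms for C speakers."""
    num_spk = est_mags.shape[1]
    per_perm = []
    for perm in permutations(range(num_spk)):
        # Accumulate the frame-level error over the whole utterance *before*
        # choosing the permutation -- this is what makes PIT utterance-level.
        err = (est_mags - ref_mags[:, list(perm)]).pow(2).mean(dim=(1, 2, 3))  # [B]
        per_perm.append(err)
    # One permutation per utterance; no per-frame reassignment at inference.
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()
```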
Listening and grouping: an online autoregressive approach for monaural speech separation
This paper proposes an autoregressive approach to harness the power of deep learning for multi-speaker monaural speech separation. It exploits causal temporal context in both the mixture and past estimated separated signals, and performs online separation compatible with real-time applications. The approach adopts a learned listening-and-grouping architecture motivated by computational auditory scene analysis, with a grouping stage that effectively addresses the label permutation problem at both frame and segment levels. Experimental results on the benchmark WSJ0-2mix dataset show that the new approach outperforms the majority of state-of-the-art methods in both closed-set and open-set conditions in terms of signal-to-distortion ratio (SDR) improvement and perceptual evaluation of speech quality (PESQ), even approaches that exploit whole-utterance statistics for separation, while using relatively few model parameters.
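As a rough operational illustration of online separation (a generic sketch, not the paper's listening-and-grouping architecture), a unidirectional RNN can carry past context in its recurrent state and emit separation masks one STFT frame at a time; all layer sizes below are assumptions.

```python
import torch
import torch.nn as nn


class CausalSeparator(nn.Module):
    """Frame-by-frame mask estimator: output at time t depends only on frames <= t."""

    def __init__(self, n_freq: int = 257, hidden: int = 256, num_spk: int = 2):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq * num_spk)
        self.num_spk, self.n_freq = num_spk, n_freq

    def forward(self, frames: torch.Tensor, state=None):
        """frames: [B, T, F] mixture magnitudes; T may be 1 for streaming use."""
        h, state = self.rnn(frames, state)      # recurrent state carries the past
        masks = torch.sigmoid(self.mask(h))     # [B, T, F * num_spk]
        masks = masks.view(frames.shape[0], -1, self.num_spk, self.n_freq)
        est = masks * frames.unsqueeze(2)       # one masked copy per speaker
        return est, state                       # pass state back in for the next frame


# Streaming usage: feed one frame at a time, threading the state through.
sep = CausalSeparator()
state, frames = None, torch.rand(1, 100, 257)
for t in range(frames.shape[1]):
    est_t, state = sep(frames[:, t : t + 1], state)
```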
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural network (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision, where they have achieved
great success in tasks such as machine translation and image generation.
Owing to this success, these data-driven techniques have also been applied in
the audio domain. More specifically, DNN models have been applied to speech
enhancement to achieve denoising, dereverberation, and multi-speaker
separation in the monaural setting. In this paper, we review the dominant DNN
techniques employed for speech separation. The review covers the whole speech
enhancement pipeline, from feature extraction, to how DNN-based tools model
both global and local features of speech, to model training (supervised and
unsupervised). We also review the use of pre-trained speech-enhancement
models to boost the enhancement process. The review focuses on the dominant
trends in applying DNNs to the enhancement of monaural (single-channel)
speech.
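As a concrete anchor for the pipeline the review covers (feature extraction, DNN-based modelling, resynthesis), here is a compact mask-based enhancement sketch; the toy network, STFT settings, and the `enhance` helper are illustrative assumptions, not a specific model from the review.

```python
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
window = torch.hann_window(N_FFT)

# Toy per-frame mask estimator; real systems use recurrent or convolutional nets.
net = nn.Sequential(
    nn.Linear(N_FFT // 2 + 1, 512), nn.ReLU(),
    nn.Linear(512, N_FFT // 2 + 1), nn.Sigmoid(),
)


def enhance(noisy: torch.Tensor) -> torch.Tensor:
    """noisy: [T] waveform -> enhanced [T] waveform."""
    spec = torch.stft(noisy, N_FFT, HOP, window=window, return_complex=True)
    mag = spec.abs()                   # feature extraction: magnitude spectrogram
    mask = net(mag.T).T                # DNN predicts a [0, 1] mask per T-F bin
    enhanced = mask * spec             # apply mask, reuse the noisy phase
    return torch.istft(enhanced, N_FFT, HOP, window=window, length=noisy.shape[-1])
```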