Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper we propose the utterance-level Permutation Invariant Training
(uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning
based solution for speaker independent multi-talker speech separation.
Specifically, uPIT extends the recently proposed Permutation Invariant Training
(PIT) technique with an utterance-level cost function, hence eliminating the
need for solving an additional permutation problem during inference, which is
otherwise required by frame-level PIT. We achieve this using Recurrent Neural
Networks (RNNs) that, during training, minimize the utterance-level separation
error, hence forcing separated frames belonging to the same speaker to be
aligned to the same output stream. In practice, this allows RNNs, trained with
uPIT, to separate multi-talker mixed speech without any prior knowledge of
signal duration, number of speakers, speaker identity or gender. We evaluated
uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks
and found that uPIT outperforms techniques based on Non-negative Matrix
Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and
compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network
(DANet). Furthermore, we found that models trained with uPIT generalize well to
unseen speakers and languages. Finally, we found that a single model, trained
with uPIT, can handle both two-speaker and three-speaker speech mixtures.
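
The utterance-level criterion described in this abstract can be illustrated with a small sketch. It is a simplification, not the authors' implementation: it assumes the loss is a plain mean squared error computed directly on feature frames (the abstract does not give the exact cost function), and the names `pair_error` and `upit_loss` are hypothetical. The point it shows is that one speaker-to-output assignment is chosen for the whole utterance rather than for each frame.

```python
from itertools import permutations


def pair_error(est, ref):
    """Squared error between one output stream and one reference,
    summed over every frame of the utterance."""
    return sum((e - r) ** 2
               for est_frame, ref_frame in zip(est, ref)
               for e, r in zip(est_frame, ref_frame))


def upit_loss(estimates, references):
    """Return (loss, best_perm) for one utterance.

    estimates, references: lists of streams; each stream is a list of
    frames; each frame is a list of floats (assumed layout).
    best_perm[i] is the reference index assigned to output stream i.
    """
    n_src = len(references)
    n_elems = n_src * sum(len(frame) for frame in references[0])
    # pairwise[i][j]: utterance-level error of output i against reference j
    pairwise = [[pair_error(est, ref) for ref in references]
                for est in estimates]
    best_perm, best_total = None, float("inf")
    # One permutation is selected for the whole utterance, not per frame.
    for perm in permutations(range(n_src)):
        total = sum(pairwise[i][perm[i]] for i in range(n_src))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_total / n_elems, best_perm


# Toy example: two speakers, two frames, two feature bins each.
references = [[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]]
# The outputs are (almost) the references in swapped order.
estimates = [[[0.0, 1.0], [0.1, 1.0]],
             [[1.0, 0.0], [1.0, 0.0]]]
loss, perm = upit_loss(estimates, references)
```

Because the assignment is fixed across all frames, each output stream follows a single speaker, which is why no extra permutation step is needed at inference time. The exhaustive search grows factorially with the number of speakers, which is manageable for the two- and three-talker cases mentioned above.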
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders
Supervised multi-channel audio source separation requires extracting useful
spectral, temporal, and spatial features from the mixed signals. The success of
many existing systems is therefore largely dependent on the choice of features
used for training. In this work, we introduce a novel multi-channel,
multi-resolution convolutional auto-encoder neural network that works on raw
time-domain signals to determine appropriate multi-resolution features for
separating the singing voice from stereo music. Our experimental results show
that the proposed method can achieve multi-channel audio source separation
without the need for hand-crafted features or any pre- or post-processing.
Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation
In deep neural networks with convolutional layers, each layer typically has a
fixed-size, single-resolution receptive field (RF). Convolutional layers with a
large RF capture global information from the input features, while layers with
small RF size capture local details with high resolution from the input
features. In this work, we introduce novel deep multi-resolution fully
convolutional neural networks (MR-FCNN), where each layer has different RF
sizes to extract multi-resolution features that capture both global and local
detail from its input features. The proposed MR-FCNN is applied to
separate a target audio source from a mixture of many audio sources.
Experimental results show that using MR-FCNN improves the performance compared
to feedforward deep neural networks (DNNs) and single-resolution deep fully
convolutional neural networks (FCNNs) on the audio source separation problem.
Comment: arXiv admin note: text overlap with arXiv:1703.0801
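
The idea of giving one layer several receptive-field sizes can be sketched as follows. This is an illustration under stated assumptions, not the MR-FCNN architecture itself: it uses a 1-D input, fixed averaging kernels instead of learned filters, zero padding, and a ReLU nonlinearity, and the function names `conv1d_same` and `multi_resolution_layer` are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1-D correlation that keeps the input length.
    Assumes the kernel length is odd."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def multi_resolution_layer(signal, kernels):
    """Apply kernels of different lengths to the same input and return
    one ReLU-activated output channel per kernel."""
    return [[max(0.0, v) for v in conv1d_same(signal, kernel)]
            for kernel in kernels]


# An impulse seen through receptive fields of size 1, 3 and 5.
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
kernels = [[1.0], [1 / 3] * 3, [0.2] * 5]
features = multi_resolution_layer(signal, kernels)
```

The short kernel preserves the exact position of the impulse, while the longer kernels spread it over a wider context, so the stacked channels carry both local and global views of the same input.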