14,561 research outputs found
Deep Clustering and Conventional Networks for Music Separation: Stronger Together
Deep clustering is the first method to handle general audio separation
scenarios with multiple sources of the same type and an arbitrary number of
sources, performing impressively in speaker-independent speech separation
tasks. However, little is known about its effectiveness in other challenging
situations such as music source separation. Contrary to conventional networks
that directly estimate the source signals, deep clustering generates an
embedding for each time-frequency bin, and separates sources by clustering the
bins in the embedding space. We show that deep clustering outperforms
conventional networks on a singing voice separation task, in both matched and
mismatched conditions, even though conventional networks have the advantage of
end-to-end training for best signal approximation, presumably because its more
flexible objective engenders better regularization. Since the strengths of deep
clustering and conventional network architectures appear complementary, we
explore combining them in a single hybrid network trained via an approach akin
to multi-task learning. Remarkably, the combination significantly outperforms
either of its components.Comment: Published in ICASSP 201
TasNet: time-domain audio separation network for real-time, single-channel speech separation
Robust speech processing in multi-talker environments requires effective
speech separation. Recent deep learning systems have made significant progress
toward solving this problem, yet it remains challenging particularly in
real-time, short latency applications. Most methods attempt to construct a mask
for each source in time-frequency representation of the mixture signal which is
not necessarily an optimal representation for speech separation. In addition,
time-frequency decomposition results in inherent problems such as
phase/magnitude decoupling and long time window which is required to achieve
sufficient frequency resolution. We propose Time-domain Audio Separation
Network (TasNet) to overcome these limitations. We directly model the signal in
the time-domain using an encoder-decoder framework and perform the source
separation on nonnegative encoder outputs. This method removes the frequency
decomposition step and reduces the separation problem to estimation of source
masks on encoder outputs which is then synthesized by the decoder. Our system
outperforms the current state-of-the-art causal and noncausal speech separation
algorithms, reduces the computational cost of speech separation, and
significantly reduces the minimum required latency of the output. This makes
TasNet suitable for applications where low-power, real-time implementation is
desirable such as in hearable and telecommunication devices.Comment: Camera ready version for ICASSP 2018, Calgary, Canad
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper we propose the utterance-level Permutation Invariant Training
(uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning
based solution for speaker independent multi-talker speech separation.
Specifically, uPIT extends the recently proposed Permutation Invariant Training
(PIT) technique with an utterance-level cost function, hence eliminating the
need for solving an additional permutation problem during inference, which is
otherwise required by frame-level PIT. We achieve this using Recurrent Neural
Networks (RNNs) that, during training, minimize the utterance-level separation
error, hence forcing separated frames belonging to the same speaker to be
aligned to the same output stream. In practice, this allows RNNs, trained with
uPIT, to separate multi-talker mixed speech without any prior knowledge of
signal duration, number of speakers, speaker identity or gender. We evaluated
uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks
and found that uPIT outperforms techniques based on Non-negative Matrix
Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and
compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network
(DANet). Furthermore, we found that models trained with uPIT generalize well to
unseen speakers and languages. Finally, we found that a single model, trained
with uPIT, can handle both two-speaker, and three-speaker speech mixtures
- …