235 research outputs found
Deep Clustering and Conventional Networks for Music Separation: Stronger Together
Deep clustering is the first method to handle general audio separation
scenarios with multiple sources of the same type and an arbitrary number of
sources, performing impressively in speaker-independent speech separation
tasks. However, little is known about its effectiveness in other challenging
situations such as music source separation. Contrary to conventional networks
that directly estimate the source signals, deep clustering generates an
embedding for each time-frequency bin, and separates sources by clustering the
bins in the embedding space. We show that deep clustering outperforms
conventional networks on a singing voice separation task, in both matched and
mismatched conditions, even though conventional networks have the advantage of
end-to-end training for best signal approximation, presumably because its more
flexible objective engenders better regularization. Since the strengths of deep
clustering and conventional network architectures appear complementary, we
explore combining them in a single hybrid network trained via an approach akin
to multi-task learning. Remarkably, the combination significantly outperforms
either of its components.Comment: Published in ICASSP 201
Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask
Singing voice separation based on deep learning relies on the usage of
time-frequency masking. In many cases the masking process is not a learnable
function or is not encapsulated into the deep learning optimization.
Consequently, most of the existing methods rely on a post processing step using
the generalized Wiener filtering. This work proposes a method that learns and
optimizes (during training) a source-dependent mask and does not need the
aforementioned post processing step. We introduce a recurrent inference
algorithm, a sparse transformation step to improve the mask generation process,
and a learned denoising filter. Obtained results show an increase of 0.49 dB
for the signal to distortion ratio and 0.30 dB for the signal to interference
ratio, compared to previous state-of-the-art approaches for monaural singing
voice separation
A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation
The objective of deep learning methods based on encoder-decoder architectures
for music source separation is to approximate either ideal time-frequency masks
or spectral representations of the target music source(s). The spectral
representations are then used to derive time-frequency masks. In this work we
introduce a method to directly learn time-frequency masks from an observed
mixture magnitude spectrum. We employ recurrent neural networks and train them
using prior knowledge only for the magnitude spectrum of the target source. To
assess the performance of the proposed method, we focus on the task of singing
voice separation. The results from an objective evaluation show that our
proposed method provides comparable results to deep learning based methods
which operate over complicated signal representations. Compared to previous
methods that approximate time-frequency masks, our method has increased
performance of signal to distortion ratio by an average of 3.8 dB
Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction
The state of the art in music source separation employs neural networks
trained in a supervised fashion on multi-track databases to estimate the
sources from a given mixture. With only few datasets available, often extensive
data augmentation is used to combat overfitting. Mixing random tracks, however,
can even reduce separation performance as instruments in real music are
strongly correlated. The key concept in our approach is that source estimates
of an optimal separator should be indistinguishable from real source signals.
Based on this idea, we drive the separator towards outputs deemed as realistic
by discriminator networks that are trained to tell apart real from separator
samples. This way, we can also use unpaired source and mixture recordings
without the drawbacks of creating unrealistic music mixtures. Our framework is
widely applicable as it does not assume a specific network architecture or
number of sources. To our knowledge, this is the first adoption of adversarial
training for music source separation. In a prototype experiment for singing
voice separation, separation performance increases with our approach compared
to purely supervised training.Comment: 5 pages, 2 figures, 1 table. Final version of manuscript accepted for
2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). Implementation available at
https://github.com/f90/AdversarialAudioSeparatio
- …