Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction
The state of the art in music source separation employs neural networks
trained in a supervised fashion on multi-track databases to estimate the
sources from a given mixture. With only a few datasets available, extensive
data augmentation is often used to combat overfitting. Mixing random tracks, however,
can even reduce separation performance as instruments in real music are
strongly correlated. The key concept in our approach is that source estimates
of an optimal separator should be indistinguishable from real source signals.
Based on this idea, we drive the separator towards outputs deemed realistic
by discriminator networks that are trained to tell real source signals apart
from separator outputs. This way, we can also use unpaired source and mixture recordings
without the drawbacks of creating unrealistic music mixtures. Our framework is
widely applicable as it does not assume a specific network architecture or
number of sources. To our knowledge, this is the first adoption of adversarial
training for music source separation. In a prototype experiment for singing
voice separation, separation performance increases with our approach compared
to purely supervised training.
Comment: 5 pages, 2 figures, 1 table. Final version of manuscript accepted for
2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). Implementation available at
https://github.com/f90/AdversarialAudioSeparatio
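The semi-supervised objective described in this abstract can be illustrated with a minimal sketch. It assumes a standard GAN cross-entropy loss, chosen here only for illustration and not necessarily the objective used in the paper. The function names and the weighting factor `alpha` are hypothetical, and discriminator scores are assumed to be probabilities strictly between 0 and 1 (or exactly 1 for the separator term).

```python
import math


def mse(estimate, target):
    """Mean squared error between two equal-length sequences of samples."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(estimate)


def discriminator_loss(real_scores, fake_scores):
    """Cross-entropy loss for a discriminator (illustrative assumption).

    real_scores: discriminator outputs on real source signals (label 1).
    fake_scores: discriminator outputs on separator estimates (label 0).
    """
    real_term = -sum(math.log(p) for p in real_scores) / len(real_scores)
    fake_term = -sum(math.log(1.0 - p) for p in fake_scores) / len(fake_scores)
    return real_term + fake_term


def separator_loss(estimates, targets, fake_scores, alpha=0.1):
    """Supervised loss on paired data plus an adversarial term on unpaired data.

    estimates/targets: separator outputs and ground-truth sources for paired mixtures.
    fake_scores: discriminator outputs on separator estimates from unpaired mixtures.
    alpha: hypothetical weight balancing the two terms.
    """
    supervised = sum(mse(e, t) for e, t in zip(estimates, targets)) / len(estimates)
    # The separator is rewarded when the discriminator rates its outputs as real.
    adversarial = -sum(math.log(p) for p in fake_scores) / len(fake_scores)
    return supervised + alpha * adversarial
```

In this sketch, the adversarial term needs only mixtures and separate real source recordings, which is how unpaired data can contribute to training.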
CASS: Cross Adversarial Source Separation via Autoencoder
This paper introduces a cross adversarial source separation (CASS) framework
via autoencoder, a new model that aims at separating an input signal consisting
of a mixture of multiple components into individual components defined via
adversarial learning and autoencoder fitting. CASS unifies popular generative
networks like auto-encoders (AEs) and generative adversarial networks (GANs) in
a single framework. The basic building block that filters the input signal and
reconstructs the i-th target component is a pair of deep neural networks, an
encoder for dimension reduction and a decoder for component reconstruction.
The decoder, acting as a generator, is enhanced by a discriminator network
that favors signal structures of the i-th component in the i-th given dataset
as guidance through adversarial learning. In contrast with existing practices
in AEs, which train each autoencoder independently, or in GANs, which share
the same generator, we introduce cross adversarial training that emphasizes
the adversarial relation between arbitrary decoder-discriminator network
pairs, achieving state-of-the-art performance, especially when target
components share similar data structures.
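The difference between independent and cross adversarial training can be sketched by listing which network pairs take part. This is only an illustration: it assumes the pairs consist of one decoder and one discriminator, it uses a hypothetical helper name, and it does not model how each pair contributes to the training loss.

```python
from itertools import product


def network_pairs(num_components, cross=True):
    """Return (decoder_index, discriminator_index) pairs (hypothetical helper).

    cross=False: each decoder is matched only with its own discriminator.
    cross=True: every decoder is paired with every discriminator.
    """
    indices = range(num_components)
    if cross:
        return list(product(indices, repeat=2))
    return [(i, i) for i in indices]
```

Under these assumptions, a mixture of K components involves K matched pairs in independent training and K squared pairs in cross training.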