Multichannel Music Separation with Deep Neural Networks
This article addresses the problem of multichannel music separation. We propose a framework where the source spectra are estimated using deep neural networks and combined with spatial covariance matrices that encode the spatial characteristics of the sources. The parameters are estimated in an iterative expectation-maximization fashion and used to derive a multichannel Wiener filter. We evaluate the proposed framework on the task of music separation using a large dataset. Experimental results show that the described method performs consistently well in separating singing voice and other instruments from realistic musical mixtures.
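The final filtering step described above admits a compact closed form: each source's image is recovered by a per-frequency Wiener gain built from its estimated spectrum and spatial covariance. Below is a minimal NumPy sketch of that step, assuming DNN-estimated power spectra `v` and spatial covariances `R` are already available; the function name, array shapes, and the use of a pseudo-inverse are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def multichannel_wiener_filter(mix_stft, v, R):
    """Recover per-source images from a multichannel mixture STFT.

    mix_stft : (F, T, C) complex mixture STFT over C channels.
    v        : (J, F, T) non-negative source power spectra (DNN estimates).
    R        : (J, F, C, C) per-source spatial covariance matrices.
    Returns  : (J, F, T, C) complex source image estimates.
    """
    F, T, C = mix_stft.shape
    J = v.shape[0]
    est = np.zeros((J, F, T, C), dtype=complex)
    for f in range(F):
        for t in range(T):
            # Mixture covariance under the model: sum_j v_j(f,t) R_j(f).
            Cx = sum(v[j, f, t] * R[j, f] for j in range(J))
            Cx_inv = np.linalg.pinv(Cx)  # pseudo-inverse for robustness
            for j in range(J):
                # Wiener gain for source j, applied to the mixture frame.
                W = v[j, f, t] * R[j, f] @ Cx_inv
                est[j, f, t] = W @ mix_stft[f, t]
    return est
```

In the full method, `v` and `R` would be re-estimated in alternation with this filtering step inside the expectation-maximization loop.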
Multi-scale Multi-band DenseNets for Audio Source Separation
This paper deals with the problem of audio source separation. To handle its complex and ill-posed nature, current state-of-the-art approaches employ deep neural networks to obtain instrument spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet), which has shown excellent results on image classification tasks. To address the specific problem of audio source separation, an up-sampling layer, block skip connections, and band-dedicated dense blocks are incorporated on top of DenseNet. The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results in the SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio. Moreover, the proposed architecture requires significantly fewer parameters and considerably less training time than competing methods.

Comment: to appear at WASPAA 2017
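To make the band-dedicated idea concrete, here is a minimal PyTorch sketch of a dense block and a two-band split in which low and high frequencies get their own dense blocks alongside a full-band path. The layer counts, growth rate, and even two-band split are illustrative assumptions, not the published configuration, which also includes multi-scale down/up-sampling and block skip connections.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connectivity: each layer sees all previous feature maps."""
    def __init__(self, in_ch, growth, n_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1)))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class MultiBandDense(nn.Module):
    """Band-dedicated dense blocks on spectrogram halves plus a full-band path."""
    def __init__(self, in_ch=1, growth=8, n_layers=3):
        super().__init__()
        self.low = DenseBlock(in_ch, growth, n_layers)
        self.high = DenseBlock(in_ch, growth, n_layers)
        self.full = DenseBlock(in_ch, growth, n_layers)

    def forward(self, spec):                       # spec: (B, 1, F, T)
        n_freq = spec.shape[2]
        lo = self.low(spec[:, :, : n_freq // 2])   # low-band dense block
        hi = self.high(spec[:, :, n_freq // 2 :])  # high-band dense block
        banded = torch.cat([lo, hi], dim=2)        # re-join along frequency
        return torch.cat([banded, self.full(spec)], dim=1)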
Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction
The state of the art in music source separation employs neural networks trained in a supervised fashion on multi-track databases to estimate the sources from a given mixture. With only a few datasets available, extensive data augmentation is often used to combat overfitting. Mixing random tracks, however, can even reduce separation performance, because instruments in real music are strongly correlated. The key concept in our approach is that the source estimates of an optimal separator should be indistinguishable from real source signals. Based on this idea, we drive the separator towards outputs deemed realistic by discriminator networks that are trained to tell real samples apart from separator outputs. This way, we can also use unpaired source and mixture recordings without the drawbacks of creating unrealistic music mixtures. Our framework is widely applicable, as it does not assume a specific network architecture or number of sources. To our knowledge, this is the first adoption of adversarial training for music source separation. In a prototype experiment on singing voice separation, separation performance increases with our approach compared to purely supervised training.

Comment: 5 pages, 2 figures, 1 table. Final version of manuscript accepted for the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Implementation available at https://github.com/f90/AdversarialAudioSeparation
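The adversarial idea sketched in the abstract can be expressed as an ordinary GAN-style training step: the discriminator learns to tell real source excerpts from separator outputs, while the separator combines a supervised loss on paired data with an adversarial term on unpaired mixtures. The sketch below uses a binary cross-entropy objective and an MSE supervised loss for simplicity; all names, shapes, and loss weights are assumptions, and the paper's exact divergence and architecture may differ.

```python
import torch
import torch.nn.functional as F

def train_step(separator, discriminator, opt_sep, opt_disc,
               paired_mix, paired_src, unpaired_mix, real_src, alpha=0.01):
    """One combined step: supervised loss on paired data plus an adversarial
    term pushing separator outputs toward the real-source distribution."""
    # --- Discriminator: real source excerpts vs. (detached) estimates ---
    fake = separator(unpaired_mix).detach()
    real_logits, fake_logits = discriminator(real_src), discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(
                  real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(
                  fake_logits, torch.zeros_like(fake_logits)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # --- Separator: supervised fit plus fooling the discriminator ---
    sup_loss = F.mse_loss(separator(paired_mix), paired_src)
    adv_logits = discriminator(separator(unpaired_mix))
    adv_loss = F.binary_cross_entropy_with_logits(
        adv_logits, torch.ones_like(adv_logits))
    g_loss = sup_loss + alpha * adv_loss
    opt_sep.zero_grad()
    g_loss.backward()
    opt_sep.step()
    return d_loss.item(), g_loss.item()
```

Because the discriminator term needs no paired targets, the `unpaired_mix` and `real_src` batches can come from entirely separate corpora, which is what allows unpaired source and mixture recordings to be exploited.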
Self-Supervised Music Source Separation Using Vector-Quantized Source Category Estimates
Music source separation is focused on extracting distinct sonic elements from composite tracks. Historically, many methods have been grounded in supervised learning, necessitating labeled data that is sometimes constrained in its diversity. More recent methods have explored N-shot techniques that utilize one or more audio samples to aid the separation. However, a challenge with some of these methods is that they require an audio query during inference, making them less suited to genres with varied timbres and effects. This paper offers a proof-of-concept for a self-supervised music source separation system that eliminates the need for audio queries at inference time. While the training phase adopts a query-based approach, we introduce a modification that substitutes the continuous embeddings of query audio with Vector Quantized (VQ) representations. Trained end-to-end with up to N classes, as determined by the size of the VQ codebook, the model seeks to effectively categorise instrument classes. During inference, the input is partitioned into N sources, some of which may be left unused depending on the instrument makeup of the mix. This methodology suggests an alternative avenue for approaching source separation across diverse music genres. We provide examples and additional results online.

Comment: 4 pages, 2 figures, 1 table; Accepted at the 37th Conference on Neural Information Processing Systems (2023), Machine Learning for Audio Workshop
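As a rough illustration of the training-time modification, here is a minimal PyTorch sketch of a VQ-VAE-style quantizer that snaps a continuous query embedding to one of N codebook entries, with a straight-through estimator to keep training end-to-end. The dimensions, commitment weight, and loss form are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Snap a continuous query embedding to its nearest codebook entry.

    The codebook size N bounds how many instrument categories the separator
    can emit; a straight-through estimator keeps training end-to-end.
    """
    def __init__(self, num_codes=8, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                            # z: (B, dim) query embedding
        # Nearest codebook entry by Euclidean distance.
        dists = torch.cdist(z, self.codebook.weight)  # (B, N)
        idx = dists.argmin(dim=1)                     # source-category index
        q = self.codebook(idx)                        # (B, dim) quantized code
        # Codebook and commitment losses, VQ-VAE style.
        vq_loss = ((q - z.detach()) ** 2).mean() \
                  + self.beta * ((q.detach() - z) ** 2).mean()
        # Straight-through: gradients pass to z as if quantization were identity.
        q = z + (q - z).detach()
        return q, idx, vq_loss
```

At inference time, the codebook entries themselves stand in for audio queries, so the mixture can be routed to up to N source outputs without any reference recording.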