49,430 research outputs found
Deep Learning for Music Information Retrieval in Limited Data Scenarios.
PhD ThesisWhile deep learning (DL) models have achieved impressive results in settings
where large amounts of annotated training data are available, over tting often
degrades performance when data is more limited. To improve the generalisation
of DL models, we investigate \data-driven priors" that exploit additional unlabelled
data or labelled data from related tasks. Unlike techniques such as data
augmentation, these priors are applicable across a range of machine listening
tasks, since their design does not rely on problem-speci c knowledge.
We rst consider scenarios in which parts of samples can be missing, aiming to
make more datasets available for model training. In an initial study focusing on
audio source separation (ASS), we exploit additionally available unlabelled music
and solo source recordings by using generative adversarial networks (GANs),
resulting in higher separation quality. We then present a fully adversarial
framework for learning generative models with missing data. Our discriminator
consists of separately trainable components that can be combined to train the
generator with the same objective as in the original GAN framework. We apply
our framework to image generation, image segmentation and ASS, demonstrating
superior performance compared to the original GAN.
To improve performance on any given MIR task, we also aim to leverage
datasets which are annotated for similar tasks. We use multi-task learning (MTL)
to perform singing voice detection and singing voice separation with one model,
improving performance on both tasks. Furthermore, we employ meta-learning
on a diverse collection of ten MIR tasks to nd a weight initialisation for a
\universal MIR model" so that training the model on any MIR task with this
initialisation quickly leads to good performance.
Since our data-driven priors encode knowledge shared across tasks and
datasets, they are suited for high-dimensional, end-to-end models, instead of small
models relying on task-speci c feature engineering, such as xed spectrogram
representations of audio commonly used in machine listening. To this end, we
propose \Wave-U-Net", an adaptation of the U-Net, which can perform ASS
directly on the raw waveform while performing favourably to its spectrogrambased
counterpart. Finally, we derive \Seq-U-Net" as a causal variant of Wave-
U-Net, which performs comparably to Wavenet and Temporal Convolutional
Network (TCN) on a variety of sequence modelling tasks, while being more
computationally e cient.
Recommended from our members
Musical source separation with deep learning and large-scale datasets
Throughout this thesis we will explore automatic music source separation by utilizing modern (at the time of writing) techniques and tools from machine learning and big data processing. The bulk of this work was carried out between 2016 and 2019.
In Chapter 2 we conduct a review of source separation literature. We start by outlining a subset of applications of source separation in some depth. We describe some of the early, pioneering work in automatic source separation: Auditory Scene Analysis, and its digital counterpart, Computational Auditory Scene Analysis.
We then introduce matrix decomposition-based methods such as Independent Component Analysis and Non-Negative Matrix factorization, and pitch informed methods where the separation algorithm is guided by pitch information that is known a priori. We brie y discuss user-guided methods, before conducting a thorough review of Deep Learning based source separation, including recurrent, convolutional, deep clustering-based, and Generative Adversarial Networks.
We then proceed to describe common evaluation metrics
and training datasets. Finally, we list a number of current challenges and drawbacks of current systems.
Chapter 3 focuses on datasets for musical source separation. First we show the growth of dataset sizes for both machine learning in general and music information retrieval specifically. We give several examples of the complexities and idiosyncrasies that are intrinsic to music datasets. We then proceed to present a method for extracting ground truth data for source separation from large unstructured musical catalogs.
In Chapter 4 we design a novel deep learning-based source separation algorithm. Motivation is provided by means of a musicological study1 that showed the high importance of vocals relative to other musical factors, in the minds of listeners. At the core of the vocal separation algorithm is the U-Net, a deep learning architecture that uses skip connections to preserve fine-grained detail. It was originally developed in the biomedical imaging domain, and later adapted to image-to-image translation. We adapt it to the source separation domain by treating spectrograms as images, and we use the dataset mining methods from Chapter 3 to generate sufficiently large training data. We evaluate our model objectively using standard evaluation metrics, subjectively using \crowdsourced" human subjects. To the best of our knowledge, this is the first use of U-Nets for source separation.
In the introduction above we proposed joint learning to optimize source separation and other objectives. In Chapter 5 we investigate one such instance: multi-task learning of vocal removal and vocal pitch tracking. We combine the vocal separation model from Chapter 4 with a state of the art pitch salience estimation model2, exploring several ways of combining the two models. We find that vocal pitch estimation benefits from joint learning when the two tasks are trained in sequence, with the source separation model preceding the pitch estimation model. We also report benefits from fine-tuning by iteratively applying the model.
Chapter 6 extends the U-Net model to multiple instruments. In order to minimize the phase artifacts that were a common issue in Chapter 4, we modify the model to operate in the complex domain. We run experiments with several loss functions: Time-domain loss, magnitude-only frequency domain loss, and joint time and frequency-domain loss. Our experiments are evaluated both objectively and subjectively, and we carry out extensive qualitative analysis to investigate the effects of complex masking.
Finally, we conclude the thesis in Chapter 7 by summarizing this work and highlighting several future directions of research
- …