180 research outputs found

    Basic Filters for Convolutional Neural Networks Applied to Music: Training or Design?

    Full text link
    When convolutional neural networks are used to tackle learning problems based on music or, more generally, time series data, raw one-dimensional data are commonly pre-processed to obtain spectrogram or mel-spectrogram coefficients, which are then used as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this pre-processing step on the network's performance and pose the question, whether replacing it by applying adaptive or learned filters directly to the raw data, can improve learning success. The theoretical results show that approximately reproducing mel-spectrogram coefficients by applying adaptive filters and subsequent time-averaging is in principle possible. We also conducted extensive experimental work on the task of singing voice detection in music. The results of these experiments show that for classification based on Convolutional Neural Networks the features obtained from adaptive filter banks followed by time-averaging perform better than the canonical Fourier-transform-based mel-spectrogram coefficients. Alternative adaptive approaches with center frequencies or time-averaging lengths learned from training data perform equally well.Comment: Completely revised version; 21 pages, 4 figure

    Polyphonic Sound Event Detection by using Capsule Neural Networks

    Full text link
    Artificial sound event detection (SED) has the aim to mimic the human ability to perceive and understand what is happening in the surroundings. Nowadays, Deep Learning offers valuable techniques for this goal such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture has been recently introduced in the image processing field with the intent to overcome some of the known limitations of CNNs, specifically regarding the scarce robustness to affine transformations (i.e., perspective, size, orientation) and the detection of overlapped images. This motivated the authors to employ CapsNets to deal with the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through a so-called "dynamic routing" that encourages learning part-whole relationships and improves the detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing how the CapsNet-based algorithm not only outperforms standard CNNs but also allows to achieve the best results with respect to the state of the art algorithms

    Deep Learning for Music Information Retrieval in Limited Data Scenarios.

    Get PDF
    PhD ThesisWhile deep learning (DL) models have achieved impressive results in settings where large amounts of annotated training data are available, over tting often degrades performance when data is more limited. To improve the generalisation of DL models, we investigate \data-driven priors" that exploit additional unlabelled data or labelled data from related tasks. Unlike techniques such as data augmentation, these priors are applicable across a range of machine listening tasks, since their design does not rely on problem-speci c knowledge. We rst consider scenarios in which parts of samples can be missing, aiming to make more datasets available for model training. In an initial study focusing on audio source separation (ASS), we exploit additionally available unlabelled music and solo source recordings by using generative adversarial networks (GANs), resulting in higher separation quality. We then present a fully adversarial framework for learning generative models with missing data. Our discriminator consists of separately trainable components that can be combined to train the generator with the same objective as in the original GAN framework. We apply our framework to image generation, image segmentation and ASS, demonstrating superior performance compared to the original GAN. To improve performance on any given MIR task, we also aim to leverage datasets which are annotated for similar tasks. We use multi-task learning (MTL) to perform singing voice detection and singing voice separation with one model, improving performance on both tasks. Furthermore, we employ meta-learning on a diverse collection of ten MIR tasks to nd a weight initialisation for a \universal MIR model" so that training the model on any MIR task with this initialisation quickly leads to good performance. Since our data-driven priors encode knowledge shared across tasks and datasets, they are suited for high-dimensional, end-to-end models, instead of small models relying on task-speci c feature engineering, such as xed spectrogram representations of audio commonly used in machine listening. To this end, we propose \Wave-U-Net", an adaptation of the U-Net, which can perform ASS directly on the raw waveform while performing favourably to its spectrogrambased counterpart. Finally, we derive \Seq-U-Net" as a causal variant of Wave- U-Net, which performs comparably to Wavenet and Temporal Convolutional Network (TCN) on a variety of sequence modelling tasks, while being more computationally e cient.

    Deep learning-based music source separation

    Get PDF
    Diese Dissertation befasst sich mit dem Problem der Trennung von Musikquellen durch den Einsatz von deep learning Methoden. Die auf deep learning basierende Trennung von Musikquellen wird unter drei Gesichtspunkten untersucht. Diese Perspektiven sind: die Signalverarbeitung, die neuronale Architektur und die Signaldarstellung. Aus der ersten Perspektive, soll verstanden werden, welche deep learning Modelle, die auf DNNs basieren, für die Aufgabe der Musikquellentrennung lernen, und ob es einen analogen Signalverarbeitungsoperator gibt, der die Funktionalität dieser Modelle charakterisiert. Zu diesem Zweck wird ein neuartiger Algorithmus vorgestellt. Der Algorithmus wird als NCA bezeichnet und destilliert ein optimiertes Trennungsmodell, das aus nicht-linearen Operatoren besteht, in einen einzigen linearen Operator, der leicht zu interpretieren ist. Aus der zweiten Perspektive, soll eine neuronale Netzarchitektur vorgeschlagen werden, die das zuvor erwähnte Konzept der Filterberechnung und -optimierung beinhaltet. Zu diesem Zweck wird die als Masker and Denoiser (MaD) bezeichnete neuronale Netzarchitektur vorgestellt. Die vorgeschlagene Architektur realisiert die Filteroperation unter Verwendung skip-filtering connections Verbindungen. Zusätzlich werden einige Inferenzstrategien und Optimierungsziele vorgeschlagen und diskutiert. Die Leistungsfähigkeit von MaD bei der Musikquellentrennung wird durch eine Reihe von Experimenten bewertet, die sowohl objektive als auch subjektive Bewertungsverfahren umfassen. Abschließend, der Schwerpunkt der dritten Perspektive liegt auf dem Einsatz von DNNs zum Erlernen von solchen Signaldarstellungen, für die Trennung von Musikquellen hilfreich sind. Zu diesem Zweck wird eine neue Methode vorgeschlagen. Die vorgeschlagene Methode verwendet ein neuartiges Umparametrisierungsschema und eine Kombination von Optimierungszielen. Die Umparametrisierung basiert sich auf sinusförmigen Funktionen, die interpretierbare DNN-Darstellungen fördern. Der durchgeführten Experimente deuten an, dass die vorgeschlagene Methode beim Erlernen interpretierbarer Darstellungen effizient eingesetzt werden kann, wobei der Filterprozess noch auf separate Musikquellen angewendet werden kann. Die Ergebnisse der durchgeführten Experimente deuten an, dass die vorgeschlagene Methode beim Erlernen interpretierbarer Darstellungen effizient eingesetzt werden kann, wobei der Filterprozess noch auf separate Musikquellen angewendet werden kann. Darüber hinaus der Einsatz von optimal transport (OT) Entfernungen als Optimierungsziele sind für die Berechnung additiver und klar strukturierter Signaldarstellungen.This thesis addresses the problem of music source separation using deep learning methods. The deep learning-based separation of music sources is examined from three angles. These angles are: the signal processing, the neural architecture, and the signal representation. From the first angle, it is aimed to understand what deep learning models, using deep neural networks (DNNs), learn for the task of music source separation, and if there is an analogous signal processing operator that characterizes the functionality of these models. To do so, a novel algorithm is presented. The algorithm, referred to as the neural couplings algorithm (NCA), distills an optimized separation model consisting of non-linear operators into a single linear operator that is easy to interpret. Using the NCA, it is shown that DNNs learn data-driven filters for singing voice separation, that can be assessed using signal processing. Moreover, by enabling DNNs to learn how to predict filters for source separation, DNNs capture the structure of the target source and learn robust filters. From the second angle, it is aimed to propose a neural network architecture that incorporates the aforementioned concept of filter prediction and optimization. For this purpose, the neural network architecture referred to as the Masker-and-Denoiser (MaD) is presented. The proposed architecture realizes the filtering operation using skip-filtering connections. Additionally, a few inference strategies and optimization objectives are proposed and discussed. The performance of MaD in music source separation is assessed by conducting a series of experiments that include both objective and subjective evaluation processes. Experimental results suggest that the MaD architecture, with some of the studied strategies, is applicable to realistic music recordings, and the MaD architecture has been considered one of the state-of-the-art approaches in the Signal Separation and Evaluation Campaign (SiSEC) 2018. Finally, the focus of the third angle is to employ DNNs for learning signal representations that are helpful for separating music sources. To that end, a new method is proposed using a novel re-parameterization scheme and a combination of optimization objectives. The re-parameterization is based on sinusoidal functions that promote interpretable DNN representations. Results from the conducted experimental procedure suggest that the proposed method can be efficiently employed in learning interpretable representations, where the filtering process can still be applied to separate music sources. Furthermore, the usage of optimal transport (OT) distances as optimization objectives is useful for computing additive and distinctly structured signal representations for various types of music sources

    Scattering Transform for Playing Technique Recognition

    Get PDF
    Playing techniques are expressive elements in music performances that carry important information about music expressivity and interpretation. When displaying playing techniques in the time–frequency domain, we observe that each has a distinctive spectro-temporal pattern. Based on the patterns of regularity, we group commonly-used playing techniques into two families: pitch modulation-based techniques (PMTs) and pitch evolution-based techniques (PETs). The former are periodic modulations that elaborate on stable pitches, including vibrato, tremolo, trill, and flutter-tongue; while the latter contain monotonic pitch changes, such as acciaccatura, portamento, and glissando. In this thesis, we present a general framework based on the scattering transform for playing technique recognition. We propose two variants of the scattering transform, the adaptive scattering and the direction-invariant joint scattering. The former provides highly-compact representations that are invariant to pitch transpositions for representing PMTs. The latter captures the spectro-temporal patterns exhibited by PETs. Using the proposed scattering representations as input, our recognition system achieves start-of-the-art results. We provide a formal interpretation of the role of each scattering component confirmed by explanatory visualisations. Whereas previously published datasets for playing technique analysis focused primarily on techniques recorded in isolation, we publicly release a new dataset to evaluate the proposed framework. The dataset, named CBFdataset, is the first dataset on the Chinese bamboo flute (CBF), containing full-length CBF performances and expert annotations of playing techniques. To provide evidence on the generalisability of the proposed framework, we test it over three additional datasets with a variety of playing techniques. Finally, to explore the applicability of the proposed scattering representations to general audio classification problems, we introduce two additional applications: one applies the adaptive scattering for identifying performers in polyphonic orchestral music and the other uses the joint scattering for detecting and classifying chick calls
    corecore