197 research outputs found

    Deep Prior-Based Audio Inpainting Using Multi-Resolution Harmonic Convolutional Neural Networks

    Get PDF
    In this manuscript, we propose a novel method to perform audio inpainting, i.e., the restoration of audio signals presenting multiple missing parts. Audio inpainting can be interpreted in the context of inverse problems as the task of reconstructing an audio signal from its corrupted observation. For this reason, our method is based on a deep prior approach, a recently proposed technique that proved to be effective in the solution of many inverse problems, among which image inpainting. Deep prior allows one to consider the structure of a neural network as an implicit prior and to adopt it as a regularizer. Differently from the classical deep learning paradigm, deep prior performs a single-element training and thus it can be applied to corrupted audio signals independently from the available training data sets. In the context of audio inpainting, a network presenting relevant audio priors will possibly generate a restored version of an audio signal, only provided with its corrupted observation. Our method exploits a time-frequency representation of audio signals and makes use of a multi-resolution convolutional autoencoder, that has been enhanced to perform the harmonic convolution operation. Results show that the proposed technique is able to provide a coherent and meaningful reconstruction of the corrupted audio. It is also able to outperform the methods considered for comparison, in its domain of application

    Audio-Visual Egocentric Action Recognition

    Get PDF

    GAN-based Generation and Automatic Selection of Explanations for Neural Networks

    Get PDF
    One way to interpret trained deep neural networks (DNNs) is by inspecting characteristics that neurons in the model respond to, such as by iteratively optimising the model input (e.g., an image) to maximally activate specific neurons. However, this requires a careful selection of hyper-parameters to generate interpretable examples for each neuron of interest, and current methods rely on a manual, qualitative evaluation of each setting, which is prohibitively slow. We introduce a new metric that uses Fréchet Inception Distance (FID) to encourage similarity between model activations for real and generated data. This provides an efficient way to evaluate a set of generated examples for each setting of hyper-parameters. We also propose a novel GAN-based method for generating explanations that enables an efficient search through the input space and imposes a strong prior favouring realistic outputs. We apply our approach to a classification model trained to predict whether a music audio recording contains singing voice. Our results suggest that this proposed metric successfully selects hyper-parameters leading to interpretable examples, avoiding the need for manual evaluation. Moreover, we see that examples synthesised to maximise or minimise the predicted probability of singing voice presence exhibit vocal or non-vocal characteristics, respectively, suggesting that our approach is able to generate suitable explanations for understanding concepts learned by a neural network

    Deep learning-based music source separation

    Get PDF
    Diese Dissertation befasst sich mit dem Problem der Trennung von Musikquellen durch den Einsatz von deep learning Methoden. Die auf deep learning basierende Trennung von Musikquellen wird unter drei Gesichtspunkten untersucht. Diese Perspektiven sind: die Signalverarbeitung, die neuronale Architektur und die Signaldarstellung. Aus der ersten Perspektive, soll verstanden werden, welche deep learning Modelle, die auf DNNs basieren, für die Aufgabe der Musikquellentrennung lernen, und ob es einen analogen Signalverarbeitungsoperator gibt, der die Funktionalität dieser Modelle charakterisiert. Zu diesem Zweck wird ein neuartiger Algorithmus vorgestellt. Der Algorithmus wird als NCA bezeichnet und destilliert ein optimiertes Trennungsmodell, das aus nicht-linearen Operatoren besteht, in einen einzigen linearen Operator, der leicht zu interpretieren ist. Aus der zweiten Perspektive, soll eine neuronale Netzarchitektur vorgeschlagen werden, die das zuvor erwähnte Konzept der Filterberechnung und -optimierung beinhaltet. Zu diesem Zweck wird die als Masker and Denoiser (MaD) bezeichnete neuronale Netzarchitektur vorgestellt. Die vorgeschlagene Architektur realisiert die Filteroperation unter Verwendung skip-filtering connections Verbindungen. Zusätzlich werden einige Inferenzstrategien und Optimierungsziele vorgeschlagen und diskutiert. Die Leistungsfähigkeit von MaD bei der Musikquellentrennung wird durch eine Reihe von Experimenten bewertet, die sowohl objektive als auch subjektive Bewertungsverfahren umfassen. Abschließend, der Schwerpunkt der dritten Perspektive liegt auf dem Einsatz von DNNs zum Erlernen von solchen Signaldarstellungen, für die Trennung von Musikquellen hilfreich sind. Zu diesem Zweck wird eine neue Methode vorgeschlagen. Die vorgeschlagene Methode verwendet ein neuartiges Umparametrisierungsschema und eine Kombination von Optimierungszielen. Die Umparametrisierung basiert sich auf sinusförmigen Funktionen, die interpretierbare DNN-Darstellungen fördern. Der durchgeführten Experimente deuten an, dass die vorgeschlagene Methode beim Erlernen interpretierbarer Darstellungen effizient eingesetzt werden kann, wobei der Filterprozess noch auf separate Musikquellen angewendet werden kann. Die Ergebnisse der durchgeführten Experimente deuten an, dass die vorgeschlagene Methode beim Erlernen interpretierbarer Darstellungen effizient eingesetzt werden kann, wobei der Filterprozess noch auf separate Musikquellen angewendet werden kann. Darüber hinaus der Einsatz von optimal transport (OT) Entfernungen als Optimierungsziele sind für die Berechnung additiver und klar strukturierter Signaldarstellungen.This thesis addresses the problem of music source separation using deep learning methods. The deep learning-based separation of music sources is examined from three angles. These angles are: the signal processing, the neural architecture, and the signal representation. From the first angle, it is aimed to understand what deep learning models, using deep neural networks (DNNs), learn for the task of music source separation, and if there is an analogous signal processing operator that characterizes the functionality of these models. To do so, a novel algorithm is presented. The algorithm, referred to as the neural couplings algorithm (NCA), distills an optimized separation model consisting of non-linear operators into a single linear operator that is easy to interpret. Using the NCA, it is shown that DNNs learn data-driven filters for singing voice separation, that can be assessed using signal processing. Moreover, by enabling DNNs to learn how to predict filters for source separation, DNNs capture the structure of the target source and learn robust filters. From the second angle, it is aimed to propose a neural network architecture that incorporates the aforementioned concept of filter prediction and optimization. For this purpose, the neural network architecture referred to as the Masker-and-Denoiser (MaD) is presented. The proposed architecture realizes the filtering operation using skip-filtering connections. Additionally, a few inference strategies and optimization objectives are proposed and discussed. The performance of MaD in music source separation is assessed by conducting a series of experiments that include both objective and subjective evaluation processes. Experimental results suggest that the MaD architecture, with some of the studied strategies, is applicable to realistic music recordings, and the MaD architecture has been considered one of the state-of-the-art approaches in the Signal Separation and Evaluation Campaign (SiSEC) 2018. Finally, the focus of the third angle is to employ DNNs for learning signal representations that are helpful for separating music sources. To that end, a new method is proposed using a novel re-parameterization scheme and a combination of optimization objectives. The re-parameterization is based on sinusoidal functions that promote interpretable DNN representations. Results from the conducted experimental procedure suggest that the proposed method can be efficiently employed in learning interpretable representations, where the filtering process can still be applied to separate music sources. Furthermore, the usage of optimal transport (OT) distances as optimization objectives is useful for computing additive and distinctly structured signal representations for various types of music sources

    On the encoding of natural music in computational models and human brains

    Get PDF
    This article discusses recent developments and advances in the neuroscience of music to understand the nature of musical emotion. In particular, it highlights how system identification techniques and computational models of music have advanced our understanding of how the human brain processes the textures and structures of music and how the processed information evokes emotions. Musical models relate physical properties of stimuli to internal representations called features, and predictive models relate features to neural or behavioral responses and test their predictions against independent unseen data. The new frameworks do not require orthogonalized stimuli in controlled experiments to establish reproducible knowledge, which has opened up a new wave of naturalistic neuroscience. The current review focuses on how this trend has transformed the domain of the neuroscience of music
    corecore