    Transcription of piano music with deep learning

    Music transcription is the complex process of converting an audio recording into symbolic notation. The goal of this thesis was to examine transcription of piano music with deep learning, for which three deep neural network models were implemented: a multilayer perceptron, a convolutional neural network and a deep belief network. Using the deep belief network, unsupervised pretraining for automatic extraction of musical features from audio signals was also tested. The models were trained and the transcription was evaluated on the MAPS database for piano music transcription. A comparison between the Fast Fourier Transform and the Constant Q Transform for data pre-processing was also carried out. Final results show that deep learning with an appropriate learning schedule is potentially a powerful tool for automatic transcription of music.
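
    As a rough illustration of why the two pre-processing transforms differ, the sketch below compares how their frequency bins are spaced. It is not taken from the thesis: the sample rate, FFT size, starting frequency (A0 = 27.5 Hz) and 12 bins per octave are all assumed values, and the function names are hypothetical.

```python
def fft_bin_frequencies(sample_rate, n_fft):
    """Centre frequencies (Hz) of the non-negative FFT bins: linearly spaced."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]


def cqt_bin_frequencies(f_min, n_bins, bins_per_octave=12):
    """Centre frequencies (Hz) of constant-Q bins: geometrically spaced."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


# Assumed settings, chosen only for illustration.
fft_freqs = fft_bin_frequencies(44100, 4096)
cqt_freqs = cqt_bin_frequencies(27.5, 88)  # one bin per piano key, from A0 upward

# FFT spacing is constant in Hz; CQT spacing is constant in ratio (one semitone here).
print(fft_freqs[2] - fft_freqs[1], fft_freqs[101] - fft_freqs[100])
print(cqt_freqs[1] / cqt_freqs[0], cqt_freqs[61] / cqt_freqs[60])
```

    With these assumed settings each constant-Q bin lines up with one piano key, whereas the FFT spends most of its bins on high frequencies and resolves the lowest octaves only coarsely.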

    Audio-based music classification with a pretrained convolutional network

    Recently, the ‘Million Song Dataset’, containing audio features and metadata for one million songs, was made available. In this paper, we build a convolutional network that is then trained to perform artist recognition, genre recognition and key detection. The network is tailored to summarize the audio features over musically significant timescales. It is infeasible to train the network on all available data in a supervised fashion, so we use unsupervised pretraining to be able to harness the entire dataset: we train a convolutional deep belief network on all data, and then use the learnt parameters to initialize a convolutional multilayer perceptron with the same architecture. The MLP is then trained on a labeled subset of the data for each task. We also train the same MLP with randomly initialized weights. We find that our convolutional approach improves accuracy for the genre recognition and artist recognition tasks. Unsupervised pretraining improves convergence speed in all cases. For artist recognition it improves accuracy as well.

    Multiscale approaches to music audio feature learning

    Content-based music information retrieval tasks are typically solved with a two-stage approach: features are extracted from music audio signals, and are then used as input to a regressor or classifier. These features can be engineered or learned from data. Although the former approach was dominant in the past, feature learning has started to receive more attention from the MIR community in recent years. Recent results in feature learning indicate that simple algorithms such as K-means can be very effective, sometimes surpassing more complicated approaches based on restricted Boltzmann machines, autoencoders or sparse coding. Furthermore, there has been increased interest in multiscale representations of music audio recently. Such representations are more versatile because music audio exhibits structure on multiple timescales, which are relevant for different MIR tasks to varying degrees. We develop and compare three approaches to multiscale audio feature learning using the spherical K-means algorithm. We evaluate them in an automatic tagging task and a similarity metric learning task on the Magnatagatune dataset.
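
    For readers unfamiliar with spherical K-means, a minimal sketch follows. It is not the authors' implementation: it is the textbook form of the algorithm (unit-normalise the data, assign by largest dot product, re-normalise the summed members), written in plain Python with hypothetical function names and caller-supplied initial centroids.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(points, init_centroids, n_iter=10):
    """Cluster points by cosine similarity; returns (centroids, assignments)."""
    data = [normalize(p) for p in points]
    centroids = [normalize(c) for c in init_centroids]
    assignments = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: pick the centroid with the largest dot product.
        assignments = [
            max(range(len(centroids)), key=lambda j: dot(x, centroids[j]))
            for x in data
        ]
        # Update step: sum the members and project back onto the unit sphere.
        for j in range(len(centroids)):
            members = [x for x, a in zip(data, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return centroids, assignments


# Toy data: two groups of 2-D vectors pointing in different directions.
points = [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]]
centroids, assignments = spherical_kmeans(points, [[1.0, 0.0], [0.0, 1.0]])
print(assignments)
```

    Because every vector is projected onto the unit sphere, only direction matters: the two points of different length in each toy group fall into the same cluster.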

    A Deep Representation for Invariance And Music Classification

    Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream; modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification. Comment: 5 pages, CBMM Memo No. 002, (to appear) IEEE 2014 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014)
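
    The idea of a signature built from "empirical distributions of projections on a set of templates and their transformations" can be sketched for the simplest case. The code below is an assumption-laden toy, not the paper's method: the only transformation is a circular shift, the empirical distribution is a fixed-width histogram, and all names and parameters are hypothetical.

```python
def circular_shift(v, k):
    """Rotate a list to the right by k positions."""
    k %= len(v)
    return v[-k:] + v[:-k] if k else list(v)


def signature(x, templates, n_bins=4, lo=-10.0, hi=10.0):
    """Per-template histogram of projections of x onto every shift of the template.

    Assumes x and each template have the same length.
    """
    sig = []
    for t in templates:
        hist = [0] * n_bins
        for k in range(len(t)):
            # Projection of the signal onto one shifted copy of the template.
            p = sum(a * b for a, b in zip(x, circular_shift(t, k)))
            # Clamp to [lo, hi] and map to a histogram bin.
            p = min(max(p, lo), hi)
            idx = min(int((p - lo) / (hi - lo) * n_bins), n_bins - 1)
            hist[idx] += 1
        sig.append([h / len(t) for h in hist])
    return sig


# Integer-valued toy signal and templates, so the projections are exact.
x = [1, 0, 2, 0, -1, 3, 0, 0]
templates = [[1, -1, 0, 0, 2, 0, 0, 1], [0, 3, 0, -2, 0, 0, 1, 0]]

sig_original = signature(x, templates)
sig_shifted = signature(circular_shift(x, 3), templates)
print(sig_original == sig_shifted)
```

    Shifting the signal only permutes the set of projections onto the shifted templates, so the pooled histogram is unchanged; this is the invariance property in its most basic form.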

    Deep Learning and Music Adversaries

    An adversary is essentially an algorithm intent on making a classification system perform in some particular way given an input, e.g., increase the probability of a false negative. Recent work builds adversaries for deep learning systems applied to image object recognition, which exploits the parameters of the system to find the minimal perturbation of the input image such that the network misclassifies it with high confidence. We adapt this approach to construct and deploy an adversary of deep learning systems applied to music content analysis. In our case, however, the input to the systems is magnitude spectral frames, which requires special care in order to produce valid input audio signals from network-derived perturbations. For two different train-test partitionings of two benchmark datasets, and two different deep architectures, we find that this adversary is very effective in defeating the resulting systems. We find the convolutional networks are more robust, however, compared with systems based on a majority vote over individually classified audio frames. Furthermore, we integrate the adversary into the training of new deep systems, but do not find that this improves their resilience against the same adversary.
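
    The notion of a minimal perturbation that flips a classifier's decision can be made concrete in the linear case, where it has a closed form. The sketch below swaps the paper's deep networks for a linear binary classifier, purely as an illustration; the weights, input and overshoot value are made up, and it does not handle the constraint that magnitude spectral frames must stay valid (for example non-negative).

```python
import math


def linear_score(w, b, x):
    """Decision value of a linear classifier; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def minimal_perturbation(w, b, x, overshoot=0.02):
    """Smallest L2 step that moves x just across the boundary w.x + b = 0.

    The step is along w; the small overshoot pushes x past the boundary
    instead of leaving it exactly on it.
    """
    score = linear_score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    scale = -(1.0 + overshoot) * score / norm_sq
    return [scale * wi for wi in w]


# Hypothetical classifier and input.
w, b = [1.0, -2.0, 0.5], -0.5
x = [2.0, 0.5, 1.0]

r = minimal_perturbation(w, b, x)
x_adv = [xi + ri for xi, ri in zip(x, r)]

print(linear_score(w, b, x), linear_score(w, b, x_adv))
print(math.sqrt(sum(ri * ri for ri in r)))
```

    For a deep network no such closed form exists, and methods of this kind typically rely on the network's gradients to approximate the step iteratively.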