
    A Comparative Study of Neural Models for Polyphonic Music Sequence Transduction

    Automatic transcription of polyphonic music remains a challenging task in the field of Music Information Retrieval. One under-investigated point is the post-processing of time-pitch posteriograms into binary piano rolls. In this study, we investigate this task using a variety of neural network models and training procedures. We introduce an adversarial framework, which we compare against more traditional training losses. We also propose the use of binary neuron outputs and compare them to the usual real-valued outputs in both training frameworks. This allows us to train networks directly with the F-measure as the training objective. We evaluate these methods using two kinds of transduction networks and two different multi-pitch detection systems, and compare the results against baseline note-tracking methods on a dataset of classical piano music. Analysis of the results indicates that (1) convolutional models improve results over baseline models, but no improvement is reported for recurrent models; (2) supervised losses are superior to adversarial ones; (3) binary neurons do not improve results; (4) the cross-entropy loss yields performance better than or equal to the F-measure loss.
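    Training directly on the F-measure, as mentioned above, requires a differentiable surrogate. Below is a minimal sketch of a relaxed "soft" F-measure loss, assuming PyTorch; the tensor shapes and the name soft_f_measure_loss are illustrative assumptions, not the authors' code.

```python
# Sketch of a differentiable ("soft") F-measure loss over a time-pitch
# posteriogram; shapes and names are illustrative assumptions.
import torch

def soft_f_measure_loss(probs: torch.Tensor, target: torch.Tensor,
                        eps: float = 1e-7) -> torch.Tensor:
    """probs, target: (batch, time, pitch); probs in [0, 1], target binary."""
    tp = (probs * target).sum()          # soft true positives
    fp = (probs * (1 - target)).sum()    # soft false positives
    fn = ((1 - probs) * target).sum()    # soft false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return 1 - f1                        # minimizing 1 - F maximizes F
```

    With real-valued sigmoid outputs this surrogate is directly differentiable; a binary-neuron variant would additionally need a gradient estimator (e.g. straight-through) to backpropagate through the hard threshold.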

    Classical Music Prediction and Composition by Means of Variational Autoencoders

    This paper proposes a new model for music prediction based on Variational Autoencoders (VAEs). In this work, VAEs are used in a novel way to address two different issues: representing music in the latent space, and using this representation to predict the future note events of a musical piece. The approach was trained on various pieces by Handel. As a result, the system can represent the music in the latent space and make accurate predictions. The system can therefore be used to compose new music, either from an existing piece or from a random starting point. An additional feature of this system is that a small dataset was used for training; nevertheless, results show that the system is able to return accurate representations and predictions on unseen data. This work is supported by Instituto de Salud Carlos III, grant number PI17/01826 (Collaborative Project in Genomic Data Integration CICLOGEN), funded by the Instituto de Salud Carlos III under the Spanish National Plan for Scientific and Technical Research and Innovation 2013-2016 and the European Regional Development Funds (FEDER). This project was also supported by the General Directorate of Culture, Education and University Management of Xunta de Galicia (ED431D 2017/16), the Drug Discovery Galician Network (Ref. ED431G/01), the Galician Network for Colorectal Cancer Research (Ref. ED431D 2017/23), and by the Spanish Ministry of Economy and Competitiveness through the funding of the unique installation BIOCAI (UNLC08-1E-002, UNLC13-13-3503) and the European Regional Development Funds (FEDER) of the European Union. This work was also funded by the grant for the consolidation and structuring of competitive research units (ED431C 2018/49) from the General Directorate of Culture, Education and University Management of Xunta de Galicia, and the CYTED network (PCI2018_093284) funded by the Spanish Ministry of Innovation and Science. The experiments were carried out using the equipment of the Galician Supercomputing Center (CESGA).
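    As a sketch of how a VAE can serve both roles the abstract describes (latent representation and prediction), consider the following minimal PyTorch model; the piano-roll input encoding, layer sizes and latent dimension are assumptions for illustration, not the paper's architecture.

```python
# Minimal VAE sketch for note-event windows; sizes are assumptions.
import torch
import torch.nn as nn

class NoteVAE(nn.Module):
    def __init__(self, input_dim=88 * 16, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())  # piano-roll probabilities

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decoder(z), mu, logvar

# Prediction/composition use: encode the current window to z, then decode z
# (optionally sampled or nudged nearby) as an estimate of the next window.
```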

    Modeling High-Dimensional Audio Sequences with Recurrent Neural Networks

    This thesis studies models of high-dimensional sequences based on recurrent neural networks (RNNs) and their application to music and speech. While in principle RNNs can represent the long-term dependencies and complex temporal dynamics present in real-world sequences such as video, audio and natural language, they have not been used to their full potential since their introduction by Rumelhart et al. (1986a) due to the difficulty of training them efficiently by gradient-based optimization. In recent years, the successful application of Hessian-free optimization and other advanced training techniques has motivated a resurgence of their use in many state-of-the-art systems. The work of this thesis is part of this development. The main idea is to exploit the power of RNNs to learn a probabilistic description of sequences of symbols, i.e. high-level information associated with observed signals, which in turn can be used as a prior to improve the accuracy of information retrieval. For example, by modeling the evolution of note patterns in polyphonic music, chords in a harmonic progression, phones in a spoken utterance, or individual sources in an audio mixture, we can significantly improve the accuracy of polyphonic transcription, chord recognition, speech recognition and audio source separation, respectively. The practical application of our models to these tasks is detailed in the last four articles presented in this thesis. In the first article, we replace the output layer of an RNN with conditional restricted Boltzmann machines to describe much richer multimodal output distributions. In the second article, we review and develop advanced techniques to train RNNs. In the last four articles, we explore various ways to combine our symbolic models with deep networks and non-negative matrix factorization algorithms, namely using products of experts, input/output architectures, and generative frameworks that generalize hidden Markov models. We also propose and analyze efficient inference procedures for those models, such as greedy chronological search, high-dimensional beam search, dynamic programming-like pruned beam search and gradient descent. Finally, we explore issues such as label bias, teacher forcing, temporal smoothing, regularization and pre-training.
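    Of the inference procedures listed, high-dimensional beam search can be sketched compactly: at each frame the K most promising partial transcriptions are kept, where each hypothesis extends the history with a full binary note vector rather than a single label. The helpers candidates_per_frame and model_logprob below are hypothetical stand-ins for the thesis's candidate enumeration and sequence models.

```python
# Hypothetical sketch of beam search over high-dimensional (multi-note)
# outputs; the model and candidate generator are assumed stand-ins.
import heapq

def beam_search(frames, candidates_per_frame, model_logprob, beam_width=8):
    """frames: acoustic observations; candidates_per_frame(frame) returns a
    short list of binary note vectors worth considering at that frame."""
    beam = [(0.0, [])]  # (cumulative log-probability, note-vector history)
    for frame in frames:
        scored = []
        for score, history in beam:
            for notes in candidates_per_frame(frame):
                s = score + model_logprob(history, notes, frame)
                scored.append((s, history + [notes]))
        beam = heapq.nlargest(beam_width, scored, key=lambda t: t[0])
    return max(beam, key=lambda t: t[0])[1]  # best note-vector sequence
```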

    Statistical Piano Reduction Controlling Performance Difficulty

    We present a statistical-modelling method for piano reduction, i.e. converting an ensemble score into a piano score, that can control performance difficulty. While previous studies have focused on describing the conditions for playable piano scores, playability depends on the player's skill and can change continuously with the tempo. We thus computationally quantify performance difficulty as well as musical fidelity to the original score, and formulate the problem as optimization of musical fidelity under constraints on difficulty values. First, performance difficulty measures are developed by means of probabilistic generative models for piano scores, and their relation to the rate of performance errors is studied. Second, to describe musical fidelity, we construct a probabilistic model integrating a prior piano-score model and a model representing how ensemble scores are likely to be edited. An iterative optimization algorithm for piano reduction is developed based on statistical inference of the model. We confirm the effect of the iterative procedure; we find that subjective difficulty and musical fidelity monotonically increase with controlled difficulty values; and we show that incorporating sequential dependence of pitches and fingering motion in the piano-score model improves the quality of reduction scores in high-difficulty cases. Comment: 12 pages, 7 figures, version accepted to APSIPA Transactions on Signal and Information Processing.
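    The paper solves the reduction by statistical inference on its probabilistic models; as a rough illustration of the underlying idea, maximizing fidelity under a difficulty budget, here is a hypothetical greedy sketch. All callables (fidelity, difficulty, candidate_edits) are assumed stand-ins, not the paper's method.

```python
# Hypothetical greedy sketch: starting from the full ensemble score, apply
# the candidate edit (e.g. dropping or reassigning notes) that sacrifices
# the least fidelity, until the estimated difficulty meets the budget.
# Not the paper's iterative statistical-inference algorithm.
def reduce_score(score, target_difficulty, fidelity, difficulty, candidate_edits):
    current = score
    while difficulty(current) > target_difficulty:
        options = candidate_edits(current)      # candidate simplified scores
        if not options:
            break                               # no further simplification
        current = max(options, key=fidelity)    # least fidelity loss
    return current
```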

    Modeling Pitch Perception With an Active Auditory Model Extended by Octopus Cells

    Pitch is an essential category for musical sensations, and models of pitch perception are still actively debated. Most of them rely on mathematical methods defined in the spectral or temporal domain. Our proposed pitch perception model is composed of an active auditory model extended by octopus cells. The active auditory model is the same as that used in Stimulation based on Auditory Modeling (SAM), a successful cochlear-implant sound-processing strategy; it is extended here by modeling the functional behavior of the octopus cells in the ventral cochlear nucleus and their connections to the auditory nerve fibers (ANFs). The neurophysiological parameterization of the extended model is fully described in the time domain. The model is based on latency-phase encoding and decoding, as octopus cells are latency-phase rectifiers in their local receptive fields. Pitch is ubiquitously represented by cascaded firing sweeps of octopus cells. From the firing patterns of octopus cells, inter-spike interval histograms can be aggregated, in which the place of the global maximum is assumed to encode the pitch.
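    The final decoding step, reading pitch off the global maximum of an inter-spike interval histogram, can be sketched as follows with NumPy. Spike generation by the auditory model and octopus cells is outside the sketch; the bin width and interval range are assumed parameters.

```python
# Sketch of pitch decoding from spike times via an inter-spike interval
# histogram; parameters are assumptions, not the paper's values.
import numpy as np

def pitch_from_spikes(spike_times_s, bin_width_s=1e-4, max_interval_s=0.02):
    """Estimate pitch (Hz) as the reciprocal of the most frequent
    inter-spike interval (the histogram's global maximum)."""
    intervals = np.diff(np.sort(spike_times_s))
    intervals = intervals[(intervals > 0) & (intervals <= max_interval_s)]
    if intervals.size == 0:
        return None                      # no usable intervals, no estimate
    bins = np.arange(0.0, max_interval_s + bin_width_s, bin_width_s)
    hist, edges = np.histogram(intervals, bins=bins)
    peak = int(np.argmax(hist))
    best_interval = 0.5 * (edges[peak] + edges[peak + 1])  # bin centre
    return 1.0 / best_interval
```

    For spikes phase-locked to a 200 Hz tone, intervals cluster near 5 ms, so the histogram peak sits around 0.005 s and the estimate returns roughly 200 Hz.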

    Automatic characterization and generation of music loops and instrument samples for electronic music production

    Repurposing audio material to create new music - also known as sampling - was a foundation of electronic music and remains a fundamental component of this practice. Currently, large-scale audio databases offer vast collections of material for users to work with. Navigation of these databases is heavily focused on hierarchical tree directories; consequently, sound retrieval is tiresome and often identified as an undesired interruption of the creative process. We address two fundamental methods for navigating sounds: characterization and generation. Characterizing loops and one-shots in terms of instruments or instrumentation allows for organizing unstructured collections and faster retrieval for music-making. The generation of loops and one-shot sounds enables the creation of new sounds not present in an audio collection through interpolation or modification of the existing material. To achieve this, we employ deep-learning-based, data-driven methodologies for classification and generation.
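    For the characterization side, a minimal sketch of a deep-learning instrument classifier over mel-spectrograms follows (assuming PyTorch and librosa); the class count, layer sizes and feature settings are illustrative assumptions, not the thesis's models.

```python
# Sketch of a small CNN that labels a loop or one-shot's mel-spectrogram
# with an instrument class; all sizes are illustrative assumptions.
import librosa
import numpy as np
import torch
import torch.nn as nn

class InstrumentCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))     # global pooling handles any length
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                # x: (batch, 1, mels, frames)
        return self.head(self.features(x).flatten(1))

def mel_input(path, sr=22050, n_mels=64):
    """Load an audio file and return a (1, 1, mels, frames) tensor."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    m = librosa.power_to_db(m, ref=np.max)
    return torch.from_numpy(m).float()[None, None]
```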