13 research outputs found
Automatic Chord Estimation Based on a Frame-wise Convolutional Recurrent Neural Network with Non-Aligned Annotations
This paper describes a weakly-supervised approach to the Automatic Chord Estimation (ACE) task, which aims to estimate a sequence of chords from a given music audio signal at the frame level, under the realistic condition that only non-aligned chord annotations are available. In conventional studies assuming the availability of time-aligned chord annotations, Deep Neural Networks (DNNs) that learn frame-wise mappings from acoustic features to chords have attained excellent performance. The major drawback of such frame-wise models is that they cannot be trained without time-alignment information. Inspired by a common approach in automatic speech recognition based on non-aligned speech transcriptions, we propose a two-step method that first trains a Hidden Markov Model (HMM) for the forced alignment between chord annotations and music signals, and then trains a powerful frame-wise DNN model for ACE. Experimental results show that although the frame-level accuracy of the forced alignment was just under 90%, the performance of the proposed method was degraded only slightly from that of the DNN model trained using the ground-truth alignment data. Furthermore, given a sufficient amount of easily collected non-aligned data, the proposed method is able to match or even outperform conventional methods based on ground-truth time-aligned annotations.
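To make the two-step idea concrete, the following is a minimal sketch (an illustrative assumption, not the paper's implementation) of a frame-wise convolutional recurrent chord estimator in PyTorch; once an HMM has force-aligned the annotations, such a model could be trained with per-frame cross-entropy against the aligned labels. The input shape, layer sizes, and 25-class chord vocabulary are assumed.

```python
# Minimal sketch of a frame-wise convolutional recurrent chord estimator.
# Input: log-scaled spectrogram of shape (batch, 1, frames, bins).
# Output: one chord-class logit vector per frame (assumed 25 classes).
import torch
import torch.nn as nn

class FrameWiseCRNN(nn.Module):
    def __init__(self, n_bins=144, n_chords=25, hidden=128):
        super().__init__()
        # Convolutional front end: learns local time-frequency patterns.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Recurrent layer: smooths predictions across frames.
        self.rnn = nn.GRU(32 * n_bins, hidden, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chords)

    def forward(self, x):                        # x: (batch, 1, frames, bins)
        h = self.conv(x)                         # (batch, 32, frames, bins)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.rnn(h)                       # (batch, frames, 2*hidden)
        return self.out(h)                       # per-frame chord logits

# Usage: logits = FrameWiseCRNN()(torch.randn(2, 1, 100, 144))
# Training would use per-frame cross-entropy against (forced-)aligned labels.
```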
MULTI-STEP CHORD SEQUENCE PREDICTION BASED ON AGGREGATED MULTI-SCALE ENCODER-DECODER NETWORKS
This paper studies the prediction of chord progressions for jazz music by relying on machine learning models. The motivation of our study comes from the recent success of neural networks for automatic music composition. Although high accuracies are obtained in single-step prediction scenarios, most models fail to generate accurate multi-step chord predictions. In this paper, we postulate that this stems from the multi-scale structure of musical information and propose new architectures based on an iterative temporal aggregation of input labels. Specifically, the input and ground-truth labels are merged into increasingly large temporal bags, on which we train a family of encoder-decoder networks for each temporal scale. In a second step, we use these pre-trained encoder bottleneck features at each scale to train a final encoder-decoder network. Furthermore, we rely on different reductions of the initial chord alphabet into three adapted chord alphabets. We perform evaluations against several state-of-the-art models and show that our multi-scale architecture outperforms existing methods in terms of accuracy and perplexity, while requiring relatively few parameters. We analyze musical properties of the results, showing the influence of the downbeat position within the analysis window on accuracy, and evaluate errors using a musically-informed distance metric.
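As an illustration of the aggregation step, here is a minimal sketch (assumed bag sizes, labels, and majority-vote merging; not the paper's code) that merges a beat-wise chord sequence into increasingly large non-overlapping temporal bags; one encoder-decoder would then be trained per scale.

```python
# Minimal sketch: aggregate beat-wise chord labels into temporal "bags"
# of increasing size, here by majority vote within each bag (assumption).
from collections import Counter

def aggregate(labels, bag_size):
    """Merge a chord label sequence into non-overlapping bags of `bag_size` beats."""
    bags = []
    for i in range(0, len(labels), bag_size):
        window = labels[i:i + bag_size]
        bags.append(Counter(window).most_common(1)[0][0])
    return bags

seq = ["C:maj", "C:maj", "A:min", "A:min", "F:maj", "G:maj", "G:maj", "C:maj"]
for scale in (1, 2, 4):          # one encoder-decoder per temporal scale
    print(scale, aggregate(seq, scale))
```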
Introducing musical knowledge and qualitative analysis into chord extraction and prediction tasks with machine learning
This thesis investigates the impact of introducing musical properties into machine learning models for the extraction and inference of musical features. Furthermore, it discusses the use of musical knowledge to perform qualitative evaluations of the results. In this work, we focus on musical chords, since these mid-level features are frequently used to describe harmonic progressions in Western music. Hence, among the variety of tasks encountered in the field of Music Information Retrieval (MIR), the two main tasks that we address are Automatic Chord Extraction (ACE) and the inference of symbolic chord sequences. Musical chords exhibit strong inherent hierarchical and functional relationships. Indeed, even if two chords do not belong to the same class, they can share the same harmonic function within a chord progression. Hence, we developed a specifically-tailored analyzer that focuses on the functional relations between chords to distinguish strong and weak errors. We define weak errors as misclassifications that still preserve relevance in terms of harmonic function. This reflects the fact that, in contrast to strict transcription tasks, the extraction of high-level musical features is a rather subjective task. Moreover, many creative applications would benefit from a higher level of harmonic understanding rather than an increased accuracy of label classification. For instance, one of our application cases is the development of software that interacts with a musician in real time by inferring expected chord progressions. In order to achieve this goal, we divided the project into two main tasks: a listening module and a symbolic generation module. The listening module extracts the musical structure played by the musician, whereas the generative module predicts musical sequences based on the extracted features. In the first part of this thesis, we target the development of an ACE system that could emulate the process of musical structure discovery, as performed by musicians in improvisation contexts. Most ACE systems are built on the idea of extracting features from raw audio signals and then using these features to construct a chord classifier. This entails two major families of approaches: rule-based and statistical models. In this work, we identify drawbacks in the use of statistical models for ACE tasks. We then propose to introduce prior musical knowledge in order to account for the inherent relationships between chords directly inside the loss function of the learning methods. In the second part of this thesis, we focus on learning higher-level relationships inside sequences of extracted chords in order to develop models able to generate potential continuations of chord sequences. In order to introduce musical knowledge into these models, we propose new architectures, multi-label training methods, and novel data representations.
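The weak/strong error distinction described above can be sketched as follows, under the assumption of a fixed key and a purely hypothetical functional reduction (illustrative only, not the thesis analyzer):

```python
# Minimal sketch: an error is "weak" when the predicted chord, although
# mislabeled, fulfils the same harmonic function as the reference chord.
# The functional mapping below is a hypothetical reduction for C major.
FUNCTION_OF = {
    "C:maj": "tonic",       "A:min": "tonic",       "E:min": "tonic",
    "F:maj": "subdominant", "D:min": "subdominant",
    "G:maj": "dominant",    "B:dim": "dominant",
}

def error_kind(reference, predicted):
    if reference == predicted:
        return "correct"
    ref_f, pred_f = FUNCTION_OF.get(reference), FUNCTION_OF.get(predicted)
    if ref_f is not None and ref_f == pred_f:
        return "weak"      # same harmonic function, different label
    return "strong"        # functionally different prediction

print(error_kind("C:maj", "A:min"))   # weak: both act as tonic chords
print(error_kind("C:maj", "G:maj"))   # strong: tonic vs. dominant
```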
Automatic chord extraction and musical structure prediction through semi-supervised learning, application to human-computer improvisation
Human-computer co-improvisation aims to rely on a computer in order to produce a musical accompaniment to a musician's improvisation. Recently, the notion of guidance has been introduced to enhance the process of human-computer co-improvisation. Although this concept has already been studied with step-by-step guidance or by guiding with a formal temporal structure, it is usually based only on a past memory of events. This memory is derived from an annotated corpus, which limits the possibility of inferring the potential future improvisation structure. Nevertheless, most improvisations are based on long-term structures or grids. Our study targets these aspects and provides short-term predictions of the musical structures to improve the quality of the computer co-improvisation. Our aim is to develop software that interacts in real time with a musician by inferring expected structures. In order to achieve this goal, we divide the project into two main tasks: a listening module and a symbolic generation module. The listening module extracts the musical structure played by the musician, whereas the generative module predicts musical sequences based on these extractions. In this report, we present a first approach towards this goal by introducing an automatic chord extraction module and a chord label sequence generator. Regarding structure extraction, as the current state-of-the-art results in automatic chord extraction are obtained with Convolutional Neural Networks (CNNs), we first study new architectures derived from CNNs applied to this task. However, as we underline in our study, the small amount of labeled audio data can limit the use of machine learning algorithms. Hence, we also propose the use of Ladder Networks (LNs), which can be trained in a semi-supervised way. This allows us to evaluate the use of unlabeled music data to improve labeled chord extraction. Regarding the chord label generator, many recent works have shown the success of Recurrent Neural Networks (RNNs) for generative temporal applications. Thus, we use a family of recurrent networks, the Long Short-Term Memory (LSTM) unit, for our generative task. We present our implementations and the results of our models, comparing them to the current state of the art, and show that we obtain comparable results on the seminal evaluation datasets. Finally, we introduce the overall architecture of the software linking both modules and propose some directions for future work.
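For the generative side, here is a minimal sketch (assumed vocabulary and layer sizes, not the report's model) of an LSTM chord-label predictor that maps a window of past chord symbols to a distribution over the next one.

```python
# Minimal sketch of an LSTM chord-label predictor (assumed 25-symbol
# chord vocabulary): given past chord ids, predict logits for the next one.
import torch
import torch.nn as nn

class ChordLSTM(nn.Module):
    def __init__(self, n_chords=25, embed=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_chords, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_chords)

    def forward(self, chord_ids):               # (batch, steps) integer labels
        h, _ = self.lstm(self.embed(chord_ids))
        return self.out(h[:, -1])               # logits for the next chord

# Usage: next_logits = ChordLSTM()(torch.randint(0, 25, (4, 8)))
```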
Using musical relationships between chord labels in automatic chord extraction tasks
Recent research on Automatic Chord Extraction (ACE) has focused on the improvement of models based on machine learning. However, most models still fail to take into account the prior knowledge underlying the labeling alphabets (chord labels). Furthermore, recent works have shown that ACE performance has reached a glass ceiling. This prompts the need to focus on other aspects of the task, such as the introduction of musical knowledge in the representation, the improvement of the models towards more complex chord alphabets, and the development of more adapted evaluation methods. In this paper, we propose to exploit specific properties and relationships between chord labels in order to improve the learning of statistical ACE models. Hence, we analyze the interdependence of the representations of chords and their associated distances, the precision of the chord alphabets, and the impact of performing alphabet reduction before or after training the model. Furthermore, we propose new training losses based on music theory. We show that these improve the results of ACE systems based on Convolutional Neural Networks. By analyzing our results, we uncover a set of related insights on ACE tasks based on statistical models, and also formalize the musical meaning of some classification errors.
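One simple way to encode chord relationships in a training loss is sketched below under stated assumptions: the hand-made distance matrix and the expected-cost formulation are illustrative, not necessarily the exact losses proposed in the paper.

```python
# Minimal sketch: replace plain cross-entropy by an expected cost under a
# chord-to-chord distance matrix D, so that confusing musically related
# chords is penalised less than confusing unrelated ones (assumption).
import torch
import torch.nn.functional as F

def music_informed_loss(logits, target, D):
    """logits: (batch, n_chords); target: (batch,) class ids;
    D: (n_chords, n_chords) where D[i, j] is a musical distance."""
    probs = F.softmax(logits, dim=-1)           # predicted chord distribution
    costs = D[target]                           # (batch, n_chords) cost rows
    return (probs * costs).sum(dim=-1).mean()   # expected musical cost

# Example with 3 chord classes and a hand-made distance matrix.
D = torch.tensor([[0.0, 0.3, 1.0],
                  [0.3, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
loss = music_informed_loss(torch.randn(4, 3), torch.tensor([0, 1, 2, 0]), D)
```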