33 research outputs found
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective - What musical content is to be generated? Examples are: melody,
polyphony, accompaniment or counterpoint. - For what destination and for what
use? To be performed by humans (in the case of a musical score), or by a
machine (in the case of an audio file).
Representation - What are the concepts to be manipulated? Examples are:
waveform, spectrogram, note, chord, meter and beat. - What format is to be
used? Examples are: MIDI, piano roll or text. - How will the representation be
encoded? Examples are: scalar, one-hot or many-hot.
Architecture - What type(s) of deep neural network is (are) to be used?
Examples are: feedforward network, recurrent network, autoencoder or generative
adversarial networks.
Challenge - What are the limitations and open challenges? Examples are:
variability, interactivity and creativity.
Strategy - How do we model and control the process of generation? Examples
are: single-step feedforward, iterative feedforward, sampling or input
manipulation.
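To make the Representation and Strategy dimensions above concrete, here is a minimal sketch, assuming a 128-pitch one-hot encoding and a purely hypothetical `next_note_distribution` stub standing in for a trained network, of how an iterative feedforward strategy extends a melody one sampled note at a time.

```python
import numpy as np

PITCHES = 128                      # MIDI pitch range used for the one-hot encoding
rng = np.random.default_rng(0)

def one_hot(pitch: int) -> np.ndarray:
    """Encode a single MIDI pitch as a one-hot vector (Representation dimension)."""
    v = np.zeros(PITCHES)
    v[pitch] = 1.0
    return v

def next_note_distribution(context: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained model: a distribution over the next pitch."""
    logits = rng.normal(size=PITCHES) + 5.0 * context[-1]   # bias toward repeating the last note
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Iterative feedforward generation (Strategy dimension): sample one note at a time,
# feeding each new note back into the context.
melody = [60, 62, 64]                                  # seed: C4, D4, E4
context = np.stack([one_hot(p) for p in melody])
for _ in range(8):
    probs = next_note_distribution(context)
    pitch = int(rng.choice(PITCHES, p=probs))
    melody.append(pitch)
    context = np.vstack([context, one_hot(pitch)])

print(melody)
```

A real system would replace the stub with a recurrent or feedforward network, and could swap stochastic sampling for greedy decoding or beam search depending on how much variability is desired.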
For each dimension, we conduct a comparative analysis of various models and
techniques, and we propose a tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning-based
systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.
Comment: 209 pages. This paper is a simplified version of the book: J.-P. Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music Generation, Computational Synthesis and Creative Systems, Springer, 201
A Study on Improving Conditional Generation of Musical Components: Focusing on Harmony and Expression
Ph.D. thesis -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science (Digital Information Convergence), February 2023. Advisor: Kyogu Lee.
Conditional generation of musical components (CGMC) generates part of a piece of music based on partial musical components such as a melody or chords. CGMC is useful for discovering complex relationships among musical attributes, and it can assist non-experts who find it difficult to make music. However, recent CGMC studies still face two challenges concerning generation quality and model controllability. First, the structure of the generated music is not robust. Second, only limited ranges of musical factors and tasks have been examined as targets for flexible control of generation. In this thesis, we aim to mitigate these two challenges and thereby improve CGMC systems. For musical structure, we focus on intuitive modeling of musical hierarchy so that the model explicitly learns musically meaningful dependencies; to this end, we utilize alignment paths between the raw music data and musical units such as notes or chords. For musical creativity, we enable smooth control of novel musical attributes through latent representations, seeking disentangled representations of the intended factors by regularizing them with data-driven inductive bias. The proposed approaches are verified on two representative CGMC tasks, melody harmonization and expressive performance rendering. A variety of experimental results show that the proposed approaches can expand musical creativity while maintaining stable generation quality.
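The abstract does not spell out how the latent representations are regularized; as one illustrative possibility (not necessarily the thesis's formulation), the sketch below applies an attribute-ordering regularizer in the style of Pati and Lerch's AR-VAE, which pushes a chosen latent dimension to follow the ordering of a musical attribute such as chord complexity. All dimensions and names here are assumptions.

```python
import torch

def attribute_regularizer(z: torch.Tensor, attr: torch.Tensor, dim: int, delta: float = 1.0) -> torch.Tensor:
    """Encourage latent dimension `dim` to preserve the ordering of a musical attribute
    (e.g., chord complexity), so that moving along that axis changes the attribute smoothly.
    Mirrors the AR-VAE attribute-ordering loss; illustrative, not the thesis's exact loss."""
    z_d = z[:, dim]
    dz = z_d.unsqueeze(0) - z_d.unsqueeze(1)          # pairwise differences in the latent dim
    da = attr.unsqueeze(0) - attr.unsqueeze(1)        # pairwise differences in the attribute
    # Latent differences should carry the same sign as the attribute differences.
    return torch.nn.functional.l1_loss(torch.tanh(delta * dz), torch.sign(da))

# Toy usage: a batch of 16 latent codes (8-dim) with one scalar attribute per item.
z = torch.randn(16, 8, requires_grad=True)
attr = torch.rand(16)
loss = attribute_regularizer(z, attr, dim=0)
loss.backward()
```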
Chapter 1 Introduction 1
1.1 Motivation 5
1.2 Definitions 8
1.3 Tasks of Interest 10
1.3.1 Generation Quality 10
1.3.2 Controllability 12
1.4 Approaches 13
1.4.1 Modeling Musical Hierarchy 14
1.4.2 Regularizing Latent Representations 16
1.4.3 Target Tasks 18
1.5 Outline of the Thesis 19
Chapter 2 Background 22
2.1 Music Generation Tasks 23
2.1.1 Melody Harmonization 23
2.1.2 Expressive Performance Rendering 25
2.2 Structure-enhanced Music Generation 27
2.2.1 Hierarchical Music Generation 27
2.2.2 Transformer-based Music Generation 28
2.3 Disentanglement Learning 29
2.3.1 Unsupervised Approaches 30
2.3.2 Supervised Approaches 30
2.3.3 Self-supervised Approaches 31
2.4 Controllable Music Generation 32
2.4.1 Score Generation 32
2.4.2 Performance Rendering 33
2.5 Summary 34
Chapter 3 Translating Melody to Chord: Structured and Flexible Harmonization of Melody with Transformer 36
3.1 Introduction 36
3.2 Proposed Methods 41
3.2.1 Standard Transformer Model (STHarm) 41
3.2.2 Variational Transformer Model (VTHarm) 44
3.2.3 Regularized Variational Transformer Model (rVTHarm) 46
3.2.4 Training Objectives 47
3.3 Experimental Settings 48
3.3.1 Datasets 49
3.3.2 Comparative Methods 50
3.3.3 Training 50
3.3.4 Metrics 51
3.4 Evaluation 56
3.4.1 Chord Coherence and Diversity 57
3.4.2 Harmonic Similarity to Human 59
3.4.3 Controlling Chord Complexity 60
3.4.4 Subjective Evaluation 62
3.4.5 Qualitative Results 67
3.4.6 Ablation Study 73
3.5 Conclusion and Future Work 74
Chapter 4 Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-supervised Learning 76
4.1 Introduction 76
4.2 Proposed Methods 79
4.2.1 Data Representation 79
4.2.2 Modeling Musical Hierarchy 80
4.2.3 Overall Network Architecture 81
4.2.4 Regularizing the Latent Variables 84
4.2.5 Overall Objective 86
4.3 Experimental Settings 87
4.3.1 Dataset and Implementation 87
4.3.2 Comparative Methods 88
4.4 Evaluation 88
4.4.1 Generation Quality 89
4.4.2 Disentangling Latent Representations 90
4.4.3 Controllability of Expressive Attributes 91
4.4.4 KL Divergence 93
4.4.5 Ablation Study 94
4.4.6 Subjective Evaluation 95
4.4.7 Qualitative Examples 97
4.4.8 Extent of Control 100
4.5 Conclusion 102
Chapter 5 Conclusion and Future Work 103
5.1 Conclusion 103
5.2 Future Work 106
5.2.1 Deeper Investigation of Controllable Factors 106
5.2.2 More Analysis of Qualitative Evaluation Results 107
5.2.3 Improving Diversity and Scale of Dataset 108
Bibliography 109
Abstract (in Korean) 137
Structural Melody Generation Based on Attribute-Controllable Deep Neural Networks
Ph.D. thesis -- Seoul National University, College of Engineering, Department of Industrial Engineering, August 2021. Advisor: Jonghun Park.
This thesis aims to generate structural melodies using attribute-controllable deep neural networks. The development of music-composing artificial intelligence can inspire professional composers, reduce the difficulty of creating music, and help the public combine and use music with various media content.
For a melody generation model to function as a composer, it must be able to control specific desired characteristics. These characteristics include quantifiable attributes, such as pitch level and rhythm density, as well as chords, which, together with melodies, are essential elements of modern popular (pop) music.
First, this thesis introduces a melody generation model that separately produces rhythm and pitch conditioned on chord progressions. The quantitative evaluation results demonstrate that the melodies produced by the proposed model have a distribution more similar to the dataset than other baseline models. Qualitative analysis reveals the presence of repetition and variation within the generated melodies. Using a subjective human listening test, we conclude that the model successfully produced new melodies that sound pleasant in rhythm and pitch.
Four quantifiable attributes are considered: pitch level, pitch variety, rhythm density, and rhythm variety. We improve on a previous approach that trains a variational autoencoder (VAE) and a discriminator in an adversarial manner to eliminate attribute information from the encoded latent variable. Rhythm and pitch VAEs are trained separately so that pitch- and rhythm-related attributes can be controlled entirely independently. The experimental results indicate that, although the ratio of outputs falling into the intended attribute bin is not high, the model learns the relative order between the bins.
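A minimal sketch of the adversarial scheme described above, with hypothetical layer sizes and a plain linear head standing in for the VAE encoder: the discriminator learns to predict the attribute bin from the latent code, and the encoder is then updated to defeat it, so the attribute must be supplied as an explicit control input instead.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the thesis abstract does not give the actual sizes.
LATENT, N_BINS = 32, 8

encoder_head = nn.Linear(128, LATENT)            # stands in for the VAE encoder output
discriminator = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, N_BINS))

opt_enc = torch.optim.Adam(encoder_head.parameters(), lr=1e-4)
opt_dis = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

def adversarial_step(h: torch.Tensor, attr_bin: torch.Tensor):
    """One step of the 'remove attribute information from the latent' game."""
    # 1) Discriminator step: predict the attribute bin from a detached latent.
    z = encoder_head(h)
    d_loss = ce(discriminator(z.detach()), attr_bin)
    opt_dis.zero_grad(); d_loss.backward(); opt_dis.step()

    # 2) Encoder step: maximize the discriminator's loss so z carries no attribute info.
    e_loss = -ce(discriminator(encoder_head(h)), attr_bin)
    opt_enc.zero_grad(); e_loss.backward(); opt_enc.step()
    return d_loss.item(), e_loss.item()

# Toy usage with random data.
h = torch.randn(16, 128)                         # melody encodings from the VAE encoder
attr_bin = torch.randint(0, N_BINS, (16,))       # e.g., rhythm-density bin labels
print(adversarial_step(h, attr_bin))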
Finally, a hierarchical song structure generation model is proposed. A sequence-to-sequence framework is adopted to capture the shared mood between two parts of the same song. The time axis is compressed by applying attention with different lengths of query and key to model the hierarchy of music. The concept of musical contrast is implemented by controlling attributes with relative bin information. The human evaluation results suggest that the sequence-to-sequence framework can address the problem of generating different structures of the same song, and show that the proposed model can create song structures with musical contrast.
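The abstract only states that the time axis is compressed by attention whose query and key lengths differ; the sketch below shows one plausible reading of that idea, with Funnel-Transformer-style average pooling of the queries. The stride and the pooling choice are assumptions, not the thesis's stated design.

```python
import torch
import torch.nn as nn

def compress_time_axis(x: torch.Tensor, attn: nn.MultiheadAttention, stride: int = 4) -> torch.Tensor:
    """Illustrative time-axis compression: queries are pooled to be shorter than keys/values,
    so the attention output has fewer time steps than its input."""
    # x: (batch, time, dim). Pool the queries so there are fewer of them than keys/values.
    q = nn.functional.avg_pool1d(x.transpose(1, 2), kernel_size=stride, stride=stride).transpose(1, 2)
    out, _ = attn(q, x, x)          # queries: compressed; keys/values: full resolution
    return out                      # (batch, time // stride, dim)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
notes = torch.randn(2, 64, 64)      # e.g., 64 time steps of a melody encoding
print(compress_time_axis(notes, attn).shape)   # torch.Size([2, 16, 64])
```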
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Objectives 4
1.3 Thesis Outline 6
Chapter 2 Literature Review 7
2.1 Chord-conditioned Melody Generation 7
2.2 Attention Mechanism and Transformer 10
2.2.1 Attention Mechanism 10
2.2.2 Transformer 10
2.2.3 Relative Positional Embedding 12
2.2.4 Funnel-Transformer 14
2.3 Attribute Controllable Music Generation 16
Chapter 3 Problem Definition 17
3.1 Data Representation 17
3.1.1 Datasets 18
3.1.2 Preprocessing 19
3.2 Notation and Formulas 21
3.2.1 Chord-conditioned Melody Generation 21
3.2.2 Attribute Controllable Melody Generation 22
3.2.3 Song Structure Generation 22
3.2.4 Notation 22
Chapter 4 Chord-conditioned Melody Generation 24
4.1 Methodology 24
4.1.1 Model Architecture 24
4.1.2 Relative Positional Embedding 27
4.2 Training and Generation 29
4.2.1 Two-phase Training 30
4.2.2 Pitch-varied Rhythm Data 30
4.2.3 Generating Melodies 31
4.3 Experiments 32
4.3.1 Experiment Settings 32
4.3.2 Baseline Models 33
4.4 Evaluation Results 34
4.4.1 Quantitative Evaluation 34
4.4.2 Qualitative Evaluation 42
Chapter 5 Attribute Controllable Melody Generation 48
5.1 Attribute Definition 48
5.1.1 Pitch-Related Attributes 48
5.1.2 Rhythm-Related Attributes 49
5.2 Model Architecture 51
5.3 Experiments 54
5.3.1 Data Preprocessing 54
5.3.2 Training 56
5.4 Results 58
5.4.1 Quantitative Results 58
5.4.2 Output Examples 60
Chapter 6 Hierarchical Song Structure Generation 68
6.1 Baseline 69
6.2 Proposed Model 70
6.2.1 Relative Hierarchical Attention 70
6.2.2 Model Architecture 78
6.3 Experiments 84
6.3.1 Training and Generation 84
6.3.2 Human Evaluation 85
6.4 Evaluation Results 86
6.4.1 Control Success Ratio 86
6.4.2 Human Perception Ratio 86
6.4.3 Generated Samples 88
Chapter 7 Conclusion 104
7.1 Summary and Contributions 104
7.2 Limitations and Future Research 107
Appendices 108
Chapter A MGEval Results Between the Music of Different Genres 109
Chapter B MGEval Results of CMT and Baseline Models 116
Chapter C Samples Generated by CMT 126
Bibliography 129
Abstract (in Korean) 144
Rhythm, Chord and Melody Generation for Lead Sheets using Recurrent Neural Networks
Music that is generated by recurrent neural networks often lacks a sense of
direction and coherence. We therefore propose a two-stage LSTM-based model for
lead sheet generation, in which the harmonic and rhythmic templates of the song
are produced first, after which, in a second stage, a sequence of melody notes
is generated conditioned on these templates. A subjective listening test shows
that our approach outperforms the baselines and increases perceived musical
coherence.
Comment: 8 pages, 2 figures, 3 tables, 2 appendices
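A minimal sketch of the two-stage wiring described above, with assumed vocabulary and layer sizes: one LSTM models the harmonic and rhythmic template, and a second LSTM predicts melody notes while seeing that template at every step. This is an illustration of the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the paper's actual vocabularies and layer widths are not given here.
N_CHORDS, N_RHYTHM, N_PITCH, H = 24, 16, 48, 128

class TwoStageLeadSheet(nn.Module):
    """Stage 1 produces a harmonic/rhythmic template; stage 2 produces melody notes
    conditioned on that template."""
    def __init__(self):
        super().__init__()
        self.template_lstm = nn.LSTM(N_CHORDS + N_RHYTHM, H, batch_first=True)
        self.template_head = nn.Linear(H, N_CHORDS + N_RHYTHM)
        self.melody_lstm = nn.LSTM(N_CHORDS + N_RHYTHM + N_PITCH, H, batch_first=True)
        self.melody_head = nn.Linear(H, N_PITCH)

    def forward(self, template_in, melody_in):
        # Stage 1: predict the next chord/rhythm symbols of the template.
        t_hidden, _ = self.template_lstm(template_in)
        template_logits = self.template_head(t_hidden)
        # Stage 2: predict melody notes, conditioned on the template at every step.
        m_hidden, _ = self.melody_lstm(torch.cat([template_in, melody_in], dim=-1))
        melody_logits = self.melody_head(m_hidden)
        return template_logits, melody_logits

model = TwoStageLeadSheet()
template = torch.randn(2, 32, N_CHORDS + N_RHYTHM)   # 32 time steps of chord+rhythm features
melody = torch.randn(2, 32, N_PITCH)
t_logits, m_logits = model(template, melody)
print(t_logits.shape, m_logits.shape)
```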
Emotion-Conditioned Melody Harmonization with Hierarchical Variational Autoencoder
Existing melody harmonization models have made great progress in improving
the quality of generated harmonies, but most of them ignored the emotions
beneath the music. Meanwhile, the variability of harmonies generated by
previous methods is insufficient. To solve these problems, we propose a novel
LSTM-based Hierarchical Variational Auto-Encoder (LHVAE) to investigate the
influence of emotional conditions on melody harmonization, while improving the
quality of generated harmonies and capturing the abundant variability of chord
progressions. Specifically, LHVAE incorporates latent variables and emotional
conditions at different levels (piece- and bar-level) to model the global and
local music properties. Additionally, we introduce an attention-based melody
context vector at each step to better learn the correspondence between melodies
and harmonies. Experimental results of the objective evaluation show that our
proposed model outperforms other LSTM-based models. Through subjective
evaluation, we conclude that only altering the chords hardly changes the
overall emotion of the music. The qualitative analysis demonstrates the ability
of our model to generate variable harmonies.
Comment: Accepted by IEEE SMC 202
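The attention-based melody context vector can be pictured as ordinary dot-product attention from the harmony decoder state over the encoded melody frames; the sketch below uses assumed dimensions and is illustrative rather than the paper's exact implementation.

```python
import torch

def melody_context_vector(dec_state: torch.Tensor, melody_enc: torch.Tensor) -> torch.Tensor:
    """At each harmonization step, the decoder state attends over the encoded melody
    and receives a weighted summary of it."""
    # dec_state: (batch, dim), melody_enc: (batch, time, dim)
    scores = torch.bmm(melody_enc, dec_state.unsqueeze(-1)).squeeze(-1)    # (batch, time)
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), melody_enc).squeeze(1)          # (batch, dim)

dec_state = torch.randn(4, 64)
melody_enc = torch.randn(4, 32, 64)
print(melody_context_vector(dec_state, melody_enc).shape)   # torch.Size([4, 64])
```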
Emotion-Guided Music Accompaniment Generation Based on Variational Autoencoder
Music accompaniment generation is a crucial aspect of the composition process. Deep neural networks have made significant strides in this field, but it remains a challenge for AI to effectively incorporate human emotions to create beautiful accompaniments. Existing models struggle to characterize human emotions effectively while composing music. To address this issue, we propose the use of an easy-to-represent emotion-flow model, the Valence/Arousal Curve, which makes emotional information compatible with the model through data transformation and enhances the interpretability of emotional factors by using a Variational Autoencoder as the model structure. Further, we use relative self-attention to maintain the structure of the music at the phrase level and, in combination with rules of music theory, to generate a richer accompaniment.
Comment: Accepted by International Joint Conference on Neural Networks 2023 (IJCNN2023)
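One simple way to realize the described conditioning, sketched here under assumed sizes and without claiming it matches the paper's architecture, is to concatenate the per-bar valence/arousal values of the emotion curve to the VAE latent code at every decoder step.

```python
import torch
import torch.nn as nn

# Minimal sketch: condition an accompaniment decoder on a valence/arousal curve by
# concatenating the per-bar emotion values to the latent code at every step.
# Layer sizes and the concatenation scheme are assumptions made for illustration.
LATENT, EMOTION, VOCAB, H = 64, 2, 96, 128

decoder = nn.LSTM(LATENT + EMOTION, H, batch_first=True)
output_head = nn.Linear(H, VOCAB)

z = torch.randn(1, LATENT)                    # latent code from a VAE encoder
va_curve = torch.tensor([[0.8, 0.6], [0.7, 0.5], [0.2, 0.3], [0.1, 0.2]])  # (bars, [valence, arousal])

steps = va_curve.shape[0]
dec_in = torch.cat([z.expand(steps, -1), va_curve], dim=-1).unsqueeze(0)   # (1, bars, LATENT+2)
hidden, _ = decoder(dec_in)
logits = output_head(hidden)                  # (1, bars, VOCAB): one token distribution per bar
print(logits.shape)
```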