
    Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

    Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models can directly generate music to match an accompanying video. In this work, we develop a generative music AI framework, Video2Music, that generates music to match a provided video. We first curated a unique collection of music videos and analysed them to obtain semantic, scene-offset, motion, and emotion features. These distinct features are then employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. The model includes a novel mechanism to enforce affective similarity between video and music. Finally, in a post-processing step, a biGRU-based regression model estimates note density and loudness from the video features, ensuring a dynamic rendering of the generated chords with varying rhythm and volume. In a thorough experiment, we show that the proposed framework can generate music that matches the video content in terms of emotion. The musical quality, along with the quality of music-video matching, is confirmed in a user study. The proposed AMT model, together with the new MuVi-Sync dataset, presents a promising step for the new task of music generation for videos.
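    The post-processing step above pairs video features with a biGRU regressor that predicts note density and loudness. As a rough illustration only, the sketch below shows what such a bidirectional GRU regressor could look like in PyTorch; the feature dimension, hidden size, and module names are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class BiGRURegressor(nn.Module):
    """Bidirectional GRU that maps per-frame video features to per-frame
    note-density and loudness estimates (illustrative sketch only)."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        # Two regression targets per frame: note density and loudness.
        self.head = nn.Linear(2 * hidden_dim, 2)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, feat_dim)
        out, _ = self.gru(video_feats)   # (batch, frames, 2 * hidden_dim)
        return self.head(out)            # (batch, frames, 2)

# Usage: predict dynamics for a 30-frame clip of hypothetical 512-d features.
model = BiGRURegressor()
preds = model(torch.randn(1, 30, 512))
note_density, loudness = preds[..., 0], preds[..., 1]
```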

    Transformers in Machine Learning: Literature Review

    In this study, the researchers present a review of methods in transformer-based machine learning. Transformers are neural network architectures that process sequential inputs and have been applied in many studies to a wide range of objects. The transformer is a deep learning architecture that can be modified, and its attention mechanism learns contextual relationships between words. Reported applications include text compression for readings, recognition of chemical images with an accuracy of 96%, and emotion detection, for example classifying social media conversations on Facebook into happy, sad, and angry categories. Figure 1 of the study illustrates how the encoder and decoder process the input and produce the output. The purpose of this study is to review literature from various journals that discuss transformers, presenting for each work the subject or dataset, the data analysis method, the year, and the accuracy achieved. Using this approach, the researchers draw conclusions about the highest accuracies reported and identify opportunities for further research.
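    Because the review repeatedly returns to the attention mechanism as the way transformers learn contextual relationships between words, a minimal NumPy sketch of scaled dot-product attention may help make the idea concrete; the toy dimensions and random inputs below are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation: each query attends to all keys,
    and the values are mixed by the resulting softmax weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # contextualised outputs

# Toy self-attention over 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```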

    MARBLE: Music Audio Representation Benchmark for Universal Evaluation

    In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal, community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels: acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks over 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-source pre-trained models developed on music recordings as baselines. In addition, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues for the datasets. Results suggest that recently proposed large-scale pre-trained musical language models perform best on most tasks, with room for further improvement. The leaderboard and toolkit repository are published at this https URL to promote future music AI research.
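    MARBLE's unified protocol essentially probes frozen pre-trained representations on downstream MIR tasks. The sketch below illustrates the general idea of such a linear probe with scikit-learn; the embedding size, label counts, and random stand-in data are assumptions and are not part of the MARBLE toolkit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical pre-computed data: one frozen embedding per audio clip.
rng = np.random.default_rng(0)
train_emb, train_y = rng.normal(size=(800, 768)), rng.integers(0, 10, 800)
test_emb, test_y = rng.normal(size=(200, 768)), rng.integers(0, 10, 200)

# Linear probe: the pre-trained model stays frozen; only this classifier is fit.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_emb, train_y)
print("probe accuracy:", accuracy_score(test_y, probe.predict(test_emb)))
```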

    A Study on Improving Conditional Generation of Musical Components: Focusing on Chords and Expression (음악적 요소에 대한 조건부 생성의 개선에 관한 연구: 화음과 표현을 중심으로)

    Doctoral dissertation, Graduate School of Convergence Science and Technology, Seoul National University (Department of Transdisciplinary Studies, Digital Contents and Information Studies), February 2023. Advisor: Kyogu Lee.

    Conditional generation of musical components (CGMC) creates a part of music based on partial musical components such as a melody or chords. CGMC is beneficial for discovering complex relationships among musical attributes, and it can assist non-experts who face difficulties in making music. However, recent studies on CGMC still face two challenges in terms of generation quality and model controllability. First, the structure of the generated music is not robust. Second, only limited ranges of musical factors and tasks have been examined as targets for flexible control of generation. In this thesis, we aim to mitigate these two challenges to improve CGMC systems. For musical structure, we focus on intuitive modeling of musical hierarchy to help the model explicitly learn musically meaningful dependencies. To this end, we utilize alignment paths between the raw music data and musical units such as notes or chords. For musical creativity, we facilitate smooth control of novel musical attributes using latent representations. We attempt to achieve disentangled representations of the intended factors by regularizing them with data-driven inductive bias. The thesis verifies the proposed approaches in two representative CGMC tasks, melody harmonization and expressive performance rendering. A variety of experimental results demonstrate the potential of the proposed approaches to expand musical creativity while maintaining stable generation quality.

    Table of contents:
    Chapter 1 Introduction. 1.1 Motivation; 1.2 Definitions; 1.3 Tasks of Interest (Generation Quality, Controllability); 1.4 Approaches (Modeling Musical Hierarchy, Regularizing Latent Representations, Target Tasks); 1.5 Outline of the Thesis
    Chapter 2 Background. 2.1 Music Generation Tasks (Melody Harmonization, Expressive Performance Rendering); 2.2 Structure-enhanced Music Generation (Hierarchical Music Generation, Transformer-based Music Generation); 2.3 Disentanglement Learning (Unsupervised, Supervised, and Self-supervised Approaches); 2.4 Controllable Music Generation (Score Generation, Performance Rendering); 2.5 Summary
    Chapter 3 Translating Melody to Chord: Structured and Flexible Harmonization of Melody with Transformer. 3.1 Introduction; 3.2 Proposed Methods (Standard Transformer Model (STHarm), Variational Transformer Model (VTHarm), Regularized Variational Transformer Model (rVTHarm), Training Objectives); 3.3 Experimental Settings (Datasets, Comparative Methods, Training, Metrics); 3.4 Evaluation (Chord Coherence and Diversity, Harmonic Similarity to Human, Controlling Chord Complexity, Subjective Evaluation, Qualitative Results, Ablation Study); 3.5 Conclusion and Future Work
    Chapter 4 Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-supervised Learning. 4.1 Introduction; 4.2 Proposed Methods (Data Representation, Modeling Musical Hierarchy, Overall Network Architecture, Regularizing the Latent Variables, Overall Objective); 4.3 Experimental Settings (Dataset and Implementation, Comparative Methods); 4.4 Evaluation (Generation Quality, Disentangling Latent Representations, Controllability of Expressive Attributes, KL Divergence, Ablation Study, Subjective Evaluation, Qualitative Examples, Extent of Control); 4.5 Conclusion
    Chapter 5 Conclusion and Future Work. 5.1 Conclusion; 5.2 Future Work (Deeper Investigation of Controllable Factors, More Analysis of Qualitative Evaluation Results, Improving Diversity and Scale of Dataset)
    Bibliography; Abstract in Korean
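    One of the thesis's two threads is regularizing latent representations so that intended musical factors (for example, chord complexity in rVTHarm) become controllable. The toy sketch below shows one generic way such a regularized variational bottleneck can be set up in PyTorch; the dimensions, the tie between a single attribute and the first latent coordinate, and the module names are illustrative assumptions rather than the thesis's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttributeRegularizer(nn.Module):
    """Toy VAE-style bottleneck whose first latent dimension is encouraged to
    track a musical attribute (e.g. chord complexity), so that the attribute
    can be dialled at generation time. Everything here is illustrative."""

    def __init__(self, in_dim: int = 64, z_dim: int = 16):
        super().__init__()
        self.to_mu = nn.Linear(in_dim, z_dim)
        self.to_logvar = nn.Linear(in_dim, z_dim)

    def forward(self, h, attribute):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterise
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # Regularisation: the first latent coordinate should follow the attribute.
        attr_loss = F.mse_loss(z[:, 0], attribute)
        return z, kl + attr_loss

# Usage with a hypothetical encoder output and chord-complexity values in [0, 1].
h = torch.randn(8, 64)
complexity = torch.rand(8)
reg = LatentAttributeRegularizer()
z, reg_loss = reg(h, complexity)
```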

    Structural Melody Generation Based on Attribute-Controllable Deep Neural Networks (특성 조절이 가능한 심층 신경망 기반의 구조적 멜로디 생성)

    Doctoral dissertation, Department of Industrial Engineering, College of Engineering, Seoul National University, August 2021. Advisor: Jonghun Park.

    This thesis aims to generate structural melodies using attribute-controllable deep neural networks. The development of music-composing artificial intelligence can inspire professional composers, reduce the difficulty of creating music, and provide the public with music that can be combined with and utilized in various media content. For a melody generation model to function as a composer, it must control specific desired characteristics. These characteristics include quantifiable attributes, such as pitch level and rhythm density, as well as chords, which together with melodies are essential elements of modern popular (pop) music. First, this thesis introduces a melody generation model that separately produces rhythm and pitch conditioned on chord progressions. The quantitative evaluation results demonstrate that the melodies produced by the proposed model have a distribution more similar to the dataset than those of other baseline models. Qualitative analysis reveals the presence of repetition and variation within the generated melodies. Based on a subjective human listening test, we conclude that the model successfully produces new melodies that sound pleasant in rhythm and pitch. Next, four quantifiable attributes are considered: pitch level, pitch variety, rhythm density, and rhythm variety. We improve on a previous approach that trains a variational autoencoder (VAE) and a discriminator in an adversarial manner to eliminate attribute information from the encoded latent variable. Rhythm and pitch VAEs are trained separately so that pitch- and rhythm-related attributes can be controlled entirely independently. The experimental results indicate that, although the ratio of outputs falling into the intended attribute bin is not high, the model learns the relative order between the bins. Finally, a hierarchical song structure generation model is proposed. A sequence-to-sequence framework is adopted to capture the similar mood between two parts of the same song. The time axis is compressed by applying attention with different lengths of query and key to model the hierarchy of music. The concept of musical contrast is implemented by controlling attributes with relative bin information. The human evaluation results suggest that the sequence-to-sequence framework can solve the problem of generating different structures of the same song, and they show that the proposed model can create song structures with musical contrasts.

    Table of contents:
    Chapter 1 Introduction. 1.1 Background and Motivation; 1.2 Objectives; 1.3 Thesis Outline
    Chapter 2 Literature Review. 2.1 Chord-conditioned Melody Generation; 2.2 Attention Mechanism and Transformer (Attention Mechanism, Transformer, Relative Positional Embedding, Funnel-Transformer); 2.3 Attribute Controllable Music Generation
    Chapter 3 Problem Definition. 3.1 Data Representation (Datasets, Preprocessing); 3.2 Notation and Formulas (Chord-conditioned Melody Generation, Attribute Controllable Melody Generation, Song Structure Generation, Notation)
    Chapter 4 Chord-conditioned Melody Generation. 4.1 Methodology (Model Architecture, Relative Positional Embedding); 4.2 Training and Generation (Two-phase Training, Pitch-varied Rhythm Data, Generating Melodies); 4.3 Experiments (Experiment Settings, Baseline Models); 4.4 Evaluation Results (Quantitative Evaluation, Qualitative Evaluation)
    Chapter 5 Attribute Controllable Melody Generation. 5.1 Attribute Definition (Pitch-Related Attributes, Rhythm-Related Attributes); 5.2 Model Architecture; 5.3 Experiments (Data Preprocessing, Training); 5.4 Results (Quantitative Results, Output Examples)
    Chapter 6 Hierarchical Song Structure Generation. 6.1 Baseline; 6.2 Proposed Model (Relative Hierarchical Attention, Model Architecture); 6.3 Experiments (Training and Generation, Human Evaluation); 6.4 Evaluation Results (Control Success Ratio, Human Perception Ratio, Generated Samples)
    Chapter 7 Conclusion. 7.1 Summary and Contributions; 7.2 Limitations and Future Research
    Appendices: A MGEval Results Between the Music of Different Genres; B MGEval Results of CMT and Baseline Models; C Samples Generated by CMT
    Bibliography; Abstract in Korean
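    The attribute-controllable part of this thesis builds on adversarial training that removes attribute information from a VAE's latent variable. The sketch below illustrates one generic form of that adversarial step in PyTorch: a discriminator tries to read a quantised attribute bin out of the latent code, and the encoder is penalised when it succeeds. All module definitions, sizes, and data here are illustrative placeholders, not the thesis's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy adversarial disentanglement step (placeholders throughout).
z_dim, n_bins = 16, 8
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, z_dim))
discriminator = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, n_bins))

enc_opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

melody_feats = torch.randn(32, 64)            # hypothetical encoder inputs
attr_bins = torch.randint(0, n_bins, (32,))   # quantised attribute labels

# 1) Train the discriminator to recognise the attribute bin from the latent code.
z = encoder(melody_feats).detach()
disc_loss = F.cross_entropy(discriminator(z), attr_bins)
disc_opt.zero_grad(); disc_loss.backward(); disc_opt.step()

# 2) Train the encoder adversarially, pushing attribute information out of z.
z = encoder(melody_feats)
adv_loss = -F.cross_entropy(discriminator(z), attr_bins)
enc_opt.zero_grad(); adv_loss.backward(); enc_opt.step()
```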

    Controllable music performance synthesis via hierarchical modelling

    Musical expression requires control of both what notes are played and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a three-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or to utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and, as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools that empower individuals across a diverse range of musical experience.
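    The three-level hierarchy (notes, performance, synthesis) is the key structural idea here. The sketch below mimics that structure with plain Python dataclasses and placeholder priors to show where a user could intervene at each level; the class and function names are hypothetical and do not reflect the actual MIDI-DDSP codebase.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    pitch: int        # MIDI pitch number
    start: float      # onset time in seconds
    duration: float   # duration in seconds

@dataclass
class PerformanceAttrs:
    vibrato: float
    dynamics: float
    articulation: float

def performance_prior(notes: List[Note]) -> List[PerformanceAttrs]:
    """Predict expressive attributes for each note (placeholder heuristic)."""
    return [PerformanceAttrs(vibrato=0.2, dynamics=0.7, articulation=0.5)
            for _ in notes]

def synthesis_prior(attrs: List[PerformanceAttrs]) -> List[dict]:
    """Map note-level expression to frame-level synthesis controls (placeholder)."""
    return [{"f0_shift": a.vibrato * 0.1, "amplitude": a.dynamics} for a in attrs]

# A user can intervene at any level before passing the result downward:
notes = [Note(60, 0.0, 0.5), Note(64, 0.5, 0.5)]
attrs = performance_prior(notes)
attrs[1].dynamics = 1.0            # e.g. accent the second note by hand
controls = synthesis_prior(attrs)
```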

    Joint Chord and Key Estimation Based on a Hierarchical Variational Autoencoder with Multi-task Learning

    This paper describes a deep generative approach to joint chord and key estimation for music signals. The limited amount of music signals with complete annotations has been the major bottleneck in supervised multi-task learning of a classification model. To overcome this limitation, we integrate the supervised multi-task learning approach with the unsupervised autoencoding approach in a mutually complementary manner. Considering the typical process of music composition, we formulate a hierarchical latent variable model that sequentially generates keys, chords, and chroma vectors. The keys and chords are assumed to follow a language model that represents their relationships and dynamics. In the framework of amortized variational inference (AVI), we introduce a classification model that jointly infers discrete chord and key labels and a recognition model that infers continuous latent features. These models are combined to form a variational autoencoder (VAE) and are trained jointly in a (semi-)supervised manner, where the generative and language models act as regularizers for the classification model. We comprehensively investigate three different architectures for the chord and key classification model, and three different architectures for the language model. Experimental results demonstrate that the VAE-based multi-task learning improves both chord estimation and key estimation.
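    The training objective combines an autoencoding term with classification terms that are available only for annotated examples. The sketch below expresses that (semi-)supervised combination in PyTorch in a generic form; the chroma dimensionality, label vocabularies, and equal loss weights are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(chroma, chroma_recon, mu, logvar,
                         chord_logits, key_logits,
                         chord_labels=None, key_labels=None):
    """Toy (semi-)supervised VAE objective: reconstruction and KL terms apply to
    every example, and classification terms are added only when chord/key
    annotations exist. Shapes and weights are illustrative."""
    recon = F.mse_loss(chroma_recon, chroma)                       # chroma reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # latent prior regulariser
    loss = recon + kl
    if chord_labels is not None:
        loss = loss + F.cross_entropy(chord_logits, chord_labels)  # supervised chord term
    if key_labels is not None:
        loss = loss + F.cross_entropy(key_logits, key_labels)      # supervised key term
    return loss

# Usage with random stand-in tensors (12-d chroma, 24 keys, 25 chord classes).
B = 4
loss = semi_supervised_loss(
    chroma=torch.rand(B, 12), chroma_recon=torch.rand(B, 12),
    mu=torch.zeros(B, 16), logvar=torch.zeros(B, 16),
    chord_logits=torch.randn(B, 25), key_logits=torch.randn(B, 24),
    chord_labels=torch.randint(0, 25, (B,)), key_labels=torch.randint(0, 24, (B,)))
```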