179 research outputs found

    Deep Neural Network-Based Automatic Music Lead Sheet Transcription and Melody Similarity Assessment

    Doctoral dissertation -- Graduate School of Seoul National University: College of Engineering, Department of Industrial Engineering, 2023. 2. 이경식. Because the digitization of the music industry has made composing, arranging, and distributing music convenient, the number of newly released recordings keeps increasing. Recently, as platforms have emerged on which anyone can become a creator, user-created music such as original songs, cover songs, and remixes is distributed through YouTube and TikTok. Given this large volume of recordings, musicians have always had a demand for transcribing music into sheet music; however, transcription requires musical knowledge and is time-consuming. This thesis studies automatic lead sheet transcription using deep neural networks. Transcription artificial intelligence (AI) can greatly reduce the time and cost for people in the music industry to find or transcribe sheet music. In addition, because audio recordings can be converted into digital scores, applications such as music plagiarism detection and music composition AI become possible. The thesis first proposes a model that recognizes chords from audio signals. Chord recognition is an important task in music information retrieval since chords are highly abstract and descriptive features of music. We utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Through an attention map analysis, we visualize how the attention is applied, and it turns out that the model is able to segment chords by utilizing the adaptive receptive field of the attention mechanism. The thesis then proposes a note-level singing melody transcription model using sequence-to-sequence Transformers. Overlapping decoding is introduced to solve the problem of context being broken between segments. Applying pitch augmentation and adding a noisy dataset with data cleansing turn out to be effective in preventing overfitting and improving generalization. Ablation studies demonstrate the effects of the proposed techniques in note-level singing melody transcription, both quantitatively and qualitatively. The proposed model outperforms other models in note-level singing melody transcription for all the metrics considered, and a subjective human evaluation shows that its results are perceived as more accurate than those of a previous study. Building on these results, we present the entire process of automatic music lead sheet transcription. By combining the various kinds of musical information recognized from audio signals, we show that it is possible to transcribe lead sheets that capture the core of popular music, and we compare the results with lead sheets transcribed by musicians. Finally, we propose a melody similarity assessment method based on self-supervised learning that builds on the automatic lead sheet transcription. We present convolutional neural networks that embed the melodies of transcribed lead sheets in an embedding space. To apply self-supervised learning, we introduce methods for generating training data with musical data augmentation techniques, together with a loss function that exploits the generated data.
Experimental results demonstrate that the proposed model is able to detect similar melodies of popular music from plagiarism and cover song cases.
Table of contents:
Chapter 1 Introduction: 1.1 Background and Motivation; 1.2 Objectives; 1.3 Thesis Outline
Chapter 2 Literature Review: 2.1 Attention Mechanism and Transformers (2.1.1 Attention-based Models; 2.1.2 Transformers with Musical Event Sequence); 2.2 Chord Recognition; 2.3 Note-level Singing Melody Transcription; 2.4 Musical Key Estimation; 2.5 Beat Tracking; 2.6 Music Plagiarism Detection and Cover Song Identification; 2.7 Deep Metric Learning and Triplet Loss
Chapter 3 Problem Definition: 3.1 Lead Sheet Transcription (3.1.1 Chord Recognition; 3.1.2 Singing Melody Transcription; 3.1.3 Post-processing for Lead Sheet Representation); 3.2 Melody Similarity Assessment
Chapter 4 A Bi-directional Transformer for Musical Chord Recognition: 4.1 Methodology (4.1.1 Model Architecture; 4.1.2 Self-attention in Chord Recognition); 4.2 Experiments (4.2.1 Datasets; 4.2.2 Preprocessing; 4.2.3 Evaluation Metrics; 4.2.4 Training); 4.3 Results (4.3.1 Quantitative Evaluation; 4.3.2 Attention Map Analysis)
Chapter 5 Note-level Singing Melody Transcription: 5.1 Methodology (5.1.1 Monophonic Note Event Sequence; 5.1.2 Audio Features; 5.1.3 Model Architecture; 5.1.4 Autoregressive Decoding and Monophonic Masking; 5.1.5 Overlapping Decoding; 5.1.6 Pitch Augmentation; 5.1.7 Adding Noisy Dataset with Data Cleansing); 5.2 Experiments (5.2.1 Dataset; 5.2.2 Experiment Configurations; 5.2.3 Evaluation Metrics; 5.2.4 Comparison Models; 5.2.5 Human Evaluation); 5.3 Results (5.3.1 Ablation Study; 5.3.2 Note-level Transcription Model Comparison; 5.3.3 Transcription Performance Distribution Analysis; 5.3.4 Fundamental Frequency (F0) Metric Evaluation); 5.4 Qualitative Analysis (5.4.1 Visualization of Ablation Study; 5.4.2 Spectrogram Analysis; 5.4.3 Human Evaluation)
Chapter 6 Automatic Music Lead Sheet Transcription: 6.1 Post-processing for Lead Sheet Representation; 6.2 Lead Sheet Transcription Results
Chapter 7 Melody Similarity Assessment with Self-supervised Convolutional Neural Networks: 7.1 Methodology (7.1.1 Input Data Representation; 7.1.2 Data Augmentation; 7.1.3 Model Architecture; 7.1.4 Loss Function; 7.1.5 Definition of Distance between Songs); 7.2 Experiments (7.2.1 Dataset; 7.2.2 Training; 7.2.3 Evaluation Metrics); 7.3 Results (7.3.1 Quantitative Evaluation; 7.3.2 Qualitative Evaluation)
Chapter 8 Conclusion: 8.1 Summary and Contributions; 8.2 Limitations and Future Research
Bibliography; Abstract (in Korean)
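    The melody-similarity part of this thesis (Chapter 7 above) describes convolutional networks that embed transcribed melodies, trained on data generated by musical augmentation with a deep-metric-learning loss (the table of contents lists triplet loss). As a rough illustration only — the piano-roll input, architecture, margin, and the choice of pitch shift as the augmentation are assumptions, not the thesis's implementation — a minimal PyTorch sketch of such a training step might look like this:

```python
# Minimal sketch (not the thesis implementation): a CNN embeds a
# (pitch x time) piano-roll melody segment and is trained with a triplet
# loss whose positive is a pitch-shifted copy of the anchor and whose
# negative comes from a different song.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelodyEncoder(nn.Module):
    """Embeds a (batch, 1, pitch, time) piano roll into a unit-norm vector."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.conv(x).flatten(1)
        return F.normalize(self.fc(z), dim=-1)

def pitch_shift(roll: torch.Tensor, semitones: int) -> torch.Tensor:
    """Toy augmentation: roll the pitch axis of a (B, 1, pitch, time) tensor."""
    return torch.roll(roll, shifts=semitones, dims=2)

encoder = MelodyEncoder()
triplet = nn.TripletMarginLoss(margin=0.3)

anchor = torch.rand(8, 1, 128, 64)           # hypothetical batch of melodies
positive = pitch_shift(anchor, semitones=2)  # augmented view of the same melody
negative = torch.rand(8, 1, 128, 64)         # melodies from different songs

loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```

    Because the positive here is a transposed copy of the anchor, the learned embedding is pushed to tolerate key changes, which is the kind of invariance plagiarism and cover-song cases require.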

    Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

    Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that generates music to match a provided video. We first curated a unique collection of music videos. We then analysed the music videos to obtain semantic, scene offset, motion, and emotion features. These distinct features are employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. The result is a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music. Finally, post-processing is performed with a biGRU-based regression model that estimates note density and loudness from the video features, ensuring a dynamic rendering of the generated chords with varying rhythm and volume. In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion. The musical quality, along with the quality of music-video matching, is confirmed in a user study. The proposed AMT model, together with the new MuVi-Sync dataset, presents a promising step for the new task of music generation for videos.
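    The post-processing step described above (a biGRU-based regression model estimating note density and loudness from video features) can be pictured with a minimal PyTorch sketch; the feature dimension, hidden size, and frame count below are assumptions, not the authors' settings:

```python
# Minimal sketch (assumed dimensions, not the authors' code): a bidirectional
# GRU maps a sequence of per-frame video features to per-frame estimates of
# note density and loudness, used to post-process the generated chords.
import torch
import torch.nn as nn

class BiGRURegressor(nn.Module):
    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # [note_density, loudness] per frame

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(video_feats)        # (batch, time, 2 * hidden)
        return self.head(out)                 # (batch, time, 2)

model = BiGRURegressor()
feats = torch.rand(4, 120, 256)   # hypothetical batch: 4 clips, 120 frames each
pred = model(feats)               # per-frame density and loudness estimates
print(pred.shape)                 # torch.Size([4, 120, 2])
```

    The two per-frame outputs would then control how densely and how loudly the generated chords are rendered over time.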

    Generating Chord Progression from Melody with Flexible Harmonic Rhythm and Controllable Harmonic Density

    Melody harmonization, which involves generating a chord progression that complements a user-provided melody, continues to pose a significant challenge. A chord progression must not only be in harmony with the melody but also interact with its rhythmic pattern. While previous neural network-based systems have been successful in producing chord progressions for given melodies, they have not adequately addressed controllable melody harmonization, nor have they focused on generating harmonic rhythms with flexibility in the rates or patterns of chord changes. This paper presents AutoHarmonizer, a novel system for harmonic-density-controllable melody harmonization with such a flexible harmonic rhythm. AutoHarmonizer is equipped with an extensive vocabulary of 1,462 chord types and can generate chord progressions that vary in harmonic density for a given melody. Experimental results indicate that the AutoHarmonizer-generated chord progressions exhibit a diverse range of harmonic rhythms and that the system's controllable harmonic density is effective. Comment: 12 pages, 6 figures, 1 table; accepted by EURASIP JASM.
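    The abstract does not explain how AutoHarmonizer realizes density control internally, so the following is only a generic illustration of the idea of a controllable harmonic rhythm — a user-set coefficient that biases per-beat chord-change decisions — and not the paper's algorithm; the function names and the odds-scaling rule are hypothetical:

```python
# Illustrative sketch only (not AutoHarmonizer's actual algorithm): a density
# coefficient gamma biases per-beat chord-change probabilities, so larger
# gamma yields more frequent chord changes (a denser harmonic rhythm).
import numpy as np

rng = np.random.default_rng(0)

def decode_harmonic_rhythm(change_probs: np.ndarray, gamma: float) -> list[int]:
    """Given per-beat probabilities of a chord change, return chord segment ids.

    gamma > 1 raises the effective change probability (denser harmony),
    gamma < 1 lowers it (sparser harmony).
    """
    segment, labels = 0, []
    for p in change_probs:
        # Odds-scaling keeps the biased probability inside (0, 1).
        biased = (gamma * p) / (gamma * p + (1.0 - p))
        if rng.random() < biased:
            segment += 1        # start a new chord on this beat
        labels.append(segment)
    return labels

probs = rng.uniform(0.05, 0.4, size=16)          # hypothetical per-beat change probs
print(decode_harmonic_rhythm(probs, gamma=0.5))  # sparse: few chord changes
print(decode_harmonic_rhythm(probs, gamma=3.0))  # dense: many chord changes
```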

    Transformers in Machine Learning: Literature Review

    In this study, the researchers review methods in Transformer-based machine learning. Transformers are neural network architectures that take sequences as input and are widely used in studies on a variety of objects. The Transformer is a deep learning architecture that can be modified, and it is a mechanism that learns contextual relationships between words. Transformers have been used for text compression in reading material, for recognizing chemical images with an accuracy of 96%, and for detecting a person's emotions, for example in social media conversations on Facebook with happy, sad, and angry categories. Figure 1 illustrates the encoder and decoder process from input to output. The purpose of this study is solely to review literature from various journals that discuss Transformers, presenting for each work the subject or dataset, the data analysis method, the year, and the accuracy achieved. From the methods presented, the review draws conclusions about the highest accuracies reported and the opportunities for further research.
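    Since the review centers on the mechanism that learns contextual relationships between tokens, a minimal NumPy sketch of scaled dot-product self-attention (the standard formulation from the Transformer literature, not code from any reviewed paper) may help make the encoder/decoder discussion concrete:

```python
# Minimal NumPy sketch of scaled dot-product self-attention, the core
# operation of the Transformer encoder and decoder discussed in the review.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x: np.ndarray, wq, wk, wv) -> np.ndarray:
    """x: (seq_len, d_model); wq/wk/wv: (d_model, d_k) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token affinities
    return softmax(scores) @ v                # contextualized representations

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))                        # 5 tokens, d_model = 16
wq, wk, wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(tokens, wq, wk, wv).shape)          # (5, 8)
```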

    A Review of Intelligent Music Generation Systems

    With the introduction of ChatGPT, the public's perception of AI-generated content (AIGC) has begun to reshape. Artificial intelligence has significantly reduced the barrier to entry for non-professionals in creative endeavors, enhancing the efficiency of content creation. Recent advancements have brought significant improvements in the quality of symbolic music generation, enabled by modern generative algorithms that extract patterns implicit in a piece of music based on rule constraints or a musical corpus. Nevertheless, existing literature reviews tend to present a conventional and conservative perspective on future development trajectories, with a notable absence of thorough benchmarking of generative models. This paper provides a survey and analysis of recent intelligent music generation techniques, outlining their respective characteristics and discussing existing methods for evaluation. Additionally, the paper compares the characteristics of music generation techniques in the East and the West and analyses the field's development prospects.

    MARBLE: Music Audio Representation Benchmark for Universal Evaluation

    In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels: acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment, as baselines, of the representations of all open-source pre-trained models developed on music recordings. In addition, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues for the datasets. Results suggest that recently proposed large-scale pre-trained musical language models perform best on most tasks, with room for further improvement. The leaderboard and toolkit repository are published at this https URL to promote future music AI research.
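    The abstract does not spell out the evaluation protocol, but a common way to assess frozen pre-trained representations — and a plausible reading of "a fair and standard assessment of representations ... as baselines" — is to probe fixed embeddings with a lightweight classifier. The sketch below illustrates that general pattern only; the embedding function, its dimensionality, and the downstream task are hypothetical stand-ins:

```python
# Sketch of a probing-style evaluation (an assumption about the protocol, not
# MARBLE's exact code): freeze a pre-trained music model, extract clip-level
# embeddings, and score a downstream task with a lightweight linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def extract_embedding(audio_clip: np.ndarray) -> np.ndarray:
    """Placeholder for a frozen pre-trained model's clip-level embedding."""
    rng = np.random.default_rng(abs(hash(audio_clip.tobytes())) % (2**32))
    return rng.normal(size=768)

clips = [np.random.rand(16000 * 10) for _ in range(200)]   # dummy 10 s clips
labels = np.random.randint(0, 4, size=200)                 # e.g. a genre task

X = np.stack([extract_embedding(c) for c in clips])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # frozen features, light probe
print("probe accuracy:", accuracy_score(y_te, probe.predict(X_te)))
```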

    μŒμ•…μ  μš”μ†Œμ— λŒ€ν•œ 쑰건뢀 μƒμ„±μ˜ κ°œμ„ μ— κ΄€ν•œ 연ꡬ: ν™”μŒκ³Ό ν‘œν˜„μ„ μ€‘μ‹¬μœΌλ‘œ

    Doctoral dissertation -- Graduate School of Seoul National University: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Digital Information Convergence), 2023. 2. 이ꡐꡬ. Conditional generation of musical components (CGMC) creates part of a piece of music based on partial musical components such as a melody or chords. CGMC is beneficial for discovering complex relationships among musical attributes, and it can assist non-experts who face difficulties in making music. However, recent CGMC studies still face two challenges in terms of generation quality and model controllability. First, the structure of the generated music is not robust. Second, only a limited range of musical factors and tasks has been examined as targets for flexible control of generation. In this thesis, we aim to mitigate these two challenges to improve CGMC systems. For musical structure, we focus on intuitive modeling of musical hierarchy to help the model explicitly learn musically meaningful dependencies. To this end, we utilize alignment paths between the raw music data and musical units such as notes or chords. For musical creativity, we facilitate smooth control of novel musical attributes using latent representations, and we attempt to achieve disentangled representations of the intended factors by regularizing them with data-driven inductive bias. The thesis verifies the proposed approaches on two representative CGMC tasks, melody harmonization and expressive performance rendering. A variety of experimental results show that the proposed approaches can expand musical creativity while maintaining stable generation quality.
Table of contents:
Chapter 1 Introduction: 1.1 Motivation; 1.2 Definitions; 1.3 Tasks of Interest (1.3.1 Generation Quality; 1.3.2 Controllability); 1.4 Approaches (1.4.1 Modeling Musical Hierarchy; 1.4.2 Regularizing Latent Representations; 1.4.3 Target Tasks); 1.5 Outline of the Thesis
Chapter 2 Background: 2.1 Music Generation Tasks (2.1.1 Melody Harmonization; 2.1.2 Expressive Performance Rendering); 2.2 Structure-enhanced Music Generation (2.2.1 Hierarchical Music Generation; 2.2.2 Transformer-based Music Generation); 2.3 Disentanglement Learning (2.3.1 Unsupervised Approaches; 2.3.2 Supervised Approaches; 2.3.3 Self-supervised Approaches); 2.4 Controllable Music Generation (2.4.1 Score Generation; 2.4.2 Performance Rendering); 2.5 Summary
Chapter 3 Translating Melody to Chord: Structured and Flexible Harmonization of Melody with Transformer: 3.1 Introduction; 3.2 Proposed Methods (3.2.1 Standard Transformer Model (STHarm); 3.2.2 Variational Transformer Model (VTHarm); 3.2.3 Regularized Variational Transformer Model (rVTHarm); 3.2.4 Training Objectives); 3.3 Experimental Settings (3.3.1 Datasets; 3.3.2 Comparative Methods; 3.3.3 Training; 3.3.4 Metrics); 3.4 Evaluation (3.4.1 Chord Coherence and Diversity; 3.4.2 Harmonic Similarity to Human; 3.4.3 Controlling Chord Complexity; 3.4.4 Subjective Evaluation; 3.4.5 Qualitative Results; 3.4.6 Ablation Study); 3.5 Conclusion and Future Work
Chapter 4 Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-supervised Learning: 4.1 Introduction; 4.2 Proposed Methods (4.2.1 Data Representation; 4.2.2 Modeling Musical Hierarchy; 4.2.3 Overall Network Architecture; 4.2.4 Regularizing the Latent Variables; 4.2.5 Overall Objective); 4.3 Experimental Settings (4.3.1 Dataset and Implementation; 4.3.2 Comparative Methods); 4.4 Evaluation (4.4.1 Generation Quality; 4.4.2 Disentangling Latent Representations; 4.4.3 Controllability of Expressive Attributes; 4.4.4 KL Divergence; 4.4.5 Ablation Study; 4.4.6 Subjective Evaluation; 4.4.7 Qualitative Examples; 4.4.8 Extent of Control); 4.5 Conclusion
Chapter 5 Conclusion and Future Work: 5.1 Conclusion; 5.2 Future Work (5.2.1 Deeper Investigation of Controllable Factors; 5.2.2 More Analysis of Qualitative Evaluation Results; 5.2.3 Improving Diversity and Scale of Dataset)
Bibliography; Abstract (in Korean)
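    For the latent regularization described above (for example, rVTHarm's control of chord complexity in Chapter 3), one published way to tie a latent dimension to a musical attribute is an attribute-regularization loss that aligns pairwise orderings of the latent code with orderings of the attribute. The sketch below illustrates that general idea under assumed shapes; it is not the thesis's objective, and the attribute used here (a toy chord-complexity score) is hypothetical:

```python
# Illustrative sketch only (not the thesis code): an attribute-regularization
# loss that encourages one latent dimension to vary monotonically with a
# musical attribute such as chord complexity, so the attribute can later be
# controlled by sliding that dimension at generation time.
import torch
import torch.nn.functional as F

def attribute_regularization(z_dim: torch.Tensor, attribute: torch.Tensor) -> torch.Tensor:
    """z_dim, attribute: shape (batch,). Aligns their pairwise orderings."""
    dz = z_dim.unsqueeze(0) - z_dim.unsqueeze(1)          # pairwise latent gaps
    da = attribute.unsqueeze(0) - attribute.unsqueeze(1)  # pairwise attribute gaps
    # Penalize pairs whose latent ordering disagrees with the attribute ordering.
    return F.l1_loss(torch.tanh(dz), torch.sign(da))

z = torch.randn(16, 8, requires_grad=True)   # hypothetical latent codes
chord_complexity = torch.rand(16)            # e.g. fraction of non-triad chords
reg = attribute_regularization(z[:, 0], chord_complexity)
reg.backward()
print(float(reg))
```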