
    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures.
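
    As a concrete illustration of the log-mel features the review identifies as a dominant representation, the following Python sketch computes them with librosa (the file name and parameter values are placeholders, not taken from the article):

        # Minimal sketch: compute a log-mel spectrogram as an input feature.
        # Assumes librosa and numpy are installed; "example.wav" and the
        # parameter choices below are illustrative only.
        import librosa
        import numpy as np

        y, sr = librosa.load("example.wav", sr=16000)           # mono waveform
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80   # common choices
        )
        log_mel = np.log(mel + 1e-6)                             # log compression
        print(log_mel.shape)                                     # (n_mels, n_frames)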

    Deep Learning Techniques for Music Generation -- A Survey

    This paper is a survey and an analysis of different ways of using deep learning (deep artificial neural networks) to generate musical content. We propose a methodology based on five dimensions for our analysis:
    - Objective: What musical content is to be generated (e.g., melody, polyphony, accompaniment, or counterpoint)? For what destination and for what use: to be performed by a human (in the case of a musical score) or by a machine (in the case of an audio file)?
    - Representation: What are the concepts to be manipulated (e.g., waveform, spectrogram, note, chord, meter, and beat)? What format is to be used (e.g., MIDI, piano roll, or text)? How will the representation be encoded (e.g., scalar, one-hot, or many-hot)?
    - Architecture: What type(s) of deep neural network is (are) to be used (e.g., feedforward network, recurrent network, autoencoder, or generative adversarial network)?
    - Challenge: What are the limitations and open challenges (e.g., variability, interactivity, and creativity)?
    - Strategy: How do we model and control the process of generation (e.g., single-step feedforward, iterative feedforward, sampling, or input manipulation)?
    For each dimension, we conduct a comparative analysis of various models and techniques, and we propose a tentative multidimensional typology. This typology is bottom-up, based on the analysis of many existing deep-learning-based systems for music generation selected from the relevant literature. These systems are described and used to exemplify the various choices of objective, representation, architecture, challenge, and strategy. The last section includes some discussion and some prospects. Comment: 209 pages. This paper is a simplified version of the book: J.-P. Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music Generation, Computational Synthesis and Creative Systems, Springer, 201
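
    To make the representation dimension concrete, here is a small illustrative Python sketch that turns a monophonic note list into a quantized piano-roll matrix with one-hot pitch columns (the note tuple format and values are hypothetical, not from the survey):

        # Sketch: encode (pitch, onset, duration) notes, quantized to
        # sixteenth-note steps, as a piano-roll / one-hot matrix.
        import numpy as np

        N_PITCHES = 128                                  # MIDI pitch range
        notes = [(60, 0, 4), (62, 4, 2), (64, 6, 2)]     # (pitch, onset, duration)

        total_steps = max(onset + dur for _, onset, dur in notes)
        piano_roll = np.zeros((total_steps, N_PITCHES), dtype=np.float32)
        for pitch, onset, dur in notes:
            piano_roll[onset:onset + dur, pitch] = 1.0   # one active pitch per step

        print(piano_roll.shape)                          # (time_steps, 128)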

    Comparison Of Adversarial And Non-Adversarial LSTM Music Generative Models

    Algorithmic music composition is a way of composing musical pieces with minimal to no human intervention. While recurrent neural networks are traditionally applied to many sequence-to-sequence prediction tasks, including successful implementations of music composition, their standard supervised learning approach based on input-to-output mapping leads to a lack of note variety. These models can therefore be seen as potentially unsuitable for tasks such as music generation. Generative adversarial networks learn the generative distribution of data and lead to varied samples. This work implements and compares adversarial and non-adversarial training of recurrent neural network music composers on MIDI data. The resulting music samples are evaluated by human listeners and their preferences are recorded. The evaluation indicates that adversarial training produces more aesthetically pleasing music. Comment: Submitted to a 2023 conference, 20 pages, 13 figures.
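
    For context, the non-adversarial baseline the paper contrasts with GAN training is ordinary supervised next-token prediction. A minimal PyTorch sketch of that setup (not the paper's code; vocabulary size and hyperparameters are placeholders):

        # Supervised LSTM music model: predict the next note token from the
        # previous ones with cross-entropy. Assumes PyTorch; dummy data below.
        import torch
        import torch.nn as nn

        VOCAB = 128                                      # e.g. MIDI pitch tokens

        class NoteLSTM(nn.Module):
            def __init__(self, vocab=VOCAB, emb=64, hidden=256):
                super().__init__()
                self.emb = nn.Embedding(vocab, emb)
                self.lstm = nn.LSTM(emb, hidden, batch_first=True)
                self.out = nn.Linear(hidden, vocab)

            def forward(self, x):
                h, _ = self.lstm(self.emb(x))
                return self.out(h)                       # logits at every step

        model = NoteLSTM()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()

        seq = torch.randint(0, VOCAB, (8, 32))           # dummy batch of note sequences
        logits = model(seq[:, :-1])                      # inputs are tokens 0..T-2
        loss = loss_fn(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
        loss.backward()
        opt.step()                                       # one supervised update

    In the adversarial variant the paper evaluates, a discriminator judging real versus generated sequences would replace this direct cross-entropy objective.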

    ํŠน์„ฑ ์กฐ์ ˆ์ด ๊ฐ€๋Šฅํ•œ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ๊ตฌ์กฐ์  ๋ฉœ๋กœ๋”” ์ƒ์„ฑ

    Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Industrial Engineering, August 2021. Advisor: Jonghun Park (๋ฐ•์ข…ํ—Œ).
    This thesis aims to generate structural melodies using attribute-controllable deep neural networks. The development of music-composing artificial intelligence can inspire professional composers, ease the difficulty of creation, and provide the public with music for the growing range of media content, expanding how music is combined with and used in other media. For a melody generation model to function as a composer, it must control specific desired characteristics. These characteristics include quantifiable attributes, such as pitch level and rhythm density, and chords, which are essential elements that comprise modern popular (pop) music along with melodies. First, this thesis introduces a melody generation model that separately produces rhythm and pitch conditioned on chord progressions. The quantitative evaluation results demonstrate that the melodies produced by the proposed model have a distribution more similar to the dataset than other baseline models. Qualitative analysis reveals the presence of repetition and variation within the generated melodies. Using a subjective human listening test, we conclude that the model successfully produces new melodies that sound pleasant in rhythm and pitch. Four quantifiable attributes are considered: pitch level, pitch variety, rhythm density, and rhythm variety. We improve on a previous study that trains a variational autoencoder (VAE) and a discriminator in an adversarial manner to eliminate attribute information from the encoded latent variable. Rhythm and pitch VAEs are trained separately so that pitch- and rhythm-related attributes can be controlled entirely independently. The experimental results indicate that although the ratio of outputs falling in the intended attribute bin is not high, the model learns the relative order between the bins. Finally, a hierarchical song structure generation model is proposed. A sequence-to-sequence framework is adopted to capture the similar mood between two parts of the same song. The time axis is compressed by applying attention with different lengths of query and key to model the hierarchy of music. The concept of musical contrast is implemented by controlling attributes with relative bin information. The human evaluation results suggest the possibility of solving the problem of generating different structures of the same song with the sequence-to-sequence framework and reveal that the proposed model can create song structures with musical contrasts.
    Korean abstract (translated): This thesis studies methods for generating structural melodies using attribute-controllable deep neural networks. Artificial intelligence that assists composition can give professional composers creative inspiration and ease the pain of creation, and can provide the general public with the music needed as the variety and volume of media content grows, expanding how music is combined with and used in other media. For composing AI to approach the level of a human composer, it must be able to control attributes according to intent. These attributes include not only quantifiable ones such as pitch level and rhythm density but also chords, which together with melody are basic building blocks of music. Attribute-controllable music generation models have been proposed before, but few studies consider long-range structural characteristics and musical contrast, the way a composer writes each part with the overall structure of a song in mind. This thesis first proposes a model, and its training method, that generates rhythm and pitch separately for chord-conditioned melody generation. Quantitative evaluation shows that the proposed method yields output distributions closer to the dataset than baseline models, and qualitative evaluation confirms appropriate repetition and variation in the generated music, leading to the conclusion that the model can create new melodies whose pitch and rhythm both sound pleasant to listeners. Four quantifiable attributes are defined: pitch level, pitch variety, rhythm density, and rhythm variety. Building on prior work that adversarially trains a discriminator to remove attribute information from the latent variable of a variational autoencoder, two separate models are trained so that pitch- and rhythm-related attributes can be controlled fully independently. After dividing attribute values into bins containing equal amounts of data, the trained models show that the ratio of outputs falling exactly in the intended bin is not high, but the correlation is high. Finally, combining the two preceding studies, a method is proposed for generating song structures that are musically similar yet mutually contrasting. A Transformer, which performs well on sequence-to-sequence problems, is taken as the baseline and an attention mechanism is applied. Hierarchical attention is used to reflect the hierarchical structure of music, and an efficient way of computing relative positional embeddings in this setting is presented. To implement musical contrast, adversarial training is carried out to control the four previously defined attributes, using relative bin-comparison information rather than exact bin labels. Listening test results suggest that the problem of generating different structures of the same song can be addressed with the sequence-to-sequence approach and show that the proposed method can generate song structures exhibiting musical contrast.
    Table of contents (page numbers omitted): Chapter 1 Introduction; Chapter 2 Literature Review; Chapter 3 Problem Definition; Chapter 4 Chord-conditioned Melody Generation; Chapter 5 Attribute Controllable Melody Generation; Chapter 6 Hierarchical Song Structure Generation; Chapter 7 Conclusion; Appendices (A. MGEval Results Between the Music of Different Genres, B. MGEval Results of CMT and Baseline Models, C. Samples Generated by CMT); Bibliography; Korean Abstract.
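
    To illustrate the adversarial attribute-removal idea described above, here is a minimal PyTorch sketch (not the thesis code; sizes, names, and the encoder that would produce z are placeholders): a discriminator tries to predict the attribute bin from the latent z, and the encoder is penalized when it succeeds, pushing attribute information out of z.

        # Adversarial attribute removal from a VAE latent (illustrative only).
        import torch
        import torch.nn as nn

        LATENT, N_BINS = 32, 8
        discriminator = nn.Sequential(
            nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, N_BINS)
        )
        ce = nn.CrossEntropyLoss()

        def discriminator_loss(z, attr_bin):
            # Train the discriminator to recover the attribute bin from z.
            return ce(discriminator(z.detach()), attr_bin)

        def encoder_adversarial_loss(z, attr_bin):
            # Train the encoder to fool the discriminator; negating the
            # cross-entropy is one simple choice of adversarial objective.
            return -ce(discriminator(z), attr_bin)

        z = torch.randn(16, LATENT)                  # stand-in for encoder outputs
        attr_bin = torch.randint(0, N_BINS, (16,))   # quantized attribute labels
        print(discriminator_loss(z, attr_bin).item(),
              encoder_adversarial_loss(z, attr_bin).item())

    In a controllable setup of this kind, the desired attribute bin would then be supplied to the decoder alongside z, so that the attribute is set by the condition rather than leaked through the latent.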

    Music-STAR: a Style Translation system for Audio-based Rearrangement

    Music style translation has recently gained attention among music processing studies. It aims to generate variations of existing music pieces by altering the style-variant characteristics of the original piece, while content such as the melody remains unchanged. These alterations could involve timbre translation, reharmonization, or music rearrangement. In this thesis, we address music rearrangement, focusing on instrumentation, by processing waveforms of two-instrument pieces. Previous studies have achieved promising results utilizing time-frequency and symbolic music representations. Music translation on raw audio has also been investigated using single-instrument pieces. Although processing raw audio is more challenging, it embodies more detailed information about the performance, timbre, and dynamics of a music piece. To this end, we introduce Music-STAR, the first audio-based model that can transform the instruments of a multi-track piece into another set of instruments, resulting in a rearranged piece.

    Generative models for music using transformer architectures

    This thesis focuses on the growth and impact of Transformer architectures, originally developed for natural language processing tasks, applied to audio generation. We think of music, with its notes, chords, and volumes, as a language: the symbolic representation of music can be regarded as analogous to human language. A brief history of sound synthesis, which provides the foundation for modern AI-generated music models, is given. The most recent work in AI-generated audio is studied in detail, and instances of AI-generated music are discussed in many contexts. Deep learning models and their applications to real-world problems are among the key subjects covered. The main areas of interest are transformer-based audio generation, including the training procedure, encoding and decoding techniques, and post-processing stages. Transformers have several key advantages, including long-term consistency and the ability to create minute-long audio compositions. Numerous studies on the various representations of music are reviewed, including how neural network and deep learning techniques can be applied to symbolic melodies, musical arrangement, style transfer, and sound production. This thesis largely focuses on transformer models, but it also recognises the importance of other AI-based generative models, including GANs. Overall, this thesis advances generative models for music composition, provides a thorough understanding of transformer design, and shows the possibilities of AI-generated sound synthesis by emphasising the most recent developments.
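
    As a concrete companion to this discussion, here is an illustrative Python sketch (assuming PyTorch; sizes, the token vocabulary, and the start token are placeholders, not from the thesis) of a small decoder-only transformer over symbolic music tokens with greedy autoregressive decoding:

        # Tiny causal transformer for music-token generation (illustrative only).
        import torch
        import torch.nn as nn

        VOCAB, D, CTX = 512, 128, 256

        class TinyMusicTransformer(nn.Module):
            def __init__(self):
                super().__init__()
                self.tok = nn.Embedding(VOCAB, D)
                self.pos = nn.Embedding(CTX, D)
                layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
                self.blocks = nn.TransformerEncoder(layer, num_layers=2)
                self.head = nn.Linear(D, VOCAB)

            def forward(self, x):
                T = x.size(1)
                h = self.tok(x) + self.pos(torch.arange(T, device=x.device))
                # Additive causal mask: -inf above the diagonal blocks future tokens.
                causal = torch.triu(
                    torch.full((T, T), float("-inf"), device=x.device), diagonal=1
                )
                return self.head(self.blocks(h, mask=causal))    # next-token logits

        model = TinyMusicTransformer().eval()
        tokens = torch.tensor([[1]])                 # hypothetical start token
        with torch.no_grad():
            for _ in range(16):                      # greedy autoregressive decoding
                logits = model(tokens)[:, -1]
                tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=1)
        print(tokens)

    Attention over the whole generated context is what such models rely on for the long-term consistency noted above; sampling from the logits instead of taking the argmax is a common way to add variety.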