628 research outputs found

    AI and Tempo Estimation: A Review

    Full text link
    The author's goal in this paper is to explore how artificial intelligence (AI) has been utilised to inform our understanding of and ability to estimate at scale a critical aspect of musical creativity: musical tempo. The central importance of tempo to musical creativity can be seen in how it is used to express specific emotions (Eerola and Vuoskoski 2013), suggest particular musical styles (Li and Chan 2011), influence perception of expression (Webster and Weir 2005) and mediate the urge to move one's body in time to the music (Burger et al. 2014). Traditional tempo estimation methods typically detect signal periodicities that reflect the underlying rhythmic structure of the music, often using some form of autocorrelation of the amplitude envelope (Lartillot and Toiviainen 2007). Recently, AI-based methods utilising convolutional or recurrent neural networks (CNNs, RNNs) on spectral representations of the audio signal have enjoyed significant improvements in accuracy (Aarabi and Peeters 2022). Common AI-based techniques include those based on probability (e.g., Bayesian approaches, hidden Markov models (HMM)), classification and statistical learning (e.g., support vector machines (SVM)), and artificial neural networks (ANNs) (e.g., self-organising maps (SOMs), CNNs, RNNs, deep learning (DL)). The aim here is to provide an overview of some of the more common AI-based tempo estimation algorithms and to shine a light on notable benefits and potential drawbacks of each. Limitations of AI in this field in general are also considered, as is the capacity for such methods to account for idiosyncrasies inherent in tempo perception, i.e., how well AI-based approaches are able to think and act like humans. (Comment: 9 pages)
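
    As a concrete illustration of the traditional approach the review contrasts with AI-based methods, the sketch below estimates tempo from the autocorrelation of an onset-strength (amplitude) envelope. It is a minimal, generic example assuming librosa, a placeholder file name, and an arbitrary 60-180 BPM search range; it is not code from the paper.

```python
# Minimal sketch: tempo from autocorrelation of the onset-strength envelope.
# Assumptions: librosa is available, "track.wav" is a placeholder file name,
# and 60-180 BPM is an illustrative search range.
import numpy as np
import librosa

y, sr = librosa.load("track.wav")
hop = 512
onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)

# Autocorrelate the envelope: peaks correspond to dominant rhythmic periods.
ac = librosa.autocorrelate(onset_env, max_size=len(onset_env))

fps = sr / hop                      # envelope frames per second
lags = np.arange(1, len(ac))        # skip lag 0 (trivial maximum)
bpms = 60.0 * fps / lags            # convert each candidate lag to BPM

# Keep only lags that map to a plausible tempo range, then pick the strongest peak.
mask = (bpms >= 60) & (bpms <= 180)
best_lag = lags[mask][np.argmax(ac[1:][mask])]
print(f"Estimated tempo: {60.0 * fps / best_lag:.1f} BPM")
```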

    Deep Neural Network-Based Automatic Music Lead Sheet Transcription and Melody Similarity Assessment

    Get PDF
    Doctoral dissertation, Seoul National University, College of Engineering, Department of Industrial Engineering, February 2023 (advisor: 이경식). Since the composition, arrangement, and distribution of music became convenient thanks to the digitization of the music industry, the number of newly supplied music recordings is increasing. Recently, due to platform environments being established whereby anyone can become a creator, user-created music such as original songs, cover songs, and remixes is being distributed through YouTube and TikTok. With such a large volume of musical recordings, the demand to transcribe music into sheet music has always existed for musicians. However, it requires musical knowledge and is time-consuming. This thesis studies automatic lead sheet transcription using deep neural networks. The development of transcription artificial intelligence (AI) can greatly reduce the time and cost for people in the music industry to find or transcribe sheet music. In addition, since the conversion from music sources to the form of digital music is possible, the applications could be expanded, such as music plagiarism detection and music composition AI. The thesis first proposes a model recognizing chords from audio signals. Chord recognition is an important task in music information retrieval since chords are highly abstract and descriptive features of music. We utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Through an attention map analysis, we visualize how attention is performed. It turns out that the model is able to divide segments of chords by utilizing the adaptive receptive field of the attention mechanism. This thesis proposes a note-level singing melody transcription model using sequence-to-sequence transformers. Overlapping decoding is introduced to solve the problem of the context between segments being broken. Applying pitch augmentation and adding a noisy dataset with data cleansing turns out to be effective in preventing overfitting and generalizing the model performance. Ablation studies demonstrate the effects of the proposed techniques in note-level singing melody transcription, both quantitatively and qualitatively. The proposed model outperforms other models in note-level singing melody transcription performance for all the metrics considered. Finally, subjective human evaluation demonstrates that the results of the proposed models are perceived as more accurate than the results of a previous study. Utilizing the above research results, we introduce the entire process of automatic music lead sheet transcription. By combining various music information recognized from audio signals, we show that it is possible to transcribe lead sheets that express the core of popular music. Furthermore, we compare the results with lead sheets transcribed by musicians. Finally, we propose a melody similarity assessment method based on self-supervised learning by applying the automatic lead sheet transcription. We present convolutional neural networks that express the melody of lead sheet transcription results in embedding space. To apply self-supervised learning, we introduce methods of generating training data by musical data augmentation techniques. Furthermore, a loss function is presented to utilize the training data.
Experimental results demonstrate that the proposed model is able to detect similar melodies of popular music from plagiarism and cover song cases.
Table of contents: Chapter 1, Introduction; Chapter 2, Literature Review (attention mechanisms and transformers, chord recognition, note-level singing melody transcription, musical key estimation, beat tracking, music plagiarism detection and cover song identification, deep metric learning and triplet loss); Chapter 3, Problem Definition (lead sheet transcription, melody similarity assessment); Chapter 4, A Bi-directional Transformer for Musical Chord Recognition; Chapter 5, Note-level Singing Melody Transcription; Chapter 6, Automatic Music Lead Sheet Transcription; Chapter 7, Melody Similarity Assessment with Self-supervised Convolutional Neural Networks; Chapter 8, Conclusion.
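
The melody similarity component described above combines a convolutional encoder, musical data augmentation, and a deep metric learning loss. The sketch below illustrates that general recipe with a toy PyTorch model and a triplet loss; the input shape, network size, and wrap-around transposition augmentation are assumptions for illustration, not the thesis's actual configuration.

```python
# Toy sketch of self-supervised melody-embedding training with a triplet loss.
# Shapes, layer sizes, and the augmentation are illustrative assumptions.
import torch
import torch.nn as nn

class MelodyEncoder(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x):                       # x: (batch, 1, pitch_bins, time_steps)
        h = self.conv(x).flatten(1)
        return nn.functional.normalize(self.fc(h), dim=-1)

def transpose_aug(roll, semitones=2):
    """Musical data augmentation: shift the pitch axis (wrap-around kept for simplicity)."""
    return torch.roll(roll, shifts=semitones, dims=-2)

encoder = MelodyEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

anchor = torch.rand(8, 1, 48, 256)              # dummy melody piano rolls
positive = transpose_aug(anchor)                # same melodies, transposed
negative = torch.rand(8, 1, 48, 256)            # unrelated melodies

loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
opt.zero_grad()
loss.backward()
opt.step()
```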

    Motivic Pattern Classification of Music Audio Signals Combining Residual and LSTM Networks

    Get PDF
    Motivic pattern classification from music audio recordings is a challenging task, all the more so in the case of a cappella flamenco cantes, characterized by complex melodic variations, pitch instability, timbre changes, extreme vibrato oscillations, microtonal ornamentations, and noisy recording conditions. Convolutional Neural Networks (CNNs) have proven to be very effective algorithms in image classification. Recent work in large-scale audio classification has shown that CNN architectures, originally developed for image problems, can be applied successfully to audio event recognition and classification with little or no modification to the networks. In this paper, CNN architectures are tested on a more nuanced problem: flamenco cantes intra-style classification using small motivic patterns. A new architecture is proposed that uses the advantages of residual CNNs as feature extractors and a bidirectional LSTM layer to exploit the sequential nature of musical audio data. We present a full end-to-end pipeline for audio music classification that includes a sequential pattern mining technique and a contour simplification method to extract relevant motifs from audio recordings. Mel-spectrograms of the extracted motifs are then used as the input for the different architectures tested. We investigate the usefulness of motivic patterns for the automatic classification of music recordings and the effect of the length of the audio and corpus size on the overall classification accuracy. Results show a relative accuracy improvement of up to 20.4% when CNN architectures are trained using acoustic representations from motivic patterns.
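
    The following is a rough PyTorch sketch of the general idea of pairing a convolutional feature extractor with a bidirectional LSTM over the time axis of a mel-spectrogram; it is not the paper's exact residual architecture, and the layer sizes and number of classes are placeholder assumptions.

```python
# Generic CNN + bidirectional LSTM classifier over mel-spectrograms.
# Layer sizes and class count are placeholder assumptions.
import torch
import torch.nn as nn

class ConvBiLSTMClassifier(nn.Module):
    def __init__(self, n_mels=80, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 64 * (n_mels // 4)            # channels x pooled mel bins
        self.bilstm = nn.LSTM(feat_dim, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, n_classes)

    def forward(self, mel):                      # mel: (batch, 1, n_mels, time)
        h = self.cnn(mel)                        # (batch, 64, n_mels/4, time/4)
        h = h.permute(0, 3, 1, 2).flatten(2)     # (batch, time/4, feat_dim)
        out, _ = self.bilstm(h)
        return self.head(out[:, -1])             # classify from the last time step

logits = ConvBiLSTMClassifier()(torch.rand(2, 1, 80, 256))
print(logits.shape)                              # torch.Size([2, 4])
```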

    Mustango: Toward Controllable Text-to-Music Generation

    Full text link
    With recent advancements in text-to-audio and text-to-music based on latent diffusion models, the quality of generated content has been reaching new heights. The controllability of musical aspects, however, has not been explicitly explored in text-to-music systems yet. In this paper, we present Mustango, a music-domain-knowledge-inspired, diffusion-based text-to-music system that expands the Tango text-to-audio model. Mustango aims to control the generated music not only with general text captions, but also with richer captions that can include specific instructions related to chords, beats, tempo, and key. As part of Mustango, we propose MuNet, a Music-Domain-Knowledge-Informed UNet sub-module that integrates these music-specific features, which we predict from the text prompt, together with the general text embedding, into the diffusion denoising process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music audio and using state-of-the-art Music Information Retrieval methods to extract the music features, which are then appended to the existing descriptions in text format. We release the resulting MusicBench dataset, which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and that its controllability through music-specific text prompts greatly outperforms other models in terms of desired chords, beat, key, and tempo, on multiple datasets.
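
    To make the caption-enrichment idea concrete, the sketch below appends an automatically extracted tempo and a crude key estimate to an existing caption. Mustango relies on state-of-the-art MIR extractors; the librosa-based tempo and chroma-argmax tonic used here are simplified stand-ins, and the file name and caption are placeholders.

```python
# Simplified stand-in for caption enrichment with extracted music features.
# librosa's beat tracker and a chroma-argmax "tonic" are used only for illustration.
import numpy as np
import librosa

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def enrich_caption(caption, audio_path):
    y, sr = librosa.load(audio_path)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    tempo = float(np.atleast_1d(tempo)[0])
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    tonic = PITCH_CLASSES[int(np.argmax(chroma.mean(axis=1)))]   # crude tonic guess
    return f"{caption} The tempo is around {tempo:.0f} BPM. The key centre is {tonic}."

print(enrich_caption("A mellow jazz piece with brushed drums.", "clip.wav"))
```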

    Final Research Report on Auto-Tagging of Music

    Get PDF
    Deliverable D4.7 reports the work achieved by IRCAM up to M36 on the "auto-tagging of music". The deliverable is a research report. The software libraries resulting from the research have been integrated into the Fincons/HearDis! Music Library Manager or are used by TU Berlin; the final software libraries are described in D4.5. The research work on auto-tagging has concentrated on four aspects: 1) Further improving IRCAM's machine-learning system ircamclass. This has been done by developing the new MASSS audio features and by integrating audio augmentation and audio segmentation into ircamclass. The system has then been applied to train the HearDis! "soft" features (Vocals-1, Vocals-2, Pop-Appeal, Intensity, Instrumentation, Timbre, Genre, Style). This is described in Part 3. 2) Developing two sets of "hard" features (i.e., features related to musical or musicological concepts) as specified by HearDis! (for integration into the Fincons/HearDis! Music Library Manager) and TU Berlin (as input for the prediction model of the GMBI attributes). Such features are either derived from previously estimated higher-level concepts (such as structure, key, or chord succession) or obtained by developing new signal-processing algorithms (such as HPSS or main melody estimation). This is described in Part 4. 3) Developing audio features to characterize the audio quality of a music track. The goal is to describe the quality of the audio independently of its apparent encoding. This is then used to estimate audio degradation or music decade, and ultimately to ensure that playlists contain tracks with similar audio quality. This is described in Part 5. 4) Developing innovative algorithms to extract specific audio features to improve music mixes. So far, innovative techniques (based on various blind audio source separation algorithms and convolutional neural networks) have been developed for singing voice separation, singing voice segmentation, music structure boundary estimation, and DJ cue-region estimation. This is described in Part 6. (EC/H2020/688122/EU: Artist-to-Business-to-Business-to-Consumer Audio Branding System, ABC DJ)
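
    One of the signal-processing building blocks mentioned for the "hard" features is harmonic-percussive source separation (HPSS). The snippet below is only a generic librosa illustration of HPSS with placeholder file names, not IRCAM's implementation.

```python
# Generic median-filter-based HPSS with librosa; file names are placeholders.
import librosa
import soundfile as sf

y, sr = librosa.load("track.wav")
y_harmonic, y_percussive = librosa.effects.hpss(y)   # split into harmonic / percussive parts
sf.write("track_harmonic.wav", y_harmonic, sr)
sf.write("track_percussive.wav", y_percussive, sr)
```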

    Automatic characterization and generation of music loops and instrument samples for electronic music production

    Get PDF
    Repurposing audio material to create new music - also known as sampling - was a foundation of electronic music and is a fundamental component of this practice. Currently, large-scale databases of audio offer vast collections of audio material for users to work with. The navigation on these databases is heavily focused on hierarchical tree directories. Consequently, sound retrieval is tiresome and often identified as an undesired interruption in the creative process. We address two fundamental methods for navigating sounds: characterization and generation. Characterizing loops and one-shots in terms of instruments or instrumentation allows for organizing unstructured collections and a faster retrieval for music-making. The generation of loops and one-shot sounds enables the creation of new sounds not present in an audio collection through interpolation or modification of the existing material. To achieve this, we employ deep-learning-based data-driven methodologies for classification and generation.
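
    The retrieval scenario motivated above can be illustrated with a deliberately simple baseline: summarise each loop with an MFCC statistics vector and query nearest neighbours. This is a generic sketch with placeholder file names, not the characterization models actually developed in the thesis.

```python
# Baseline loop retrieval: MFCC summary vectors + nearest-neighbour search.
# File names are placeholders; real systems would use learned classifiers/embeddings.
import numpy as np
import librosa
from sklearn.neighbors import NearestNeighbors

def loop_features(path):
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])   # 40-dim summary

paths = ["loop_kick.wav", "loop_bass.wav", "loop_pad.wav"]          # placeholder collection
index = NearestNeighbors(n_neighbors=2).fit(np.stack([loop_features(p) for p in paths]))

_, nearest = index.kneighbors(loop_features("query_loop.wav")[None, :])
print([paths[i] for i in nearest[0]])
```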

    TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

    Full text link
    In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples. (Comment: 17 pages, published as a conference paper at ICLR 2019)
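
    The claim about the CQT's approximate pitch equivariance can be checked directly: shifting the audio by two semitones should approximately translate the CQT image by two bins along the frequency axis. The snippet below is a generic librosa check with a placeholder file name, unrelated to the TimbreTron codebase.

```python
# Check approximate pitch equivariance of the CQT: a 2-semitone pitch shift
# should roughly equal a 2-bin vertical translation of the CQT magnitude image.
import numpy as np
import librosa

y, sr = librosa.load("note.wav")
bins_per_octave = 12

C = np.abs(librosa.cqt(y, sr=sr, bins_per_octave=bins_per_octave))

y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
C_up = np.abs(librosa.cqt(y_up, sr=sr, bins_per_octave=bins_per_octave))

diff = np.mean(np.abs(C_up[2:] - C[:-2]))   # compare after aligning bins
print(f"Mean difference after aligning bins: {diff:.4f}")
```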