423 research outputs found

    Automatic Drum Transcription Using Bi-directional Recurrent Neural Networks

    Automatic drum transcription (ADT) systems attempt to generate a symbolic music notation for percussive instruments in audio recordings. Neural networks have already been shown to perform well in fields related to ADT, such as source separation and onset detection, due to their utilisation of time-series data in classification. We propose the use of neural networks for ADT in order to exploit their ability to capture a complex configuration of features associated with individual or combined drum classes. In this paper we present a bi-directional recurrent neural network for offline detection of percussive onsets from specified drum classes and a recurrent neural network suitable for online operation. In both systems, a separate network is trained to identify onsets for each drum class under observation; that is, kick drum, snare drum, hi-hats, and combinations thereof. We perform four evaluations utilising the IDMT-SMT-Drums and ENST-Drums minus-one datasets, which cover solo percussion and polyphonic audio respectively. The results demonstrate the effectiveness of the presented methods for solo percussion and a capacity for identifying snare drums, which are historically the most difficult drum class to detect.
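    As a rough illustration of the per-class design described in the abstract, the sketch below trains one bi-directional recurrent network per drum class to emit a framewise onset activation. It is a minimal sketch in PyTorch, not the paper's implementation: the LSTM cell, layer sizes, and class names are assumptions.

```python
import torch
import torch.nn as nn

class DrumOnsetBRNN(nn.Module):
    """Per-class onset detector: a bi-directional RNN maps a spectrogram
    frame sequence to a per-frame onset activation in [0, 1]."""
    def __init__(self, n_bins=84, hidden=50):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)  # forward + backward states

    def forward(self, spec):                 # spec: (batch, frames, n_bins)
        h, _ = self.rnn(spec)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, frames)

# A separate network per drum class under observation, as in the paper;
# onsets are then obtained by peak picking each activation function.
models = {name: DrumOnsetBRNN() for name in ("kick", "snare", "hihat")}
```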

    Weakly-Supervised Temporal Localization via Occurrence Count Learning

    We propose a novel model for temporal detection and localization which allows the training of deep neural networks using only counts of event occurrences as training labels. This powerful weakly-supervised framework alleviates the burden of the imprecise and time-consuming process of annotating event locations in temporal data. Unlike existing methods, in which localization is explicitly achieved by design, our model learns localization implicitly as a byproduct of learning to count instances. This unique feature is a direct consequence of the model's theoretical properties. We validate the effectiveness of our approach in a number of experiments (drum hit and piano onset detection in audio, digit detection in images) and demonstrate performance comparable to that of fully-supervised state-of-the-art methods, despite much weaker training requirements. (Accepted at ICML 2019.)
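    To make the count-only supervision concrete, here is a minimal sketch assuming a recurrent frame-activation model and a plain squared-error count loss (the paper's exact architecture and objective differ): the network is trained only against the summed activations, and event locations are later read off the per-frame activations by peak picking.

```python
import torch
import torch.nn as nn

class CountSupervisedDetector(nn.Module):
    """Per-frame event activations trained only against the total
    occurrence count; localization emerges as a byproduct."""
    def __init__(self, n_feat=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, frames, n_feat)
        h, _ = self.rnn(x)
        act = torch.sigmoid(self.head(h)).squeeze(-1)  # (batch, frames)
        return act, act.sum(dim=1)           # activations, predicted count

model = CountSupervisedDetector()
x = torch.randn(8, 200, 64)                          # dummy feature batch
act, pred_count = model(x)
true_count = torch.randint(0, 10, (8,)).float()      # the only labels used
loss = nn.functional.mse_loss(pred_count, true_count)
loss.backward()
# At inference, localize events by peak picking `act` along the frame axis.
```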

    An Industry-Driven Genre Classification Application Using Natural Language Processing

    With the advent of digitized music, many online streaming companies such as Spotify have capitalized on listeners' need for a common streaming platform. An essential component of such a platform is the recommender system that suggests related tracks, albums, and artists to the constituent user base. In order to sustain such a recommender system, labeling data to indicate which genre each track belongs to is essential. Most recent academic publications that deal with music genre classification focus on deep neural networks developed and applied within the music genre classification domain. This thesis instead applies highly sophisticated techniques from the text classification domain, such as Hierarchical Attention Networks, to classify tracks of different genres. To do this, the music is first separated into different tracks (drums, vocals, bass, and accompaniment) and converted into symbolic text data. Thanks to the distributed machine learning system used in this thesis (over five computers, each with a graphics processing unit more powerful than a GTX 1070), it is capable of classifying contemporary genres with an impressive peak accuracy of over 93% when comparing the results with those of competing classifiers. It is also argued that, through the use of text classification, the expert domain knowledge of musicians and people involved with musicological techniques can be attracted to improving recommender systems within the music information retrieval research domain.
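    A minimal sketch of the symbolic-text idea, under assumed formats: note events from each separated stem become word-like tokens, and stems act as "sentences" of a track "document", mirroring the word/sentence/document hierarchy a Hierarchical Attention Network consumes. The event tuple layout and token vocabulary are illustrative, not the thesis's actual encoding.

```python
# Hypothetical note-event tuples: (stem, MIDI pitch, quantized duration).

def events_to_words(events):
    """Map the note events of one stem to word-like tokens."""
    return [f"{stem}_p{pitch}_d{dur}" for stem, pitch, dur in events]

def track_to_document(stems):
    """One 'sentence' per stem, one 'document' per track, ready for a
    hierarchical text classifier such as a HAN."""
    return [events_to_words(ev) for ev in stems.values()]

stems = {
    "drums": [("drums", 36, 1), ("drums", 38, 1)],
    "bass":  [("bass", 40, 2)],
}
print(track_to_document(stems))
# [['drums_p36_d1', 'drums_p38_d1'], ['bass_p40_d2']]
```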

    Deep Neural Network-Based Automatic Music Lead Sheet Transcription and Melody Similarity Assessment

    Ph.D. dissertation, Department of Industrial Engineering, College of Engineering, Seoul National University Graduate School, February 2023. Advisor: 이경식. Since the composition, arrangement, and distribution of music became convenient thanks to the digitization of the music industry, the number of newly supplied music recordings is increasing. Recently, with platform environments established whereby anyone can become a creator, user-created music such as original songs, cover songs, and remixes is being distributed through YouTube and TikTok. With such a large volume of musical recordings, the demand to transcribe music into sheet music has always existed among musicians. However, transcription requires musical knowledge and is time-consuming. This thesis studies automatic lead sheet transcription using deep neural networks. The development of transcription artificial intelligence (AI) can greatly reduce the time and cost for people in the music industry to find or transcribe sheet music. In addition, since audio recordings can be converted into digital sheet-music form, applications such as music plagiarism detection and music composition AI become possible. The thesis first proposes a model recognizing chords from audio signals. Chord recognition is an important task in music information retrieval since chords are highly abstract and descriptive features of music. We utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Through an attention map analysis, we visualize how attention is performed. It turns out that the model is able to divide segments of chords by utilizing the adaptive receptive field of the attention mechanism. The thesis then proposes a note-level singing melody transcription model using sequence-to-sequence transformers. Overlapping decoding is introduced to solve the problem of the context between segments being broken (a sketch of this idea follows the table of contents below). Applying pitch augmentation and adding a noisy dataset with data cleansing turn out to be effective in preventing overfitting and in generalizing the model performance. Ablation studies demonstrate the effects of the proposed techniques in note-level singing melody transcription, both quantitatively and qualitatively. The proposed model outperforms other models in note-level singing melody transcription performance for all the metrics considered. Finally, subjective human evaluation demonstrates that the results of the proposed models are perceived as more accurate than the results of a previous study. Utilizing the above research results, we introduce the entire process of automatic music lead sheet transcription. By combining various kinds of music information recognized from audio signals, we show that it is possible to transcribe lead sheets that express the core of popular music. Furthermore, we compare the results with lead sheets transcribed by musicians. Finally, we propose a melody similarity assessment method based on self-supervised learning, applying the automatic lead sheet transcription. We present convolutional neural networks that embed the melodies of lead sheet transcription results in an embedding space. To apply self-supervised learning, we introduce methods of generating training data via musical data augmentation techniques. Furthermore, a loss function is presented to utilize the training data.
Experimental results demonstrate that the proposed model is able to detect similar melodies of popular music in plagiarism and cover song cases.

Table of Contents
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Objectives
  1.3 Thesis Outline
Chapter 2 Literature Review
  2.1 Attention Mechanism and Transformers
    2.1.1 Attention-based Models
    2.1.2 Transformers with Musical Event Sequence
  2.2 Chord Recognition
  2.3 Note-level Singing Melody Transcription
  2.4 Musical Key Estimation
  2.5 Beat Tracking
  2.6 Music Plagiarism Detection and Cover Song Identification
  2.7 Deep Metric Learning and Triplet Loss
Chapter 3 Problem Definition
  3.1 Lead Sheet Transcription
    3.1.1 Chord Recognition
    3.1.2 Singing Melody Transcription
    3.1.3 Post-processing for Lead Sheet Representation
  3.2 Melody Similarity Assessment
Chapter 4 A Bi-directional Transformer for Musical Chord Recognition
  4.1 Methodology
    4.1.1 Model Architecture
    4.1.2 Self-attention in Chord Recognition
  4.2 Experiments
    4.2.1 Datasets
    4.2.2 Preprocessing
    4.2.3 Evaluation Metrics
    4.2.4 Training
  4.3 Results
    4.3.1 Quantitative Evaluation
    4.3.2 Attention Map Analysis
Chapter 5 Note-level Singing Melody Transcription
  5.1 Methodology
    5.1.1 Monophonic Note Event Sequence
    5.1.2 Audio Features
    5.1.3 Model Architecture
    5.1.4 Autoregressive Decoding and Monophonic Masking
    5.1.5 Overlapping Decoding
    5.1.6 Pitch Augmentation
    5.1.7 Adding Noisy Dataset with Data Cleansing
  5.2 Experiments
    5.2.1 Dataset
    5.2.2 Experiment Configurations
    5.2.3 Evaluation Metrics
    5.2.4 Comparison Models
    5.2.5 Human Evaluation
  5.3 Results
    5.3.1 Ablation Study
    5.3.2 Note-level Transcription Model Comparison
    5.3.3 Transcription Performance Distribution Analysis
    5.3.4 Fundamental Frequency (F0) Metric Evaluation
  5.4 Qualitative Analysis
    5.4.1 Visualization of Ablation Study
    5.4.2 Spectrogram Analysis
    5.4.3 Human Evaluation
Chapter 6 Automatic Music Lead Sheet Transcription
  6.1 Post-processing for Lead Sheet Representation
  6.2 Lead Sheet Transcription Results
Chapter 7 Melody Similarity Assessment with Self-supervised Convolutional Neural Networks
  7.1 Methodology
    7.1.1 Input Data Representation
    7.1.2 Data Augmentation
    7.1.3 Model Architecture
    7.1.4 Loss Function
    7.1.5 Definition of Distance between Songs
  7.2 Experiments
    7.2.1 Dataset
    7.2.2 Training
    7.2.3 Evaluation Metrics
  7.3 Results
    7.3.1 Quantitative Evaluation
    7.3.2 Qualitative Evaluation
Chapter 8 Conclusion
  8.1 Summary and Contributions
  8.2 Limitations and Future Research
Bibliography
Abstract (in Korean)
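    The overlapping decoding mentioned in the abstract above is described only at a high level; the sketch below is one plausible reading, assuming a hypothetical `decode_segment` stand-in for the sequence-to-sequence transformer: overlapping windows are decoded independently, and each note is kept only from the window in whose interior it falls, so context at segment boundaries is never broken.

```python
from dataclasses import dataclass

@dataclass
class Note:
    onset: float  # seconds
    pitch: int    # MIDI number

def overlapping_decode(total_dur, decode_segment, seg_len=10.0, hop=5.0):
    """Decode overlapping windows; keep each note from the window whose
    interior (away from its overlapped edges) contains the onset."""
    overlap = seg_len - hop
    notes, t = [], 0.0
    while True:
        end = min(t + seg_len, total_dur)
        last = end >= total_dur
        lo = 0.0 if t == 0.0 else t + overlap / 2.0
        hi = total_dur if last else end - overlap / 2.0
        notes += [n for n in decode_segment(t, end) if lo <= n.onset < hi]
        if last:
            return sorted(notes, key=lambda n: n.onset)
        t += hop

def toy_decoder(start, end):
    """Stand-in decoder: returns fixed notes inside the requested window."""
    all_notes = [Note(o, 60) for o in (1.0, 4.9, 5.1, 9.9, 12.0)]
    return [n for n in all_notes if start <= n.onset < end]

print([n.onset for n in overlapping_decode(15.0, toy_decoder)])
# [1.0, 4.9, 5.1, 9.9, 12.0] -- each note decoded exactly once
```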

    Deep Learning Approaches for Automatic Drum Transcription

    Drum transcription is the task of transcribing audio or music into drum notation. Drum notation serves as instruction for drummers in playing drums and can also help students learn drum music theory. Unfortunately, transcribing music is not an easy task; a good transcription can usually be produced only by an experienced musician. At the same time, musical notation is beneficial not only for professionals but also for amateurs. This study develops an Automatic Drum Transcription (ADT) application using the segment-and-classify method with deep learning as the classification method. The segment-and-classify method is divided into two steps. First, the segmentation step achieved a score of 76.14% macro F1 after a grid search to tune the parameters. Second, a spectrogram feature is extracted around each detected onset as the input for the classification models. The models are evaluated using multi-objective optimization (MOO) over the macro F1 score and the time consumed for prediction. The results show that the LSTM model outperformed the other models, with MOO scores of 77.42%, 86.97%, and 82.87% on the MDB Drums, IDMT-SMT-Drums, and combined datasets, respectively. The model is then used in the ADT application, which is built with the FastAPI framework and delivers the transcription result as a drum tab.
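    A compact sketch of the segment-and-classify pipeline, assuming librosa for the segmentation step and a hypothetical `classify` callable standing in for the trained LSTM; the study's tuned parameters and exact features are not reproduced.

```python
import numpy as np
import librosa

def segment_and_classify(path, classify, win=0.25):
    """Detect onsets, excerpt a short spectrogram window at each onset,
    and classify the window into a drum class (e.g., 'KD', 'SD', 'HH')."""
    y, sr = librosa.load(path, sr=None)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    events = []
    for t in onsets:                             # segmentation step
        a, b = int(t * sr), int((t + win) * sr)
        spec = librosa.feature.melspectrogram(y=y[a:b], sr=sr)
        spec_db = librosa.power_to_db(spec, ref=np.max)
        events.append((t, classify(spec_db)))    # classification step
    return events  # (onset time, drum class) pairs -> render as a drum tab
```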

    Improving peak picking using multiple time-step loss functions

    The majority of state-of-the-art methods for music information retrieval (MIR) tasks now utilise deep learning methods reliant on minimisation of loss functions such as cross entropy. For tasks that include framewise binary classification (e.g., onset detection, music transcription), classes are derived from output activation functions by identifying points of local maxima, or peaks. However, the operating principles behind peak picking are different to those of the cross entropy loss function, which minimises the absolute difference between the output and target values for a single frame. To generate activation functions more suited to peak picking, we propose two versions of a new loss function that incorporates information from multiple time-steps: 1) multi-individual, which uses multiple individual time-step cross entropies; and 2) multi-difference, which directly compares the difference between sequential time-step outputs. We evaluate the newly proposed loss functions alongside standard cross entropy in the popular MIR tasks of onset detection and automatic drum transcription. The results highlight the effectiveness of these loss functions in the improvement of overall system accuracies for both MIR tasks. Additionally, directly comparing the output from sequential time-steps in the multi-difference approach achieves the highest performance.
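    The abstract does not give exact formulations, so the following is a sketch of one plausible reading of each variant, in PyTorch: "multi-individual" aggregates cross entropies that pair outputs and targets at neighbouring time steps, while "multi-difference" adds a term matching the output's frame-to-frame differences to the target's, encouraging activations that are easier to peak pick.

```python
import torch
import torch.nn.functional as F

# y: sigmoid output activations, t: binary targets; both (batch, frames).

def multi_individual_loss(y, t, offsets=(1, 2)):
    """Framewise cross entropy plus cross entropies pairing each output
    frame with targets at neighbouring time steps (plausible reading)."""
    loss = F.binary_cross_entropy(y, t)
    for k in offsets:
        loss = loss + F.binary_cross_entropy(y[:, k:], t[:, :-k])
        loss = loss + F.binary_cross_entropy(y[:, :-k], t[:, k:])
    return loss / (1 + 2 * len(offsets))

def multi_difference_loss(y, t, alpha=1.0):
    """Framewise cross entropy plus a penalty matching sequential
    time-step output differences to target differences (plausible
    reading)."""
    ce = F.binary_cross_entropy(y, t)
    dy = y[:, 1:] - y[:, :-1]
    dt = t[:, 1:] - t[:, :-1]
    return ce + alpha * F.mse_loss(dy, dt)

y = torch.rand(4, 100)                      # dummy activations
t = (torch.rand(4, 100) > 0.9).float()      # sparse binary onset targets
print(multi_individual_loss(y, t).item(), multi_difference_loss(y, t).item())
```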