Automatic music genre classification
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, in fulfillment of the requirements for the degree of Master of Science. 2014. No abstract provided.
Music Information Retrieval using Machine Learning and Convolutional Neural Networks
In this thesis an attempt is made to analyze and retrieve musical information using Machine Learning algorithms and Convolutional Neural Networks. The goal is to recognize and classify musical tracks based on their emotional impact, their genre, and their similarity to other songs in a collection. To accomplish this goal, Convolutional Neural Network models were built for the valence, energy, danceability, and genre of a song. The genre classes were created using clustering methods. All models were trained on a large volume of musical data using a Convolutional Neural Network from the Deep Audio Features python library. Finally, the evaluation of the models was based on a dataset created through user interaction: users compared triplets of songs and decided which song, in their opinion, was the least compatible with the other two. The evaluation procedure was also supported by a music content-based application in which songs from a small collection can be visualized and compared according to some of their characteristics.
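The abstract notes that the genre classes were created with clustering methods. As an illustrative sketch only (not the thesis's actual pipeline, which is not specified here), a plain k-means over per-song feature vectors could group songs into pseudo-genre clusters like this:

```python
import numpy as np

def kmeans(features, k, iters=20):
    """Toy k-means: group songs into k pseudo-genre clusters."""
    # Farthest-point init: start from song 0, then repeatedly pick the
    # song farthest from all centroids chosen so far.
    centroids = [features[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centroids], axis=0)
        centroids.append(features[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assign each song to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned songs.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels

# Two well-separated blobs of synthetic "songs" in a 4-D feature space.
rng = np.random.default_rng(1)
songs = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                   rng.normal(5.0, 0.1, (20, 4))])
labels = kmeans(songs, k=2)
```

In practice the feature vectors would come from the trained CNN models (e.g. valence, energy, danceability predictions) rather than random data.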
Singing Voice Recognition for Music Information Retrieval
This thesis proposes signal processing methods for analysis of singing voice audio signals, with the objectives of obtaining information about the identity and lyrics content of the singing. Two main topics are presented: singer identification in monophonic and polyphonic music, and lyrics transcription and alignment. The information automatically extracted from the singing voice is meant to be used for applications such as music classification, sorting and organizing music databases, and music information retrieval.
For singer identification, the thesis introduces methods from general audio classification as well as specific methods for dealing with the presence of accompaniment. The emphasis is on singer identification in polyphonic audio, where the singing voice is present along with musical accompaniment. The presence of instruments is detrimental to voice identification performance, and eliminating the effect of instrumental accompaniment is an important aspect of the problem. The study of singer identification is centered on the degradation of classification performance in the presence of instruments, and on separation of the vocal line for improving performance. For the study, monophonic singing was mixed with instrumental accompaniment at different signal-to-noise (singing-to-accompaniment) ratios, and classification was performed both on the polyphonic mixture and on the vocal line separated from it. The classification method that includes the vocal separation step significantly improves performance compared to classifying the polyphonic mixtures directly, though it does not reach the performance achieved on monophonic singing itself. Nevertheless, the results show that singing voices can be classified robustly in polyphonic music when source separation is used.
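The mixing step described above, creating polyphonic test material at a controlled singing-to-accompaniment ratio, can be sketched as follows. The scaling rule is standard practice for SNR-controlled mixing, not code from the thesis, and the sinusoids stand in for real recordings:

```python
import numpy as np

def mix_at_snr(vocal, accomp, snr_db):
    """Mix vocal and accompaniment at a target singing-to-accompaniment
    ratio (in dB) by rescaling the accompaniment."""
    p_voc = np.mean(vocal ** 2)
    p_acc = np.mean(accomp ** 2)
    # Gain chosen so that p_voc / (gain^2 * p_acc) equals the target ratio.
    gain = np.sqrt(p_voc / (p_acc * 10 ** (snr_db / 10)))
    return vocal + gain * accomp

# Synthetic 1-second "signals" at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
vocal = np.sin(2 * np.pi * 220 * t)
accomp = np.sin(2 * np.pi * 110 * t)
mixture = mix_at_snr(vocal, accomp, snr_db=0.0)  # equal vocal/accompaniment power
```

Sweeping `snr_db` over a range of values reproduces the kind of controlled degradation study the abstract describes.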
In the problem of lyrics transcription, the thesis introduces the general speech recognition framework and various adjustments that can be made before applying the methods to singing voice. The variability of phonation in singing poses a significant challenge to the speech recognition approach. The thesis proposes using phoneme models trained on speech data and adapted to singing voice characteristics for the recognition of phonemes and words from a singing voice signal. Language models and adaptation techniques are an important aspect of the recognition process. There are two different ways of recognizing the phonemes in the audio: one is alignment, where the true transcription is known and the phonemes only have to be located in time; the other is recognition, where both the transcription and the locations of the phonemes have to be found. Alignment is thus a simplified form of the recognition task.
Alignment of textual lyrics to music audio is performed by aligning the phonetic transcription of the lyrics with the vocal line separated from the polyphonic mixture, using a collection of commercial songs. Word recognition is tested for transcription of lyrics from monophonic singing. The performance of the proposed system for automatic alignment of lyrics and audio is sufficient for applications such as automatic karaoke annotation or song browsing. The word recognition accuracy of lyrics transcription from singing is quite low, but it is shown to be useful in a query-by-singing application, for performing a textual search based on the words recognized from the query. When some key words in the query are recognized, the song can be reliably identified.
Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching
Sequences of feature vectors are a natural way of representing temporal data. Given a database of sequences, a fundamental task is to find the database entry which is the most similar to a query. In this thesis, we present learning-based methods for efficiently and accurately comparing sequences in order to facilitate large-scale sequence search. Throughout, we will focus on the problem of matching MIDI files (a digital score format) to a large collection of audio recordings of music. The combination of our proposed approaches enables us to create the largest corpus of paired MIDI files and audio recordings ever assembled.
Dynamic time warping (DTW) has proven to be an extremely effective method for both aligning and matching sequences. However, its performance is heavily affected by factors such as the feature representation used and its adjustable parameters. We therefore investigate automatically optimizing DTW-based alignment and matching of MIDI and audio data. Our approach uses Bayesian optimization to tune system design and parameters over a synthetically-created dataset of audio and MIDI pairs. We then perform an exhaustive search over DTW score normalization techniques to find the optimal method for reporting a reliable alignment confidence score, as required in matching tasks. This results in a DTW-based system which is conceptually simple and highly accurate at both alignment and matching. We also verify that this system achieves high performance in a large-scale qualitative evaluation of real-world alignments.
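As a minimal illustration of the DTW machinery this work builds on (a textbook formulation, not the optimized Bayesian-tuned system described above), the cumulative alignment cost between two feature sequences can be computed like this:

```python
import numpy as np

def dtw(X, Y):
    """Plain DTW between two feature sequences (rows are frames).
    Returns the cumulative cost of the best warping path."""
    n, m = len(X), len(Y)
    # Pairwise Euclidean distances between frames.
    D = np.linalg.norm(X[:, None] - Y[None, :], axis=2)
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each step may advance one sequence, the other, or both.
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j],
                                            C[i, j - 1],
                                            C[i - 1, j - 1])
    return C[n, m]

a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])  # same contour, stretched
cost_same = dtw(a, b)       # warping absorbs the tempo difference
cost_diff = dtw(a, b[::-1]) # reversed contour cannot be warped away
```

The quadratic table fill in the inner loop is exactly the cost that motivates the normalization, hashing, and pruning work discussed in the following paragraphs.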
Unfortunately, DTW can be far too inefficient for large-scale search when sequences are very long and consist of high-dimensional feature vectors. We therefore propose a method for mapping sequences of continuously-valued feature vectors to downsampled sequences of binary vectors. Our approach involves training a pair of convolutional networks to map paired groups of subsequent feature vectors to a Hamming space where similarity is preserved. Evaluated on the task of matching MIDI files to a large database of audio recordings, we show that this technique enables 99.99% of the database to be discarded with a modest false reject rate while only requiring 0.2% of the time to compute.
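The pruning idea, ranking database hashes by Hamming distance to the query hash and keeping only the closest few for expensive DTW matching, can be sketched as follows. The hash values here are random stand-ins, not outputs of the trained convolutional networks:

```python
import numpy as np

def hamming_prune(query_hash, db_hashes, keep_frac):
    """Rank database entries by Hamming distance to the query's binary
    hash and keep only the closest fraction for full DTW matching."""
    # Count differing bits per database entry.
    dists = np.count_nonzero(db_hashes != query_hash, axis=1)
    keep = max(1, int(len(db_hashes) * keep_frac))
    return np.argsort(dists, kind="stable")[:keep]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)  # 1000 32-bit hashes
query = db[42].copy()
query[0] ^= 1  # query differs from entry 42 by a single bit
survivors = hamming_prune(query, db, keep_frac=0.005)  # keep closest 0.5%
```

Because Hamming distance on packed bits reduces to XOR plus popcount, this coarse filter is orders of magnitude cheaper than the DTW it replaces for the discarded entries.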
Even when sped-up with a more efficient representation, the quadratic complexity of DTW greatly hinders its feasibility for very large-scale search. This cost can be avoided by mapping entire sequences to fixed-length vectors in an embedded space where sequence similarity is approximated by Euclidean distance. To achieve this embedding, we propose a feed-forward attention-based neural network model which can integrate arbitrarily long sequences. We show that this approach can extremely efficiently prune 90% of our audio recording database with high confidence.
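A bare-bones version of feed-forward attention pooling, a learned scoring vector, a softmax over frames, and a weighted average, shows how arbitrarily long sequences map to one fixed-length embedding. The weights here are random stand-ins for trained parameters, and the real model applies learned transformations before and after pooling:

```python
import numpy as np

def attention_embed(seq, w):
    """Collapse a variable-length sequence (frames x dims) to a single
    fixed-length vector via softmax attention over frames."""
    scores = seq @ w                      # one scalar score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: weights sum to 1
    return weights @ seq                  # attention-weighted average

rng = np.random.default_rng(0)
w = rng.normal(size=8)                    # stand-in for a learned scoring vector
short_seq = rng.normal(size=(50, 8))
long_seq = rng.normal(size=(500, 8))
e1 = attention_embed(short_seq, w)
e2 = attention_embed(long_seq, w)         # same output size regardless of length
```

Once every sequence is a fixed-length vector, database search reduces to Euclidean nearest-neighbor queries, which is what makes the 90% pruning cheap.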
After developing these approaches, we applied them together to the practical task of matching 178,561 unique MIDI files to the Million Song Dataset. The resulting "Lakh MIDI Dataset" provides a potential bounty of ground truth information for audio content-based music information retrieval. This can include transcription, meter, lyrics, and high-level musicological features. The reliability of the resulting annotations depends both on the quality of the transcription and the accuracy of the score-to-audio alignment. We therefore establish a baseline of reliability for score-derived information for different content-based MIR tasks. Finally, we discuss potential future uses of our dataset and the learning-based sequence comparison methods we developed.