5 research outputs found
A Comparison Study of Deep Learning Methodologies for Music Emotion Recognition
Classical machine learning techniques have dominated Music Emotion Recognition (MER). However, improvements have slowed down due to the complex and time-consuming task of handcrafting new emotionally relevant audio features. Deep learning methods have recently gained popularity in the field because of their ability to automatically learn relevant features from spectral representations of songs, eliminating that necessity. Nonetheless, there are limitations, such as the need for large amounts of quality labeled data, a common problem in MER research. To understand the effectiveness of these techniques, a comparison study using various classical machine learning and deep learning methods was conducted. The results showed that an ensemble of a Dense Neural Network and a Convolutional Neural Network architecture achieved a state-of-the-art 80.20% F1-score, an improvement of around 5% over the best baseline results, leading to the conclusion that future research should take advantage of both paradigms, that is, combine handcrafted features with feature learning.
Exploring Deep Learning Methodologies for Music Emotion Recognition
Classical machine learning techniques have dominated Music Emotion Recognition (MER). However, improvements have slowed down due to the complex and time-consuming task of handcrafting new emotionally relevant audio features. Deep learning methods have recently gained popularity in the field because of their ability to automatically learn relevant features from spectral representations of songs, eliminating that necessity. Nonetheless, there are limitations, such as the need for large amounts of quality labeled data, a common problem in MER research. To understand the effectiveness of these techniques, a comparison study using various classical machine learning and deep learning methods was conducted. The results showed that an ensemble of a Dense Neural Network and a Convolutional Neural Network architecture achieved a state-of-the-art 80.20% F1-score, an improvement of around 5% over the best baseline results, leading to the conclusion that future research should take advantage of both paradigms, that is, combine handcrafted features with feature learning.
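To make the hybrid direction suggested above more concrete, the following is a minimal sketch of a late-fusion ensemble that pairs a Dense Neural Network over handcrafted features with a Convolutional Neural Network over Mel-spectrogram patches. It assumes a Keras-style setup; the input shapes, layer sizes, and probability-averaging scheme are illustrative assumptions, not the exact architecture used in the study.

```python
# Minimal sketch of a hybrid ensemble: a Dense network on handcrafted
# features and a CNN on Mel-spectrogram patches, with averaged predictions.
# Shapes, layer sizes, and the averaging scheme are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4  # e.g. Russell's quadrants

# Dense branch over a vector of handcrafted audio features
dnn = models.Sequential([
    layers.Input(shape=(100,)),          # assumed feature-vector length
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# CNN branch over a Mel-spectrogram "image"
cnn = models.Sequential([
    layers.Input(shape=(128, 128, 1)),   # assumed spectrogram patch size
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

def ensemble_predict(features, spectrograms):
    """Average the class probabilities of both branches (late fusion)."""
    p_dnn = dnn.predict(features, verbose=0)
    p_cnn = cnn.predict(spectrograms, verbose=0)
    return np.argmax((p_dnn + p_cnn) / 2.0, axis=1)
```

In this sketch each branch is trained separately on its own input representation, and only the predicted class probabilities are combined, which is one simple way of joining handcrafted features with feature learning.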
Exploring Song Segmentation for Music Emotion Variation Detection
This paper evaluates the impact of song segmentation on Music Emotion Variation Detection (MEVD). In particular, the All-In-One song-structure segmentation system was employed to this end and compared to a fixed 1.5-second window approach. Acoustic features were extracted for each obtained segment/window and classified with SVMs. The attained results (best F1-score of 55.9%) suggest that, despite its promise, the potential of this song segmentation approach was not fully exploited, possibly due to the small dataset employed. Nevertheless, the preliminary results are encouraging. This work is funded by FCT - Foundation for Science and Technology, I.P., within the scope of the projects: MERGE - DOI: 10.54499/PTDC/CCI-COM/3171/2021, financed with national funds (PIDDAC) via the Portuguese State Budget; and project CISUC - UID/CEC/00326/2020, with funds from the European Social Fund, through the Regional Operational Program Centro 2020. Renato Panda was supported by Ci2 - FCT UIDP/05567/2020.
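As an illustration of the fixed-window baseline described above, the sketch below extracts acoustic features per 1.5-second window and classifies each window with an SVM. It assumes librosa and scikit-learn; the MFCC statistics and SVM parameters are illustrative placeholders, not the feature set or configuration used in the paper.

```python
# Sketch of the fixed-window MEVD baseline: extract acoustic features per
# 1.5-second window and classify each window with an SVM.
# Feature choice and SVM parameters are illustrative assumptions.
import librosa
import numpy as np
from sklearn.svm import SVC

WINDOW_SEC = 1.5

def window_features(path):
    """Return one feature vector (MFCC means and stds) per 1.5-second window."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    hop = int(WINDOW_SEC * sr)
    feats = []
    for start in range(0, len(y) - hop + 1, hop):
        seg = y[start:start + hop]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
        feats.append(np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]))
    return np.array(feats)

# X_train: stacked window features, y_train: one emotion label per window
# clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
# predictions = clf.predict(window_features("song.wav"))
```

The structure-aware variant would simply replace the fixed hop with the segment boundaries returned by the All-In-One system.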
MERGE Audio: Audio Analysis and Feature Engineering for Music Emotion Recognition (MSc Thesis)
Master's dissertation in Informatics Engineering presented to the Faculdade de Ciências e Tecnologia. With the increasing availability of digital music streaming options, interest in the field of music emotion recognition has grown accordingly. The field is still dominated by classical approaches that use feature engineering to classify the perceived emotion of a song. In recent years, there has also been a surge of deep learning approaches that use neural networks to tackle the same problem. However, these approaches suffer from various problems, such as the use of small, private, or low-quality datasets, as well as the use of features not designed for emotion classification, amongst others. This work proposes a set of three new datasets, named Music Emotion Recognition - Next Generation (MERGE), with three components: audio, lyrics, and bimodal. These datasets are an extension of the previous 4QAED dataset (Panda, 2019) and achieved F1-scores of 71% using the same feature set as 4QAED, while being much larger. Furthermore, this work proposes a set of new emotionally relevant features to help tackle the aforementioned problems, using techniques such as automatic music transcription with tools such as Magenta MT3 (Section 2.6.2). From this framework, a set of features extracted from the output MIDI file is proposed. Finally, using the percussion stem extracted with Demucs (Section 2.7.3), a novel set of features computed from the percussion track is also proposed. A subset of this novel feature set achieved an overall F1-score of 73.4% on the MERGE_Bimodal_Complete dataset (see Section 3.1.2 for further details on the datasets).
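As a rough illustration of how descriptors might be computed from a MIDI file produced by an automatic transcription tool such as MT3, the sketch below derives a few note-level statistics with pretty_midi. The specific statistics are illustrative assumptions and not the feature set proposed in the thesis.

```python
# Sketch: simple note-level statistics from a MIDI file produced by an
# automatic transcription system (e.g. MT3). The chosen statistics are
# illustrative assumptions, not the exact features proposed in the thesis.
import numpy as np
import pretty_midi

def midi_features(midi_path):
    """Return a small dictionary of note-level descriptors for one MIDI file."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    notes = [n for inst in pm.instruments for n in inst.notes]
    if not notes:
        return None
    pitches = np.array([n.pitch for n in notes])
    durations = np.array([n.end - n.start for n in notes])
    velocities = np.array([n.velocity for n in notes])
    return {
        "note_density": len(notes) / pm.get_end_time(),  # notes per second
        "pitch_range": int(pitches.max() - pitches.min()),
        "mean_duration": float(durations.mean()),
        "mean_velocity": float(velocities.mean()),
        "tempo_estimate": float(pm.estimate_tempo()),
    }
```

The same pattern could be applied to a MIDI transcription of the percussion stem isolated by a source-separation tool such as Demucs.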
How Does the Spotify API Compare to the Music Emotion Recognition State-of-the-Art?
Features are arguably the key factor in any machine learning problem. Over the decades, a myriad of audio features and, more recently, feature-learning approaches have been tested in Music Emotion Recognition (MER) with scarce improvements. Here, we shed some light on the suitability of the audio features provided by the Spotify API, the leading music streaming service, when applied to MER. To this end, 12 Spotify API features were obtained for 704 of the 900 songs in our dataset, annotated in terms of Russell's quadrants. These are compared to emotionally relevant features obtained previously, using feature ranking and emotion classification experiments. We verified that the energy, valence, and acousticness features from Spotify are highly relevant to MER. However, the 12-feature set is unable to match the performance of the features available in the state of the art (58.5% vs. 74.7% F1-measure). Combining the Spotify and state-of-the-art sets leads to small improvements with fewer features (top 5: +2.3%, top 10: +1.1%), while not improving the maximum results (100 features). From this we conclude that Spotify provides some higher-level emotionally relevant features. Such extractors are desirable, since they are closer to human concepts and allow interpretable rules to be extracted (harder with hundreds of abstract features). Still, additional emotionally relevant features are needed to improve MER. This work was supported by CISUC (Center for Informatics and Systems of the University of Coimbra). Renato Panda was supported by Ci2 - FCT UIDP/05567/2020.
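For illustration, the sketch below retrieves Spotify audio features for a list of track IDs and concatenates them with a precomputed handcrafted feature matrix before classification. It assumes the spotipy client and valid API credentials (read from environment variables); the track IDs, feature keys, and classifier choice are placeholders, not the exact setup of the study.

```python
# Sketch: fetch Spotify audio features for a list of track IDs and append
# them to an existing handcrafted feature matrix before classification.
# Track IDs and the classifier are placeholders; credentials are read from
# the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET environment variables.
import numpy as np
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from sklearn.svm import SVC

SPOTIFY_KEYS = ["energy", "valence", "acousticness", "danceability",
                "instrumentalness", "speechiness", "liveness", "tempo"]

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def spotify_features(track_ids):
    """Return an (n_tracks, n_keys) array of Spotify audio features."""
    rows = []
    for start in range(0, len(track_ids), 100):  # the endpoint accepts up to 100 IDs
        for item in sp.audio_features(track_ids[start:start + 100]):
            rows.append([item[k] for k in SPOTIFY_KEYS])
    return np.array(rows)

# X_handcrafted: precomputed state-of-the-art features, y: quadrant labels
# X_combined = np.hstack([X_handcrafted, spotify_features(track_ids)])
# clf = SVC(kernel="rbf").fit(X_combined, y)
```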