The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use
The GTZAN dataset appears in at least 100 published works, and is the
most-used public dataset for evaluation in machine listening research for music
genre recognition (MGR). Our recent work, however, shows GTZAN has several
faults (repetitions, mislabelings, and distortions), which challenge the
interpretability of any result derived using it. In this article, we disprove
the claims that all MGR systems are affected in the same ways by these faults,
and that the performances of MGR systems in GTZAN are still meaningfully
comparable since they all face the same faults. We identify and analyze the
contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has
been used in MGR research, and find few indications that its faults have been
known and considered. Finally, we rigorously study the effects of its faults on
evaluating five different MGR systems. The lesson is not to banish GTZAN, but
to use it with consideration of its contents.
Comment: 29 pages, 7 figures, 6 tables, 128 references
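As a rough illustration of the repetition fault catalogued above, the sketch below flags exact duplicate excerpts in a GTZAN-style directory by hashing raw file bytes. The folder name "genres" and the file extensions are assumptions about the usual GTZAN layout; mislabelings and distortions, the other fault classes, cannot be detected this way.

```python
# Minimal sketch: flag exact duplicate audio files in a GTZAN-style folder
# by hashing raw file bytes. Exact repetitions are only one of the fault
# classes described in the paper; mislabelings and distortions require
# listening or fingerprinting, which this sketch does not attempt.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root: str) -> dict:
    """Group .wav/.au files under `root` by the MD5 digest of their bytes."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in {".wav", ".au"}:
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only hashes seen more than once, i.e. candidate repetitions.
    return {h: ps for h, ps in groups.items() if len(ps) > 1}

if __name__ == "__main__":
    # "genres" is the conventional GTZAN folder name; adjust as needed.
    for digest, paths in find_exact_duplicates("genres").items():
        print(digest, [p.name for p in paths])
```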
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
- Objective: What musical content is to be generated? Examples are melody, polyphony, accompaniment, or counterpoint. For what destination and for what use? To be performed by a human (in the case of a musical score) or by a machine (in the case of an audio file).
- Representation: What are the concepts to be manipulated? Examples are waveform, spectrogram, note, chord, meter, and beat. What format is to be used? Examples are MIDI, piano roll, or text. How will the representation be encoded? Examples are scalar, one-hot, or many-hot (see the encoding sketch after this abstract).
- Architecture: What type(s) of deep neural network are to be used? Examples are feedforward networks, recurrent networks, autoencoders, or generative adversarial networks.
- Challenge: What are the limitations and open challenges? Examples are variability, interactivity, and creativity.
- Strategy: How do we model and control the process of generation? Examples are single-step feedforward, iterative feedforward, sampling, or input manipulation.
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose some tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.
Comment: 209 pages. This paper is a simplified version of the book: J.-P. Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music Generation, Computational Synthesis and Creative Systems, Springer, 201
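To make two of the encodings named in the survey concrete, here is a minimal sketch of a one-hot melody encoding and a many-hot piano-roll encoding. The MIDI pitch range, time grid, and toy note lists are illustrative assumptions, not anything the survey prescribes.

```python
# Minimal sketch of two encodings discussed in the survey: a one-hot
# melody (exactly one active pitch per time step) and a many-hot piano
# roll (any number of simultaneous pitches per time step).
import numpy as np

N_PITCHES = 128  # assumed MIDI pitch range 0..127

def one_hot_melody(pitches):
    """Encode a monophonic melody: one pitch index per time step."""
    roll = np.zeros((len(pitches), N_PITCHES), dtype=np.float32)
    roll[np.arange(len(pitches)), pitches] = 1.0
    return roll

def many_hot_roll(chords):
    """Encode polyphony: several active pitches per time step."""
    roll = np.zeros((len(chords), N_PITCHES), dtype=np.float32)
    for t, chord in enumerate(chords):
        roll[t, chord] = 1.0
    return roll

melody = one_hot_melody([60, 62, 64, 65])             # C D E F
accomp = many_hot_roll([[48, 52, 55], [48, 53, 57]])  # two toy triads
print(melody.shape, accomp.shape)  # (4, 128) (2, 128)
```

A one-hot step activates exactly one of the 128 pitch bins, which suits monophonic melody; the many-hot roll allows several active bins per step, which is what polyphony and accompaniment require.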
Speech Mode Classification using the Fusion of CNNs and LSTM Networks
Speech mode classification has not been explored as widely as other areas of sound classification, such as environmental sounds, music genre, and speaker identification. But what is speech mode? While mode in general denotes the way or manner in which something occurs or is expressed, speech mode refers to the style in which speech is delivered by a person.
There are some reports on classifying speech modes, such as whispered versus normally phonated speech, using conventional methods. However, to the best of our knowledge, deep learning-based methods have not been reported in the open literature for this classification scenario. Specifically, in this work we assess the performance of image-based classification algorithms on this challenging speech mode classification problem, including the use of pre-trained deep neural networks, namely AlexNet, ResNet18, and SqueezeNet. We compare the classification efficiency of a set of deep learning (DL)-based classifiers, and we also assess the impact of different 2D image representations (spectrograms, mel-spectrograms, and their image-based fusion) on classification accuracy. These representations are generated from the original audio signals and used as input to the networks. Next, we compare the accuracy of the DL-based classifiers to a set of machine learning (ML) classifiers that use Mel-Frequency Cepstral Coefficient (MFCC) features as their inputs. Then, after determining the most efficient sampling rate for our classification problem (i.e., 32 kHz), we study the performance of our proposed method of combining CNNs with Long Short-Term Memory (LSTM) networks, using the features extracted from the deep networks of the previous step. We conclude our study by evaluating the role of sampling rate on classification accuracy, generating two sets of 2D image representations, one with 32 kHz and the other with 16 kHz sampling. Experimental results show that, after cross-validation, the accuracy of the DL-based approaches is 15% higher than that of the ML ones, with SqueezeNet yielding an accuracy of more than 91% at 32 kHz, whether we use transfer learning, feature-level fusion, or score-level fusion (92.5%). Our proposed LSTM-based method further increased that accuracy by more than 3%, resulting in an average accuracy of 95.7%.
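A minimal sketch of the CNN-to-LSTM idea described in this abstract follows: a small CNN maps each block of mel-spectrogram frames to a feature vector, and an LSTM reads the resulting sequence before a classification head. The layer sizes, the number of speech modes, and the use of a toy CNN in place of a pretrained SqueezeNet or ResNet18 are assumptions for illustration only.

```python
# Minimal sketch of a CNN -> LSTM classifier over mel-spectrogram chunks.
# Layer sizes and the number of classes are illustrative assumptions.
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        # Per-chunk CNN over (1, n_mels, frames_per_chunk) spectrogram patches.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=16 * 8 * 8, hidden_size=64,
                            batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, seq_len, 1, n_mels, frames_per_chunk)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # per-chunk features
        out, _ = self.lstm(feats)                          # temporal modeling
        return self.head(out[:, -1])                       # last-step logits

# Dummy batch: 2 clips, 10 chunks of 128 mel bands x 32 frames each.
logits = CnnLstmClassifier()(torch.randn(2, 10, 1, 128, 32))
print(logits.shape)  # torch.Size([2, 4])
```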
Enhancing Audio Signal Quality and Learning Experience with Integrated Covariance Wiener Filtering in College Music Education
In recent years, computer music technology has become increasingly prevalent in college music education, offering new possibilities for creative expression and pedagogical approaches. This paper concentrates on college music education with the application of integrated time and frequency filtering (ITFF) combined with Kalman-integrated covariance Wiener filtering. The ITFF technique combines time- and frequency-domain analysis to enhance the quality and clarity of audio signals. By integrating the Kalman-integrated covariance Wiener filtering, the ITFF method provides robust noise reduction and improved signal representation. This integrated approach enables music educators to effectively analyze and manipulate audio signals in real time, fostering a more immersive and engaging learning environment for students. The findings of this study highlight the benefits and potential applications of ITFF with Kalman-integrated covariance Wiener filtering in college music education, including audio signal enhancement, sound synthesis, and interactive performance systems. The integration of computer music technology with advanced filtering techniques presents new opportunities for exploring sound, composition, and music production within an educational context.
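For context, the sketch below implements a generic frequency-domain Wiener gain for audio denoising, estimating the noise spectrum from a leading noise-only segment. It is a textbook Wiener filter, not the paper's ITFF or Kalman-integrated covariance variant; the segment length, FFT size, and toy signal are illustrative assumptions.

```python
# Minimal sketch of a frequency-domain Wiener filter for audio denoising.
# The first `noise_seconds` of the input are assumed to be noise only.
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(x, fs, noise_seconds=0.5, nperseg=1024):
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    hop = nperseg // 2
    noise_frames = max(1, int(noise_seconds * fs / hop))
    # Noise power spectrum estimated from the leading noise-only frames.
    noise_psd = np.mean(np.abs(X[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    signal_psd = np.maximum(np.abs(X) ** 2 - noise_psd, 0.0)
    gain = signal_psd / (signal_psd + noise_psd + 1e-12)  # Wiener gain per bin
    _, y = istft(gain * X, fs=fs, nperseg=nperseg)
    return y[: len(x)]

# Toy example: 0.5 s of noise followed by a 440 Hz tone in white noise.
fs = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
noisy = 0.3 * np.random.randn(fs + fs // 2)
noisy[fs // 2:] += tone
clean = wiener_denoise(noisy, fs)
print(noisy.shape, clean.shape)
```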