
23 research outputs found

Supervised and Unsupervised Learning of Audio Representations for Music Understanding

Authors: Ehmann Andreas F.; Gouyon Fabien; Korzeniowski Filip; McCallum Matthew C.; Oramas Sergio
Publication date: 07/10/2022

In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affect the adequacy of the resulting audio embeddings for downstream tasks. We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done efficiently with models containing fewer than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs. Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of representations learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art performance in unsupervised learning (and, in some cases, supervised learning) for music understanding. We also corroborate that, while achieving state-of-the-art performance on many tasks, supervised learning can cause models to specialize to the supervised information provided, somewhat compromising a model's generality.

arXiv.org e-Print Archive

Recommended from our members

Melody Transcription From Music Audio: Approaches and Evaluation

Authors: Ehmann Andreas F.; Ellis Daniel P. W.; Gomez Emilia; Ong Beesuan; Poliner Graham E.; Streich Sebastian
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2007

Although the process of analyzing an audio recording of a music performance is complex and difficult even for a human listener, there are limited forms of information that may be tractably extracted and yet still enable interesting applications. We discuss melody (roughly, the part a listener might whistle or hum) as one such reduced descriptor of music audio, and consider how to define it and what use it might be. We go on to describe the results of full-scale evaluations of melody transcription systems conducted in 2004 and 2005, including an overview of the systems submitted, details of how the evaluations were conducted, and a discussion of the results. For our definition of melody, current systems can achieve around 70% correct transcription at the frame level, including distinguishing between the presence or absence of the melody. Melodies transcribed at this level are readily recognizable, and show promise for practical applications.

Columbia University Academic Commons

Melody Transcription From Music Audio: Approaches and Evaluation

Authors: Andreas F. Ehmann; Beesuan Ong; Daniel P. W. Ellis; Emilia Gomez; Graham E. Poliner; Sebastian Streich
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Structured audio content analysis and metadata in a digital library

Authors: Bainbridge David; Downie J. Stephen; Ehmann Andreas F.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2012

This work illustrates how audio content analysis of music and manually assigned structural temporal metadata can be used to form a digital library designed for musicological exploration. In addition to text-based searching and browsing, the document view is enriched with an interactive structured audio time-line that shows ground-truth data representing the logical segments of the song, and a version that was automatically generated for comparison. An interactive self-similarity "heat" map is also displayed. Clicking within the map at a co-ordinate (x,y) results in the audio being played simultaneously at time offsets x and y, panned left and right, respectively, to make it easier for the listener to separate out the differences. The musicologist can also initiate an audio content based query starting at any point in the song. This produces a ranked result set whose items can be further studied through their respective document views. Alternatively, they can perform a musical structure search (for example, for songs that contain the structure b, b, c, b, c).

Crossref

Research Commons@Waikato

Music Information Retrieval Evaluation eXchange (MIREX 2005): Preliminary overview.

Authors: Andreas F Ehmann; J Stephen Downie; Jin Ha Lee
Publication venue: Vathek Publishing
Publication date: 01/01/2005

CiteSeerX