In this work, we provide a broad comparative analysis of strategies for
pre-training audio understanding models for several tasks in the music domain,
including labelling of genre, era, origin, mood, instrumentation, key, pitch,
vocal characteristics, tempo and sonority. Specifically, we explore how the
domain of the pre-training dataset (music or generic audio) and the
pre-training methodology (supervised or unsupervised) affect the suitability
of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale
expert-annotated music datasets achieve state-of-the-art performance in a wide
range of music labelling tasks, each with novel content and vocabularies. This
can be done efficiently with models containing fewer than 100 million
parameters that require no fine-tuning or reparameterization for downstream
tasks, making this approach practical for industry-scale audio catalogs.
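To make the frozen-embedding workflow concrete, the sketch below trains a linear probe on top of a fixed encoder, so only the probe's parameters receive gradient updates. The `FrozenEncoder`, its 1728-dimensional output, and the 10-class labelling task are hypothetical placeholders, not the models evaluated in this work.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for any pre-trained audio embedding model; its weights
    stay fixed, so downstream tasks need no fine-tuning."""
    def __init__(self, dim=1728):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))  # placeholder body

    @torch.no_grad()  # embeddings are computed without encoder gradients
    def forward(self, x):
        return self.net(x)

encoder = FrozenEncoder().eval()
probe = nn.Linear(1728, 10)            # shallow probe for a 10-class labelling task
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

waveforms = torch.randn(32, 1, 16000)  # dummy batch of 1 s clips at 16 kHz
labels = torch.randint(0, 10, (32,))

emb = encoder(waveforms)               # can be precomputed once per catalog item
loss = loss_fn(probe(emb), labels)
loss.backward()                        # gradients flow only into the probe
opt.step()
```

Because embeddings can be precomputed once per track and reused across tasks, the marginal cost of each new labelling task is only that of training a small probe.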
Within the class of unsupervised learning strategies, we show that the domain
of the training dataset can significantly impact the performance of
representations learned by the model. We find that restricting the domain of
the pre-training dataset to music allows training with smaller batch sizes
while achieving state-of-the-art performance among unsupervised methods -- and,
in some cases, supervised methods -- for music understanding.
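One reason batch size matters here: in contrastive-style unsupervised objectives, the other items in a batch serve as negatives, so smaller batches mean fewer negatives per positive pair. The sketch below makes this dependence explicit; it assumes a SimCLR-style NT-Xent objective, which is one common choice and not necessarily the exact unsupervised strategy used in this work.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss over two augmented views of N clips:
    every other item in the batch acts as a negative, so the number of
    negatives per positive pair scales with batch size."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d) unit vectors
    sim = z @ z.t() / temperature                        # pairwise similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self-pairs
    # Row i's positive is the other view of the same clip: i+n or i-n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Two views of the same 256 clips; shrinking 256 shrinks the negative pool.
loss = nt_xent(torch.randn(256, 128), torch.randn(256, 128))
```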
We also corroborate that, while achieving state-of-the-art performance on
many tasks, supervised learning can cause models to specialize to the
supervised information provided, somewhat compromising a model's generality.