MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers
Music annotation has always been one of the critical topics in the field of
Music Information Retrieval (MIR). Traditional models use supervised learning
for music annotation tasks. However, as supervised machine learning approaches
grow in complexity, their demand for annotated training data often outstrips
the data that is available. In this paper, a new
self-supervised music acoustic representation learning approach named MusiCoder
is proposed. Inspired by the success of BERT, MusiCoder builds upon the
architecture of self-attention bidirectional transformers. Two pre-training
objectives, including Contiguous Frames Masking (CFM) and Contiguous Channels
Masking (CCM), are designed to adapt BERT-like masked reconstruction
pre-training to the continuous acoustic frame domain. The performance of MusiCoder
is evaluated in two downstream music annotation tasks. The results show that
MusiCoder outperforms the state-of-the-art models in both music genre
classification and auto-tagging tasks. The effectiveness of MusiCoder points to
the great potential of a new self-supervised learning approach to music
understanding: first pre-train a transformer-based model on massive unlabeled
music acoustic data with masked reconstruction tasks, and then fine-tune the
model on specific downstream tasks with labeled data.
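As a rough illustration of the masking idea described above, the sketch below applies contiguous masking along the frame and channel axes of an acoustic feature matrix; the span lengths, number of spans, and mask value are illustrative assumptions, not the settings used by MusiCoder.

```python
import numpy as np

def contiguous_frames_mask(spec, num_spans=2, span_len=8, mask_value=0.0, rng=None):
    """Contiguous Frames Masking (CFM): zero out random contiguous blocks of time frames.

    spec: (num_frames, num_channels) acoustic features, e.g. a log-mel spectrogram.
    Returns the masked features and a boolean mask marking the frames to reconstruct.
    """
    rng = rng or np.random.default_rng()
    masked = spec.copy()
    frame_mask = np.zeros(spec.shape[0], dtype=bool)
    for _ in range(num_spans):
        start = rng.integers(0, max(1, spec.shape[0] - span_len))
        frame_mask[start:start + span_len] = True
    masked[frame_mask] = mask_value
    return masked, frame_mask

def contiguous_channels_mask(spec, num_spans=1, span_len=4, mask_value=0.0, rng=None):
    """Contiguous Channels Masking (CCM): zero out contiguous blocks of feature channels."""
    rng = rng or np.random.default_rng()
    masked = spec.copy()
    channel_mask = np.zeros(spec.shape[1], dtype=bool)
    for _ in range(num_spans):
        start = rng.integers(0, max(1, spec.shape[1] - span_len))
        channel_mask[start:start + span_len] = True
    masked[:, channel_mask] = mask_value
    return masked, channel_mask
```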
Audio Event Detection using Weakly Labeled Data
Acoustic event detection is essential for content analysis and description of
multimedia recordings. The majority of current literature on the topic learns
the detectors through fully-supervised techniques employing strongly labeled
data. However, the labels available for the majority of multimedia data are
generally weak and do not provide sufficient detail for such methods to be
employed. In this paper we propose a framework for learning acoustic event
detectors using only weakly labeled data. We first show that audio event
detection with weak labels can be formulated as a Multiple Instance Learning (MIL)
problem. We then suggest two frameworks for solving multiple-instance learning,
one based on support vector machines, and the other on neural networks. The
proposed methods can help remove the time-consuming and expensive process of
manually annotating data that fully supervised learning requires. Moreover,
they not only detect events in a recording but also provide the temporal
locations of those events. This yields a more complete description of the
recording and is notable because temporal information is absent from weakly
labeled data in the first place.
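A minimal sketch of the neural-network flavour of this multiple-instance learning formulation is shown below; the segment features, layer sizes, and max-pooling choice are illustrative assumptions rather than the paper's configuration. Segment-level scores are pooled into a recording-level prediction so training needs only weak labels, while the segment scores themselves offer approximate temporal localization.

```python
import torch
import torch.nn as nn

class MILEventDetector(nn.Module):
    """Toy MIL detector: each recording is a bag of segments; only bag-level
    (weak) labels are available for training."""

    def __init__(self, feat_dim=64, num_events=10):
        super().__init__()
        self.segment_scorer = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_events),
        )

    def forward(self, segments):
        # segments: (batch, num_segments, feat_dim), e.g. per-second audio features
        seg_scores = torch.sigmoid(self.segment_scorer(segments))  # instance-level scores
        bag_scores, _ = seg_scores.max(dim=1)                      # bag-level prediction
        return bag_scores, seg_scores

# Training uses only weak (recording-level) labels:
model = MILEventDetector()
x = torch.randn(4, 30, 64)                       # 4 recordings, 30 segments each
weak_labels = torch.randint(0, 2, (4, 10)).float()
bag_scores, seg_scores = model(x)
loss = nn.functional.binary_cross_entropy(bag_scores, weak_labels)
# seg_scores can afterwards be thresholded to recover approximate event locations in time.
```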
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Self-supervised learning (SSL) has recently emerged as a promising paradigm
for training generalisable models on large-scale data in the fields of vision,
text, and speech. Although SSL has been proven effective in speech and audio,
its application to music audio has yet to be thoroughly explored. This is
primarily due to the distinctive challenges associated with modelling musical
knowledge, particularly the tonal and pitched characteristics of music. To
address this research gap, we propose an acoustic Music undERstanding model
with large-scale self-supervised Training (MERT), which incorporates teacher
models to provide pseudo labels for masked language modelling (MLM)-style
acoustic pre-training. In our exploration, we identified a superior combination
of teacher models that outperforms conventional speech and audio approaches.
This combination includes an acoustic teacher based on
Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical
teacher based on the Constant-Q Transform (CQT). These teachers effectively
guide our student model, a BERT-style transformer encoder, to better model
music audio. In addition, we introduce an in-batch noise mixture augmentation
to enhance the representation robustness. Furthermore, we explore a wide range
of settings to overcome the instability in acoustic language model
pre-training, which allows our designed paradigm to scale from 95M to 330M
parameters. Experimental results indicate that our model can generalise and
perform well on 14 music understanding tasks and attains state-of-the-art
(SOTA) overall scores. The code and models are online:
https://github.com/yizhilll/MERT
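The sketch below illustrates one plausible form of in-batch noise mixture augmentation, in which each waveform is mixed with a randomly scaled copy of another example from the same batch; the mixing probability and gain range are assumptions, not MERT's actual recipe.

```python
import torch

def in_batch_noise_mix(waveforms, mix_prob=0.5, max_gain=0.5):
    """Toy in-batch noise mixture augmentation: with some probability, add a
    randomly scaled copy of another example from the same batch to each waveform.

    waveforms: (batch, num_samples) raw audio. All ratios here are assumed values.
    """
    batch = waveforms.size(0)
    perm = torch.randperm(batch)                        # pair each item with another batch item
    gains = torch.rand(batch, 1) * max_gain             # random mixing gain per item
    apply = (torch.rand(batch, 1) < mix_prob).float()   # only augment a subset of the batch
    return waveforms + apply * gains * waveforms[perm]
```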
MARBLE: Music Audio Representation Benchmark for Universal Evaluation
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels: acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks over 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-sourced pre-trained models developed on music recordings as baselines. In addition, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on dataset copyright issues. Results suggest that recently proposed large-scale pre-trained musical language models perform best on most tasks, with room for further improvement. The leaderboard and toolkit repository are published at this https URL to promote future music AI research.
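A benchmark of this kind typically evaluates frozen pre-trained representations with a shallow probe per task. The sketch below shows that generic protocol; the embedding function is a placeholder, and nothing here is taken from the MARBLE toolkit itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe_frozen_representations(embed_fn, train_audio, train_labels,
                                 test_audio, test_labels):
    """Evaluate a frozen pre-trained model with a shallow probe, as representation
    benchmarks commonly do. `embed_fn` is a placeholder mapping one audio clip to
    a fixed-size embedding vector."""
    X_train = np.stack([embed_fn(a) for a in train_audio])
    X_test = np.stack([embed_fn(a) for a in test_audio])
    probe = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return accuracy_score(test_labels, probe.predict(X_test))
```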
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
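For reference, a log-mel spectrogram, one of the dominant feature representations discussed, can be computed along these lines; the sample rate, FFT size, hop length, and mel-band count below are typical values, not prescriptions from the article.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    """Load an audio file and compute its log-mel spectrogram.
    Parameter values are common defaults, not prescriptive settings."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, num_frames)
```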
Music Similarity Estimation
Music is a complex form of communication through which creators and cultures express their individuality. Since the digitalization of music, recommendation systems and other online services have become indispensable in the field of Music Information Retrieval (MIR). Building these systems and recommending the right songs to users requires classifying songs. In this paper, we propose an approach for finding similarity between pieces of music based on mid-level attributes such as pitch, the MIDI value corresponding to pitch, interval, contour, and duration, and applying text-based classification techniques. Our system predicts the jazz, metal, and ragtime genres for Western music. The genre-prediction experiment is conducted on 450 music files, and the maximum accuracy achieved is 95.8% across different n-grams. We also analyze Indian classical Carnatic music and classify it by raga. Our system predicts the Sankarabharam, Mohanam, and Sindhubhairavi ragas. The raga-prediction experiment is conducted on 95 music files, and the maximum accuracy achieved is 90.3% across different n-grams. Performance is evaluated using scikit-learn's accuracy score.
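A toy version of this pipeline, treating symbolic attribute sequences as text and classifying them with n-gram features, might look as follows; the token vocabulary, classifier choice, and the tiny example data are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical symbolic encodings: each song is a space-separated string of
# mid-level attribute tokens (pitch / interval / duration symbols).
train_songs = ["C4 E4 G4 up2 up3 q q h", "A3 A3 B3 same up1 e e q"]
train_genres = ["jazz", "ragtime"]
test_songs = ["C4 E4 G4 up2 up3 q q q"]
test_genres = ["jazz"]

# Treat the token sequences as text and classify with n-gram features.
vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
clf = MultinomialNB().fit(vectorizer.fit_transform(train_songs), train_genres)
pred = clf.predict(vectorizer.transform(test_songs))
print(accuracy_score(test_genres, pred))
```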
Low-Resource Music Genre Classification with Advanced Neural Model Reprogramming
Transfer learning (TL) approaches have shown promising results when handling
tasks with limited training data. However, considerable memory and
computational resources are often required for fine-tuning pre-trained neural
networks with target domain data. In this work, we introduce a novel method for
leveraging pre-trained models for low-resource (music) classification based on
the concept of Neural Model Reprogramming (NMR). NMR aims at re-purposing a
pre-trained model from a source domain to a target domain by modifying the
input of a frozen pre-trained model. In addition to the known
input-independent reprogramming method, we propose an advanced reprogramming
paradigm, Input-dependent NMR, to increase adaptability to complex input data
such as musical audio. Experimental results suggest that a neural model
pre-trained on large-scale datasets can successfully perform music genre
classification by using this reprogramming method. The two proposed
Input-dependent NMR TL methods outperform fine-tuning-based TL methods on a
small genre classification dataset.
Comment: Submitted to ICASSP 2023. Some experimental results were omitted due
to the space limit. The implementation will be available at
https://github.com/biboamy/music-repr
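The following sketch shows what input-independent neural model reprogramming can look like in practice: a single trainable perturbation is added to every input of a frozen pre-trained classifier, and the source-domain outputs are mapped to the target genres. The shapes and the linear label-mapping scheme are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class InputIndependentReprogram(nn.Module):
    """Sketch of input-independent neural model reprogramming: one trainable
    perturbation is shared by all inputs, the pre-trained model stays frozen,
    and a linear layer maps source-domain logits to target-domain classes."""

    def __init__(self, pretrained, input_shape, num_source_classes, num_target_classes):
        super().__init__()
        self.pretrained = pretrained.eval()
        for p in self.pretrained.parameters():      # freeze the pre-trained model
            p.requires_grad_(False)
        self.delta = nn.Parameter(torch.zeros(*input_shape))                 # trainable input perturbation
        self.label_map = nn.Linear(num_source_classes, num_target_classes)   # output label mapping

    def forward(self, x):
        logits_src = self.pretrained(x + self.delta)   # reprogrammed input, frozen model
        return self.label_map(logits_src)              # map source logits to target genres
```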