PoLyScriber: Integrated Training of Extractor and Lyrics Transcriber for Polyphonic Music
Lyrics transcription of polyphonic music is challenging because the background music affects lyrics intelligibility. Typically, lyrics transcription is performed by a two-step pipeline, i.e. a singing-vocal extraction front-end followed by a lyrics transcriber back-end, where the two components are trained separately. Such a two-step pipeline suffers from both imperfect vocal extraction and a mismatch between the front-end and back-end. In this work, we propose a novel end-to-end integrated training framework, which we call PoLyScriber, to globally optimize the vocal-extractor front-end and lyrics-transcriber back-end for lyrics transcription in polyphonic music. The experimental results show that our proposed integrated training model achieves substantial improvements over existing approaches on publicly available test datasets.
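The contrast between the two-step pipeline and integrated training can be made concrete with a short sketch. The following is a hypothetical PyTorch illustration, not the authors' code: the extractor and transcriber share one computation graph, so the transcription loss alone updates both modules.

```python
# Hypothetical sketch of integrated training: one optimizer spans both the
# vocal-extractor front-end and the lyrics-transcriber back-end, so gradients
# from the lyrics loss flow into the extractor. Module names are illustrative.
import torch
import torch.nn as nn

class VocalExtractor(nn.Module):
    """Stand-in for a source-separation front-end (e.g. a U-Net)."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, mix):            # mix: (batch, time, dim) polyphonic features
        return self.net(mix)           # estimated vocal features

class LyricsTranscriber(nn.Module):
    """Stand-in for an ASR-style back-end emitting token logits per frame."""
    def __init__(self, dim=80, vocab=30):
        super().__init__()
        self.rnn = nn.LSTM(dim, 128, batch_first=True)
        self.out = nn.Linear(128, vocab)
    def forward(self, feats):
        h, _ = self.rnn(feats)
        return self.out(h)

extractor, transcriber = VocalExtractor(), LyricsTranscriber()
# Jointly optimizing both modules is what removes the front/back mismatch.
optim = torch.optim.Adam(
    list(extractor.parameters()) + list(transcriber.parameters()), lr=1e-4)
ctc = nn.CTCLoss(blank=0)

mix = torch.randn(4, 200, 80)                   # dummy polyphonic batch
targets = torch.randint(1, 30, (4, 20))         # dummy lyric token ids
logits = transcriber(extractor(mix)).log_softmax(-1)
loss = ctc(logits.transpose(0, 1), targets,     # CTC expects (time, batch, vocab)
           torch.full((4,), 200), torch.full((4,), 20))
loss.backward()
optim.step()
```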
MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers
Music annotation has always been one of the critical topics in the field of Music Information Retrieval (MIR). Traditional models use supervised learning for music annotation tasks. However, as supervised machine-learning approaches grow in complexity, their increasing need for annotated training data often cannot be met by the available data. In this paper, a new self-supervised music acoustic representation learning approach named MusiCoder is proposed. Inspired by the success of BERT, MusiCoder builds upon the architecture of self-attention bidirectional transformers. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to adapt BERT-like masked-reconstruction pre-training to the continuous acoustic frame domain. The performance of MusiCoder is evaluated on two downstream music annotation tasks. The results show that MusiCoder outperforms the state-of-the-art models in both music genre classification and auto-tagging. The effectiveness of MusiCoder indicates the potential of a new self-supervised approach to music understanding: first apply masked-reconstruction tasks to pre-train a transformer-based model with massive unlabeled music acoustic data, and then fine-tune the model on specific downstream tasks with labeled data.
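To make the masking objectives concrete, here is a minimal sketch of Contiguous Frames Masking (CFM); the span length and masking ratio are illustrative assumptions, not the paper's hyper-parameters. CCM would mask contiguous rows along the channel axis analogously.

```python
# Minimal CFM sketch: mask contiguous spans of acoustic frames and train the
# model to reconstruct them. Span length and ratio are assumed values.
import numpy as np

def contiguous_frames_mask(num_frames, mask_ratio=0.15, span=7, rng=None):
    """Return a boolean mask where True marks frames hidden in contiguous spans."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_frames, dtype=bool)
    while mask.mean() < mask_ratio:
        start = rng.integers(0, max(1, num_frames - span))
        mask[start:start + span] = True
    return mask

frames = np.random.randn(400, 96)      # (time, mel channels) dummy features
mask = contiguous_frames_mask(len(frames))
corrupted = frames.copy()
corrupted[mask] = 0.0                  # zero out masked spans (one common choice)
# Training target: reconstruct frames[mask] from `corrupted`, e.g. with an L1
# loss between the transformer's outputs at masked positions and the originals.
```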
MARBLE: Music Audio Representation Benchmark for Universal Evaluation
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchical levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-sourced pre-trained models developed on music recordings as baselines. In addition, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on the copyright issues of the datasets. Results suggest that recently proposed large-scale pre-trained musical language models perform best on most tasks, with room for further improvement. The leaderboard and toolkit repository are published at this https URL to promote future music AI research.
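A "unified protocol" of this kind typically amounts to probing frozen representations with a lightweight classifier per task. The sketch below illustrates that pattern; `embed_tracks` and the data are placeholders, not MARBLE's actual API.

```python
# Hedged sketch of the frozen-representation probing protocol: the pre-trained
# encoder is never updated; only a small probe is fit per downstream task.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_representation(embed_tracks, X_train, y_train, X_test, y_test):
    """Train a linear probe on frozen embeddings and score on held-out data."""
    Z_train = embed_tracks(X_train)    # pre-trained model, weights frozen
    Z_test = embed_tracks(X_test)
    probe = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
    return accuracy_score(y_test, probe.predict(Z_test))

# Dummy stand-in encoder: a fixed random projection of raw waveforms.
proj = np.random.randn(16000, 64)
embed = lambda X: np.asarray(X) @ proj
X_tr, y_tr = np.random.randn(100, 16000), np.random.randint(0, 4, 100)
X_te, y_te = np.random.randn(40, 16000), np.random.randint(0, 4, 40)
print(evaluate_representation(embed, X_tr, y_tr, X_te, y_te))
```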
FMA: A Dataset For Music Analysis
We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature learning and end-to-end learning is, however, restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks by 16,341 artists across 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length, high-quality audio and pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. Here we describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate baselines for genre recognition. Code, data, and usage examples are available at https://github.com/mdeff/fma
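As a usage illustration, the metadata can be filtered with pandas using the two-row column header documented in the repository; the column names below should be verified against the downloaded fma_metadata archive before relying on them.

```python
# Hedged sketch of loading FMA metadata for genre recognition. Column names
# such as ('set', 'split') and ('track', 'genre_top') follow the layout
# documented in the repository; confirm against your local copy.
import pandas as pd

tracks = pd.read_csv('fma_metadata/tracks.csv', index_col=0, header=[0, 1])

small = tracks[tracks[('set', 'subset')] == 'small']    # the smallest subset
train = small[small[('set', 'split')] == 'training']    # proposed official split
test = small[small[('set', 'split')] == 'test']

print(train[('track', 'genre_top')].value_counts())     # top-genre distribution
```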
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Self-supervised learning (SSL) has recently emerged as a promising paradigm
for training generalisable models on large-scale data in the fields of vision,
text, and speech. Although SSL has been proven effective in speech and audio,
its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges of modelling musical knowledge, particularly the tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified a combination of teacher models that outperforms conventional speech and audio approaches. This combination includes an acoustic teacher based on Residual Vector Quantization-Variational AutoEncoder (RVQ-VAE) and a musical
teacher based on the Constant-Q Transform (CQT). These teachers effectively
guide our student model, a BERT-style transformer encoder, to better model
music audio. In addition, we introduce an in-batch noise mixture augmentation
to enhance the representation robustness. Furthermore, we explore a wide range
of settings to overcome the instability in acoustic language model
pre-training, which allows our designed paradigm to scale from 95M to 330M
parameters. Experimental results indicate that our model can generalise and
perform well on 14 music understanding tasks and attains state-of-the-art
(SOTA) overall scores. The code and models are online:
https://github.com/yizhilll/MERT
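As a rough illustration of the CQT-based musical teacher, frame-level pseudo-targets can be derived from a Constant-Q Transform; the dominant-bin discretisation below is a simplifying assumption for clarity, not necessarily MERT's actual target construction.

```python
# Illustrative sketch of a CQT "musical teacher": compute a Constant-Q
# Transform and turn it into per-frame pseudo-labels for masked prediction.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex('trumpet'))          # any mono audio works
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))

# One simple CQT-derived pseudo-label per frame: the dominant pitch bin.
pitch_targets = cqt.argmax(axis=0)                   # shape: (num_frames,)
# A student transformer would classify these targets (alongside the RVQ-VAE
# acoustic teacher's codes) at masked input frames.
print(pitch_targets[:10])
```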
Music On Canvas: A Quest to Generate Art That Evokes the Feeling of Music
Although the idea of connecting music and art dates back to ancient Greece, recent advancements in computing have made automating this connection feasible. This project represents a quest to transform music into art, using three methodologies, each an improvement toward generating images that convey the feelings and imaginations we experience while listening to music. The three methods respectively involve:
1. An element-wise mapping of sounds to colors.
2. Using song tags.
3. Tuning an Artificial Intelligence (AI) model to generate pictorial text captions.
To create artistic images, methods two and three utilize an existing text-to-image generative AI model.
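A toy sketch of what an element-wise sound-to-colour mapping (method one) might look like; the chroma-to-hue and loudness-to-brightness rules here are illustrative assumptions, not the project's actual mapping.

```python
# Toy element-wise mapping: pitch class drives hue, loudness drives brightness.
import colorsys
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex('trumpet'))
chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # (12 pitch classes, frames)
rms = librosa.feature.rms(y=y)[0]                  # per-frame loudness

hues = chroma.argmax(axis=0) / 12.0                # pitch class -> hue in [0, 1)
values = np.clip(rms / rms.max(), 0.2, 1.0)        # loudness -> brightness
colors = [colorsys.hsv_to_rgb(h, 1.0, v) for h, v in zip(hues, values)]
# `colors` holds one RGB triple per analysis frame, ready to paint as stripes.
```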
ENSA dataset: a dataset of songs by non-superstar artists tested with an emotional analysis based on time-series
This paper presents a novel dataset of songs by non-superstar artists in which a set of musical data is collected, identifying for each song its musical structure and the artist's emotional perception through a categorical emotional labeling process. The generation of this preliminary dataset is motivated by biases that have been detected in the most widely used datasets in the field of emotion-based music recommendation. The new dataset contains 234 minutes of audio and 60 complete, labeled songs. In addition, an emotional analysis is carried out by representing dynamic emotional perception as a time series: the similarity values generated by the dynamic time warping (DTW) algorithm are analyzed and then used to implement a clustering process with the K-means algorithm. Clustering is also implemented with Uniform Manifold Approximation and Projection (UMAP), a manifold-learning and dimensionality-reduction technique, and the HDBSCAN algorithm is applied to determine the optimal number of clusters. The results obtained from the different clustering strategies are compared and, in a preliminary analysis, show significant consistency. With the findings and experimental results obtained, a discussion is presented highlighting the importance of working with complete songs, preferably with a well-defined musical structure, and of considering the emotional variation that characterizes a song during the listening experience, in which the intensity of the emotion usually changes between verse, bridge, and chorus.
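The core of the analysis, pairwise DTW over per-song emotion time series followed by clustering, can be sketched as follows; the dummy data and series shapes are placeholders rather than the ENSA format.

```python
# Hedged sketch: pairwise DTW distances between emotion curves of different
# lengths, then K-means on each song's distance profile.
import numpy as np
from sklearn.cluster import KMeans

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance for 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

songs = [np.random.randn(50 + 10 * k) for k in range(12)]    # dummy emotion curves
dist = np.array([[dtw(a, b) for b in songs] for a in songs])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(dist)   # rows as features
print(labels)
```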
Emotion and themes recognition in music utilising convolutional and recurrent neural networks
Emotion is an inherent aspect of music, and associations with music can be made through both life experience and the specific musical techniques applied by the composer. Computational approaches to music recognition are well established in the research community; however, deep approaches have been limited and are not yet comparable to conventional approaches. In this study, we present our fusion system of end-to-end convolutional recurrent neural networks (CRNN) and pre-trained convolutional feature extractors for music emotion and theme recognition. We train 9 models and conduct various late-fusion experiments. Our best-performing model (team name: AugLi) achieves 74.2% ROC-AUC on the test partition, which is 1.6 percentage points above the baseline system of the MediaEval 2019 Emotions and Themes in Music task.
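The late-fusion step reduces to averaging per-tag probabilities across the independently trained models before scoring. A minimal sketch, with illustrative shapes (the paper fuses 9 models):

```python
# Minimal late-fusion sketch: average each model's sigmoid outputs per tag,
# then score the fused predictions with macro ROC-AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

n_models, n_clips, n_tags = 3, 200, 56
preds = np.random.rand(n_models, n_clips, n_tags)    # per-model probabilities
labels = np.random.randint(0, 2, (n_clips, n_tags))  # multi-label ground truth

fused = preds.mean(axis=0)                           # unweighted late fusion
print(roc_auc_score(labels, fused, average='macro'))
```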