40 research outputs found

    PoLyScriber: Integrated Training of Extractor and Lyrics Transcriber for Polyphonic Music

    Full text link
    Lyrics transcription of polyphonic music is challenging as the background music affects lyrics intelligibility. Typically, lyrics transcription can be performed by a two step pipeline, i.e. singing vocal extraction frontend, followed by a lyrics transcriber backend, where the frontend and backend are trained separately. Such a two step pipeline suffers from both imperfect vocal extraction and mismatch between frontend and backend. In this work, we propose a novel end-to-end integrated training framework, that we call PoLyScriber, to globally optimize the vocal extractor front-end and lyrics transcriber backend for lyrics transcription in polyphonic music. The experimental results show that our proposed integrated training model achieves substantial improvements over the existing approaches on publicly available test datasets.Comment: 13 page

    MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers

    Full text link
    Music annotation has always been one of the critical topics in the field of Music Information Retrieval (MIR). Traditional models use supervised learning for music annotation tasks. However, as supervised machine learning approaches increase in complexity, the increasing need for more annotated training data can often not be matched with available data. In this paper, a new self-supervised music acoustic representation learning approach named MusiCoder is proposed. Inspired by the success of BERT, MusiCoder builds upon the architecture of self-attention bidirectional transformers. Two pre-training objectives, including Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to adapt BERT-like masked reconstruction pre-training to continuous acoustic frame domain. The performance of MusiCoder is evaluated in two downstream music annotation tasks. The results show that MusiCoder outperforms the state-of-the-art models in both music genre classification and auto-tagging tasks. The effectiveness of MusiCoder indicates a great potential of a new self-supervised learning approach to understand music: first apply masked reconstruction tasks to pre-train a transformer-based model with massive unlabeled music acoustic data, and then finetune the model on specific downstream tasks with labeled data

    MARBLE: Music Audio Representation Benchmark for Universal Evaluation

    Get PDF
    In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at this https URL to promote future music AI research

    FMA: A Dataset For Music Analysis

    Full text link
    We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition. Code, data, and usage examples are available at https://github.com/mdeff/fmaComment: ISMIR 2017 camera-read

    MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

    Full text link
    Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges associated with modelling musical knowledge, particularly its tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified a superior combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). These teachers effectively guide our student model, a BERT-style transformer encoder, to better model music audio. In addition, we introduce an in-batch noise mixture augmentation to enhance the representation robustness. Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attains state-of-the-art (SOTA) overall scores. The code and models are online: https://github.com/yizhilll/MERT

    Music On Canvas: A Quest to Generate Art That Evokes the Feeling of Music

    Get PDF
    Although the idea of connecting music and art dates back to ancient Greece, recent advancements in computing have made automating this feasible. This project represents a quest to transform music into art, using three methodologies where each is an improvement towards generating images that convey our feelings and imaginations during music listening. The three methods respectively involve: 1. An element-wise mapping of sound and colors2. Using song tags3. Tuning an Artificial Intelligence (AI) model to generate pictorial text captions. To create artistic images, methods two and three utilize an existing text-to-image generative AI

    ENSA dataset: a dataset of songs by non-superstar artists tested with an emotional analysis based on time-series

    Get PDF
    This paper presents a novel dataset of songs by non-superstar artists in which a set of musical data is collected, identifying for each song its musical structure, and the emotional perception of the artist through a categorical emotional labeling process. The generation of this preliminary dataset is motivated by the existence of biases that have been detected in the analysis of the most used datasets in the field of emotion-based music recommendation. This new dataset contains 234 min of audio and 60 complete and labeled songs. In addition, an emotional analysis is carried out based on the representation of dynamic emotional perception through a time-series approach, in which the similarity values generated by the dynamic time warping (DTW) algorithm are analyzed and then used to implement a clustering process with the K-means algorithm. In the same way, clustering is also implemented with a Uniform Manifold Approximation and Projection (UMAP) technique, which is a manifold learning and dimension reduction algorithm. The algorithm HDBSCAN is applied for determining the optimal number of clusters. The results obtained from the different clustering strategies are compared and, in a preliminary analysis, a significant consistency is found between them. With the findings and experimental results obtained, a discussion is presented highlighting the importance of working with complete songs, preferably with a well-defined musical structure, considering the emotional variation that characterizes a song during the listening experience, in which the intensity of the emotion usually changes between verse, bridge, and chorus

    Emotion and themes recognition in music utilising convolutional and recurrent neural networks

    Get PDF
    Emotion is an inherent aspect of music, and associations to music can be made via both life experience and specific musical techniques applied by the composer. Computational approaches for music recognition have been well-established in the research community; however, deep approaches have been limited and not yet comparable to conventional approaches. In this study, we present our fusion system of end-to-end convolutional recurrent neural networks (CRNN) and pre-trained convolutional feature extractors for music emotion and theme recognition1. We train 9 models and conduct various late fusion experiments. Our best performing model (team name: AugLi) achieves 74.2 % ROC-AUC on the test partition which is 1.6 percentage points over the baseline system of the MediaEval 2019 Emotion & Themes in Music task
    corecore