Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task
Benefiting from large-scale datasets and pre-trained models, the field of
generative models has recently gained significant momentum. However, most
datasets for symbolic music are very small, which potentially limits the
performance of data-driven multimodal models. An intuitive solution to this
problem is to leverage pre-trained models from other modalities (e.g., natural
language) to improve the performance of symbolic music-related multimodal
tasks. In this paper, we carry out the first study of generating complete and
semantically consistent symbolic music scores from text descriptions, and
explore the efficacy of using publicly available checkpoints (i.e., BERT,
GPT-2, and BART) for natural language processing in the task of text-to-music
generation. Our experimental results show that the improvement from using
pre-trained checkpoints is statistically significant in terms of BLEU score and
edit distance similarity. We analyse the capabilities and limitations of our
model to better understand the potential of language-music models.
Comment: 5 pages, 2 figures, 2 tables
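As a concrete illustration of the checkpoint-reuse idea, the sketch below (not the paper's code) fine-tunes a publicly available BART checkpoint to map a text description onto a flattened textual score. The checkpoint id `facebook/bart-base` is one plausible choice, and the event-style music encoding and the training pair are hypothetical placeholders; the final lines show one simple proxy for an edit-distance-similarity metric using difflib.

```python
# A minimal sketch, assuming symbolic music is serialized as a token string;
# the encoding and the example pair below are illustrative, not the paper's.
import difflib
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical training pair: a description and a flattened textual score.
text = "A gentle piano melody in C major at a slow tempo."
score = "note_C4 dur_4 note_E4 dur_4 note_G4 dur_2"

# One fine-tuning step: the description is the source, the score the target.
batch = tokenizer([text], text_target=[score], return_tensors="pt")
loss = model(**batch).loss
loss.backward()  # in practice, wrap this in a full training loop

# Inference: generate a score-like token sequence for a new description.
inputs = tokenizer(["An energetic folk tune in G major."], return_tensors="pt")
ids = model.generate(**inputs, max_length=128, num_beams=4)
prediction = tokenizer.decode(ids[0], skip_special_tokens=True)

# A simple proxy for an edit-distance-similarity score against a reference.
print(difflib.SequenceMatcher(None, prediction, score).ratio())
```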
Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity
of large-scale publicly available music datasets with natural language
captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA),
capable of answering music-related questions and generating captions for music
files. Our model utilizes audio representations from a pretrained MERT model to
extract music features. However, obtaining a suitable dataset for training the
MU-LLaMA model remains challenging, as existing publicly accessible audio
question answering datasets lack the necessary depth for open-ended music
question answering. To fill this gap, we present a methodology for generating
question-answer pairs from existing audio captioning datasets and introduce the
MusicQA Dataset designed for answering open-ended music-related questions. The
experiments demonstrate that the proposed MU-LLaMA model, trained on our
designed MusicQA dataset, achieves outstanding performance in both music
question answering and music caption generation across various metrics,
outperforming current state-of-the-art (SOTA) models in both fields and
offering a promising advancement in the T2M-Gen research field.
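Since the abstract names a pre-trained MERT model as the source of audio representations, the following is a minimal sketch, under stated assumptions, of pulling layer-wise music features from a released MERT checkpoint. The checkpoint id `m-a-p/MERT-v1-95M`, the synthetic waveform, and the crude mean-pooling are illustrative; the adapter that feeds such features into the LLaMA model is not reproduced here.

```python
# A hedged sketch of MERT feature extraction; not MU-LLaMA's actual pipeline.
import torch
from transformers import AutoModel, Wav2Vec2FeatureExtractor

ckpt = "m-a-p/MERT-v1-95M"  # one publicly released MERT checkpoint
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt, trust_remote_code=True)

# Stand-in for a decoded music file: 5 seconds of audio at MERT's sample rate.
sr = processor.sampling_rate
waveform = torch.randn(sr * 5)

inputs = processor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Stack all transformer layers; downstream adapters often learn a weighted
# combination over layers rather than using only the last one.
layers = torch.stack(out.hidden_states)          # (layers, batch, time, dim)
music_features = layers.mean(dim=0).mean(dim=1)  # crude pooled feature vector
print(music_features.shape)
```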
MusCaps: generating captions for music audio
Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs through a multimodal encoder, and leverages pre-training on audio data to obtain representations that effectively capture and summarise musical features in the input. Evaluation of the generated captions through automatic metrics shows that our method outperforms a baseline designed for non-music audio captioning. Through an ablation study, we find that this performance boost can be mainly attributed to pre-training of the audio encoder, while other design choices (modality fusion, decoding strategy, and the use of attention) contribute only marginally. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.
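The described recipe (a convolutional encoder producing a sequence of audio frame embeddings, and a recurrent decoder with temporal attention over them) can be sketched compactly. The PyTorch skeleton below is illustrative only, with made-up dimensions and vocabulary size; it is not the MusCaps implementation.

```python
# A minimal CNN-encoder / attentive-LSTM-decoder captioning skeleton,
# in the spirit of the described architecture; all sizes are placeholders.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """CNN over a log-mel spectrogram -> sequence of frame embeddings."""
    def __init__(self, n_mels=64, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, mel):                    # mel: (batch, n_mels, time)
        return self.conv(mel).transpose(1, 2)  # (batch, time', dim)

class AttnDecoder(nn.Module):
    """LSTM decoder with additive temporal attention over audio frames."""
    def __init__(self, vocab=5000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = nn.Linear(2 * dim, 1)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, audio):  # tokens: (b, T), audio: (b, S, dim)
        b, T = tokens.shape
        h = audio.new_zeros(b, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            # Score each audio frame against the current decoder state.
            q = h.unsqueeze(1).expand(-1, audio.size(1), -1)
            w = torch.softmax(
                self.attn(torch.cat([q, audio], -1)).squeeze(-1), -1)
            ctx = (w.unsqueeze(-1) * audio).sum(1)  # attention context
            step_in = torch.cat([self.embed(tokens[:, t]), ctx], -1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, 1)  # (batch, T, vocab)

mel = torch.randn(2, 64, 400)           # dummy batch of spectrograms
caps = torch.randint(0, 5000, (2, 12))  # dummy caption token ids
print(AttnDecoder()(caps, AudioEncoder()(mel)).shape)  # (2, 12, 5000)
```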
Creation of a New Domain and Evaluation of Comparison Generation in a Natural Language Generation System
We describe the creation of a new domain for the Methodius Natural Language Generation System, and an evaluation of Methodius' parameterized comparison generation algorithm. The new domain was based around music and performers, and texts about the domain were generated using Methodius. Our evaluation showed that test subjects learned more from texts that contained comparisons than from those that did not. We also established that the comparison generation algorithm could generalize to the music domain.
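A parameterized comparison generator of the kind evaluated here can be illustrated with a few lines of rule-based text generation. The sketch below is hypothetical and far simpler than Methodius, but it shows the core move: grouping attributes two entities share and contrasting the ones on which they differ.

```python
# A hypothetical, minimal comparison generator; not Methodius' actual code.
def compare(name_a, a, name_b, b):
    shared = {k: a[k] for k in a if b.get(k) == a[k]}
    sentences = []
    if shared:
        attrs = " and ".join(f"{v} {k}" for k, v in shared.items())
        sentences.append(f"Like {name_b}, {name_a} has a {attrs}.")
    for k in a:
        if k in b and b[k] != a[k]:
            sentences.append(
                f"However, {name_a} has a {a[k]} {k}, "
                f"whereas {name_b} has a {b[k]} {k}.")
    return " ".join(sentences)

sonata = {"tempo": "slow", "mode": "minor", "texture": "sparse"}
etude = {"tempo": "fast", "mode": "minor", "texture": "sparse"}
print(compare("the sonata", sonata, "the etude", etude))
# -> Like the etude, the sonata has a minor mode and sparse texture.
#    However, the sonata has a slow tempo, whereas the etude has a fast tempo.
```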