Contrastive audio-language learning for music
As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application-focused fields like Music Information Retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this end, we propose MusCALL, a framework for Music Contrastive Audio-Language Learning. Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out-of-the-box. Thanks to this property, MusCALL can be transferred to virtually any task that can be cast as text-based retrieval. Our experiments show that our method performs significantly better than the baselines at retrieving audio that matches a textual description and, conversely, text that matches an audio query. We also demonstrate that the multimodal alignment capability of our model can be successfully extended to the zero-shot transfer scenario for genre classification and auto-tagging on two public datasets.
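The core training signal in such a dual-encoder setup is a symmetric contrastive loss over matched audio-text pairs. The following is a minimal PyTorch sketch of that idea; the linear projections, embedding size and temperature are illustrative assumptions, not MusCALL's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, embed_dim=256):
        super().__init__()
        # stand-ins for the real audio and text backbones (assumed)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, audio_feats, text_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return a, t

def contrastive_loss(a, t, logit_scale):
    # symmetric InfoNCE: matched audio-text pairs are positives within the batch
    logits = logit_scale.exp() * a @ t.t()
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

model = DualEncoder()
audio_feats = torch.randn(8, 128)   # e.g. pooled spectrogram features (assumed)
text_feats = torch.randn(8, 768)    # e.g. pooled sentence embeddings (assumed)
a, t = model(audio_feats, text_feats)
loss = contrastive_loss(a, t, model.logit_scale)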
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNNs) are able to extract higher-level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer term temporal context in the audio
signals. CNNs and RNNs as classifiers have recently shown improved performance
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
to a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.
Comment: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis.
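As a rough illustration of the architecture described above, the PyTorch sketch below stacks convolutional blocks that pool over frequency, a recurrent layer that models temporal context, and a per-frame sigmoid head for overlapping events; all layer sizes are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=6, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 5)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        freq_out = n_mels // 5 // 4            # frequency bins left after pooling
        self.rnn = nn.GRU(64 * freq_out, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, time, n_mels)
        z = self.cnn(x)                        # (batch, ch, time, freq')
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, time, ch * freq')
        z, _ = self.rnn(z)                     # temporal context across frames
        return torch.sigmoid(self.head(z))     # frame-wise event activities

sed = CRNN()
probs = sed(torch.randn(2, 1, 500, 40))        # -> (2, 500, 6)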
Segmentation process and spectral characteristics in the determination of musical genres
Over the past few years there has been a tendency to store audio tracks for later use on CD-DVDs, HDD-SSDs, as well as on the internet, which makes it challenging to classify the information either online or offline. For this purpose, the audio tracks must be tagged. Tags are texts based on the semantic information of the sound [1]. Music analysis can thus be done in several ways [2], since music is identified by its genre, artist, instruments and structure, through a tagging system that can be manual or automatic. Manual tagging allows the visualization of the behavior of an audio track either in the time domain or in the frequency domain, as in the spectrogram, making it possible to classify songs without listening to them. However, this process is very time consuming and labor intensive, and can even cause health problems [3]; as noted, "the volume, sound sensitivity, time and cost required for a manual labeling process is generally prohibitive". Three fundamental steps are required to carry out automatic labelling: pre-processing, feature extraction and classification [4]. The present study developed an algorithm for automatic classification of music genres using a segmentation process that employs spectral characteristics such as spectral centroid (SC), spectral flatness (SF) and spectral spread (SS), as well as a temporal spectral characteristic.
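For concreteness, the snippet below computes the kind of frame-level spectral features named above with librosa (using spectral bandwidth as a proxy for spread) and pools them into a per-track descriptor; the study's own segmentation procedure and classifier are not reproduced here, and "example.wav" is a placeholder path.

import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050, mono=True)

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # SC, per frame
flatness = librosa.feature.spectral_flatness(y=y)          # SF, per frame
spread = librosa.feature.spectral_bandwidth(y=y, sr=sr)    # SS (bandwidth as spread)

# Summarise each frame-level feature into a fixed-length descriptor per track
features = np.array([f.mean() for f in (centroid, flatness, spread)] +
                    [f.std() for f in (centroid, flatness, spread)])
print(features.shape)   # (6,) -> input to a genre classifier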
An audio-visual approach to web video categorization
In this paper we address the issue of automatic video genre categorization of web media using an audio-visual approach. To this end, we propose content descriptors which exploit audio, temporal structure and color information. The potential of our descriptors is experimentally validated both from the perspective of a classification system and as an information retrieval approach. Validation is carried out on a real scenario, namely more than 288 hours of video footage and 26 video genres specific to the blip.tv media platform. Additionally, to reduce the semantic gap, we propose a new relevance feedback technique based on hierarchical clustering. Experimental tests show that retrieval performance can be significantly increased in this case, becoming comparable to that obtained with high-level semantic textual descriptors.
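A loose sketch of relevance feedback driven by hierarchical clustering, in the spirit of the technique above: cluster the descriptors of the retrieved items, then promote items that share a cluster with user-marked relevant results. The descriptor dimensionality, cluster count and boosting rule below are assumptions for illustration, not the paper's exact method.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(50, 32))       # audio/temporal/color descriptors (assumed)
relevant = {0, 3, 7}                          # indices the user marked as relevant

labels = AgglomerativeClustering(n_clusters=5).fit_predict(descriptors)
relevant_clusters = {labels[i] for i in relevant}

# Boost items that fall in a cluster containing relevant feedback
scores = rng.random(50)                       # initial retrieval scores (placeholder)
boosted = scores + 0.5 * np.isin(labels, list(relevant_clusters))
reranked = np.argsort(-boosted)
print(reranked[:10])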
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities
The auditory system plays a substantial role in shaping the overall human
perceptual experience. While prevailing large language models (LLMs) and visual
language models (VLMs) have shown their promise in solving a wide variety of
vision and language understanding tasks, only a few of them can be generalised
to the audio domain without compromising their domain-specific capacity. In
this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending
LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT
applies an instruction-aware audio aligner to generate soft prompts,
conditioned on both input text and sounds, as language model inputs. To
mitigate the data scarcity in the audio domain, a multi-task learning strategy
is proposed by formulating diverse audio tasks in a sequence-to-sequence
manner. Moreover, we improve the audio language model framework by using
interleaved audio-text embeddings as the input sequence. This improved
framework imposes zero constraints on the input format and thus is capable of
tackling more understanding tasks, such as few-shot audio classification and
audio reasoning. To further evaluate the reasoning ability of audio networks,
we propose natural language audio reasoning (NLAR), a new task that analyses
two audio clips by comparison and summarization. Experiments show that
APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the
expert models (i.e., the networks trained on the targeted datasets) across
various tasks. Finally, we demonstrate APT's ability to extend frozen
VLMs to the audio domain without finetuning, achieving promising results on the
audio-visual question answering task. Our code and model weights are
released at https://github.com/JinhuaLiang/APT
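The soft-prompting idea can be illustrated with a small trainable aligner that turns audio features, conditioned on the instruction, into a handful of prompt vectors in the frozen LLM's embedding space. All module names and sizes below are assumptions for illustration, not the released APT code.

import torch
import torch.nn as nn

class AudioAligner(nn.Module):
    def __init__(self, audio_dim=512, llm_dim=1024, n_prompts=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_prompts, llm_dim))
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, audio_feats, instr_embeds):
        # audio_feats: (batch, T_audio, audio_dim); instr_embeds: (batch, T_text, llm_dim)
        ctx = torch.cat([self.audio_proj(audio_feats), instr_embeds], dim=1)
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        prompts, _ = self.attn(q, ctx, ctx)    # instruction-aware soft prompts
        return prompts                         # (batch, n_prompts, llm_dim)

aligner = AudioAligner()
audio = torch.randn(2, 100, 512)               # frame-level audio embeddings (assumed)
instr = torch.randn(2, 12, 1024)               # instruction token embeddings of a frozen LLM
soft_prompts = aligner(audio, instr)
# The soft prompts would then be prepended to the text embeddings before the
# frozen LLM, e.g. llm_inputs = torch.cat([soft_prompts, instr], dim=1)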
Pre-training Music Classification Models via Music Source Separation
In this paper, we study whether music source separation can be used as a
pre-training strategy for music representation learning, targeted at music
classification tasks. To this end, we first pre-train U-Net networks under
various music source separation objectives, such as the isolation of vocal or
instrumental sources from a musical piece; afterwards, we attach a
convolutional tail network to the pre-trained U-Net and jointly finetune the
whole network. The features learned by the separation network are also
propagated to the tail network through skip connections. Experimental results
on two widely used and publicly available datasets indicate that pre-training
the U-Nets with a music source separation objective can improve performance
compared to both training the whole network from scratch and using the tail
network as a standalone model in two music classification tasks: music auto-tagging,
when vocal separation is used, and music genre classification in the case of
multi-source separation.
Comment: 5 pages (4 + references), 3 figures. ICASSP-24 submission.
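A minimal sketch of the two-stage recipe: a (tiny) U-Net is first trained on a separation objective, then a convolutional tail classifier is attached to its features and the whole network is finetuned. Architecture sizes and the way features are passed to the tail are assumptions for illustration, not the paper's exact setup.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, spec):                   # spec: (batch, 1, freq, time)
        h = self.enc(spec)                     # encoder features shared with the tail
        mask = torch.sigmoid(self.dec(h))      # separation mask for the target source
        return mask * spec, h

class ClassifierTail(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.tail = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, h):
        return self.tail(h)

unet, tail = TinyUNet(), ClassifierTail()
spec = torch.randn(4, 1, 128, 256)
# Stage 1 (pre-training): optimise the separated output against the target source
separated, feats = unet(spec)
# Stage 2 (finetuning): feed the U-Net features to the tail and train both jointly
logits = tail(feats)                           # (4, n_classes)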