Contrastive audio-language learning for music
As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application-focused fields like Music Information Retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this end, we propose MusCALL, a framework for Music Contrastive Audio-Language Learning. Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out-of-the-box. Thanks to this property, MusCALL can be transferred to virtually any task that can be cast as text-based retrieval. Our experiments show that our method performs significantly better than the baselines at retrieving audio that matches a textual description and, conversely, text that matches an audio query. We also demonstrate that the multimodal alignment capability of our model can be successfully extended to the zero-shot transfer scenario for genre classification and auto-tagging on two public datasets.
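The core training signal in such a dual-encoder setup is a symmetric contrastive loss over matched audio-text pairs. The following is a minimal PyTorch sketch of that idea; the linear projections, embedding size and temperature are illustrative assumptions, not MusCALL's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, embed_dim=256):
        super().__init__()
        # stand-ins for the real audio and text backbones (assumed)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, audio_feats, text_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return a, t

def contrastive_loss(a, t, logit_scale):
    # symmetric InfoNCE: matched audio-text pairs are positives within the batch
    logits = logit_scale.exp() * a @ t.t()
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

model = DualEncoder()
audio_feats = torch.randn(8, 128)   # e.g. pooled spectrogram features (assumed)
text_feats = torch.randn(8, 768)    # e.g. pooled sentence embeddings (assumed)
a, t = model(audio_feats, text_feats)
loss = contrastive_loss(a, t, model.logit_scale)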
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNNs) are able to extract higher-level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer term temporal context in the audio
signals. CNNs and RNNs as classifiers have recently shown improved performance
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
to a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.
Comment: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis.
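As a rough illustration of the architecture described above, the PyTorch sketch below stacks convolutional blocks that pool over frequency, a recurrent layer that models temporal context, and a per-frame sigmoid head for overlapping events; all layer sizes are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=6, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 5)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        freq_out = n_mels // 5 // 4            # frequency bins left after pooling
        self.rnn = nn.GRU(64 * freq_out, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, time, n_mels)
        z = self.cnn(x)                        # (batch, ch, time, freq')
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, time, ch * freq')
        z, _ = self.rnn(z)                     # temporal context across frames
        return torch.sigmoid(self.head(z))     # frame-wise event activities

sed = CRNN()
probs = sed(torch.randn(2, 1, 500, 40))        # -> (2, 500, 6)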
Segmentation process and spectral characteristics in the determination of musical genres
Over the past few years there has been a tendency to store audio tracks for later use on CD-DVDs, HDD-SSDs, as well as on the internet, which makes it challenging to classify the information either online or offline. For this purpose, the audio tracks must be tagged. Tags are texts based on the semantic information of the sound [1]. Music analysis can thus be done in several ways [2], since music is identified by its genre, artist, instruments and structure, through a tagging system that can be manual or automatic. Manual tagging allows the visualization of the behavior of an audio track either in the time domain or in the frequency domain, as in the spectrogram, making it possible to classify songs without listening to them. However, this process is very time consuming and labor intensive, and can even cause health problems [3]; as noted, "the volume, sound sensitivity, time and cost required for a manual labeling process is generally prohibitive". Three fundamental steps are required to carry out automatic labelling: pre-processing, feature extraction and classification [4]. The present study developed an algorithm for automatic classification of music genres using a segmentation process that employs spectral characteristics such as spectral centroid (SC), spectral flatness (SF) and spectral spread (SS), as well as a temporal spectral characteristic.
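For concreteness, the snippet below computes the kind of frame-level spectral features named above with librosa (using spectral bandwidth as a proxy for spread) and pools them into a per-track descriptor; the study's own segmentation procedure and classifier are not reproduced here, and "example.wav" is a placeholder path.

import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050, mono=True)

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # SC, per frame
flatness = librosa.feature.spectral_flatness(y=y)          # SF, per frame
spread = librosa.feature.spectral_bandwidth(y=y, sr=sr)    # SS (bandwidth as spread)

# Summarise each frame-level feature into a fixed-length descriptor per track
features = np.array([f.mean() for f in (centroid, flatness, spread)] +
                    [f.std() for f in (centroid, flatness, spread)])
print(features.shape)   # (6,) -> input to a genre classifier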
An audio-visual approach to web video categorization
In this paper we address the issue of automatic video genre categorization of web media using an audio-visual approach. To this end, we propose content descriptors which exploit audio, temporal structure and color information. The potential of our descriptors is experimentally validated both from the perspective of a classification system and as an information retrieval approach. Validation is carried out on a real scenario, namely more than 288 hours of video footage and 26 video genres specific to the blip.tv media platform. Additionally, to reduce the semantic gap, we propose a new relevance feedback technique based on hierarchical clustering. Experimental tests show that retrieval performance can be significantly increased in this case, becoming comparable to that obtained with high-level semantic textual descriptors.
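A loose sketch of relevance feedback driven by hierarchical clustering, in the spirit of the technique above: cluster the descriptors of the retrieved items, then promote items that share a cluster with user-marked relevant results. The descriptor dimensionality, cluster count and boosting rule below are assumptions for illustration, not the paper's exact method.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(50, 32))       # audio/temporal/color descriptors (assumed)
relevant = {0, 3, 7}                          # indices the user marked as relevant

labels = AgglomerativeClustering(n_clusters=5).fit_predict(descriptors)
relevant_clusters = {labels[i] for i in relevant}

# Boost items that fall in a cluster containing relevant feedback
scores = rng.random(50)                       # initial retrieval scores (placeholder)
boosted = scores + 0.5 * np.isin(labels, list(relevant_clusters))
reranked = np.argsort(-boosted)
print(reranked[:10])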
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities
The auditory system plays a substantial role in shaping the overall human
perceptual experience. While prevailing large language models (LLMs) and visual
language models (VLMs) have shown their promise in solving a wide variety of
vision and language understanding tasks, only a few of them can be generalised
to the audio domain without compromising their domain-specific capacity. In
this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending
LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT
applies an instruction-aware audio aligner to generate soft prompts,
conditioned on both input text and sounds, as language model inputs. To
mitigate the data scarcity in the audio domain, a multi-task learning strategy
is proposed by formulating diverse audio tasks in a sequence-to-sequence
manner. Moreover, we improve the audio language model framework by using
interleaved audio-text embeddings as the input sequence. This improved
framework imposes zero constraints on the input format and thus is capable of
tackling more understanding tasks, such as few-shot audio classification and
audio reasoning. To further evaluate the reasoning ability of audio networks,
we propose natural language audio reasoning (NLAR), a new task that analyses
two audio clips by comparison and summarization. Experiments show that
APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the
expert models (i.e., the networks trained on the targeted datasets) across
various tasks. Finally, we demonstrate APT's ability to extend frozen
VLMs to the audio domain without finetuning, achieving promising results on the
audio-visual question answering task. Our code and model weights are
released at https://github.com/JinhuaLiang/APT
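The soft-prompting idea can be illustrated with a small trainable aligner that turns audio features, conditioned on the instruction, into a handful of prompt vectors in the frozen LLM's embedding space. All module names and sizes below are assumptions for illustration, not the released APT code.

import torch
import torch.nn as nn

class AudioAligner(nn.Module):
    def __init__(self, audio_dim=512, llm_dim=1024, n_prompts=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_prompts, llm_dim))
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, audio_feats, instr_embeds):
        # audio_feats: (batch, T_audio, audio_dim); instr_embeds: (batch, T_text, llm_dim)
        ctx = torch.cat([self.audio_proj(audio_feats), instr_embeds], dim=1)
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        prompts, _ = self.attn(q, ctx, ctx)    # instruction-aware soft prompts
        return prompts                         # (batch, n_prompts, llm_dim)

aligner = AudioAligner()
audio = torch.randn(2, 100, 512)               # frame-level audio embeddings (assumed)
instr = torch.randn(2, 12, 1024)               # instruction token embeddings of a frozen LLM
soft_prompts = aligner(audio, instr)
# The soft prompts would then be prepended to the text embeddings before the
# frozen LLM, e.g. llm_inputs = torch.cat([soft_prompts, instr], dim=1)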
Pre-training Music Classification Models via Music Source Separation
In this paper, we study whether music source separation can be used as a
pre-training strategy for music representation learning, targeted at music
classification tasks. To this end, we first pre-train U-Net networks under
various music source separation objectives, such as the isolation of vocal or
instrumental sources from a musical piece; afterwards, we attach a
convolutional tail network to the pre-trained U-Net and jointly finetune the
whole network. The features learned by the separation network are also
propagated to the tail network through skip connections. Experimental results
on two widely used and publicly available datasets indicate that pre-training
the U-Nets with a music source separation objective can improve performance
compared to both training the whole network from scratch and using the tail
network as a standalone model in two music classification tasks: music auto-tagging,
when vocal separation is used, and music genre classification in the case of
multi-source separation.
Comment: 5 pages (4 + references), 3 figures. ICASSP-24 submission.
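A minimal sketch of the two-stage recipe: a (tiny) U-Net is first trained on a separation objective, then a convolutional tail classifier is attached to its features and the whole network is finetuned. Architecture sizes and the way features are passed to the tail are assumptions for illustration, not the paper's exact setup.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, spec):                   # spec: (batch, 1, freq, time)
        h = self.enc(spec)                     # encoder features shared with the tail
        mask = torch.sigmoid(self.dec(h))      # separation mask for the target source
        return mask * spec, h

class ClassifierTail(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.tail = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, h):
        return self.tail(h)

unet, tail = TinyUNet(), ClassifierTail()
spec = torch.randn(4, 1, 128, 256)
# Stage 1 (pre-training): optimise the separated output against the target source
separated, feats = unet(spec)
# Stage 2 (finetuning): feed the U-Net features to the tail and train both jointly
logits = tail(feats)                           # (4, n_classes)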