Fusion of Multimodal Information in Music Content Analysis
Music is often processed through its acoustic realization. This is restrictive in the sense that music is clearly a highly multimodal concept, where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to approach music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a concise overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In light of these case studies, some perspectives on multimodality in music processing are finally suggested.
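As a rough illustration of the cross-modal correlation idea mentioned in the first case study, the sketch below uses canonical correlation analysis to find maximally correlated projections of synchronized audio and video feature streams. This is not the article's implementation; the feature dimensions, the synthetic data, and the use of scikit-learn's CCA class are assumptions made only for the example.

```python
# Minimal sketch of cross-modal correlation between audio and video features.
# Assumes two synchronized feature matrices (frames x dims); not the article's code.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_frames = 500
audio_feats = rng.normal(size=(n_frames, 20))   # e.g. timbre descriptors per frame
video_feats = rng.normal(size=(n_frames, 30))   # e.g. motion/colour descriptors per frame

# Project both modalities onto a shared low-dimensional space where they are
# maximally correlated; the canonical correlations quantify cross-modal agreement.
cca = CCA(n_components=3)
audio_proj, video_proj = cca.fit_transform(audio_feats, video_feats)

corrs = [np.corrcoef(audio_proj[:, k], video_proj[:, k])[0, 1] for k in range(3)]
print("canonical correlations:", np.round(corrs, 3))
```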
Pretext Tasks selection for multitask self-supervised speech representation learning
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations that replace traditional input features in the downstream task. In audio and speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations that are effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been properly explored and understood. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. Experiments conducted on automatic speech recognition, speaker recognition, and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
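To make the idea of weighting partial pretext-task losses concrete, here is a minimal sketch, not the paper's code: a multitask self-supervised model in which each candidate pretext task contributes a regression loss on its pseudo-labels, combined through learnable weights. The encoder, the pseudo-label dimensions, and the softmax normalization of the weights are all assumptions made for illustration.

```python
# Sketch of combining several pretext-task losses with learnable weights.
# Encoder, pseudo-label dimensions and softmax weighting are illustrative assumptions.
import torch
import torch.nn as nn

class MultiPretextModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, task_dims=(1, 1, 13)):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # One linear head per pretext task (e.g. pitch, energy, MFCC regression).
        self.heads = nn.ModuleList([nn.Linear(hidden, d) for d in task_dims])
        # One learnable scalar per pretext task, normalized into weights below.
        self.task_logits = nn.Parameter(torch.zeros(len(task_dims)))

    def forward(self, x, pseudo_labels):
        h, _ = self.encoder(x)                        # (batch, time, hidden)
        weights = torch.softmax(self.task_logits, 0)  # weights over pretext tasks
        partial = [nn.functional.l1_loss(head(h), y)
                   for head, y in zip(self.heads, pseudo_labels)]
        total = sum(w * l for w, l in zip(weights, partial))
        return total, weights.detach()

model = MultiPretextModel()
x = torch.randn(4, 100, 80)                            # batch of input feature frames
labels = [torch.randn(4, 100, d) for d in (1, 1, 13)]  # per-task pseudo-labels
loss, w = model(x, labels)
loss.backward()
print("task weights:", w)
```

In this toy setup, the learned weights can then be read off to rank and select the pretext tasks, which is the spirit of the selection procedure described in the abstract.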
Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations
Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models, even with small annotated datasets. Despite this, speech SSL representations may fail when facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method designed for cases exhibiting such a mismatch in acoustic domains. It consists of applying properly calibrated data augmentations to a large clean dataset, bringing it closer to the target domain, and using it as part of an initial fine-tuning stage. Augmentations are automatically selected through the minimization of a conditional-dependence estimator based on the target dataset. The approach is validated in an oracle experiment with controlled distortions and on two amateur-collected low-resource domains, reaching better performance than the baselines in both cases.
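As a rough illustration of the selection step, and not the authors' implementation, the sketch below scores a few candidate augmentations with a placeholder dependence estimator between augmented-clean features and target-domain features, and keeps the candidate with the lowest score. The `dependence_score` function, the feature extractor, and the candidate list are all hypothetical stand-ins for the components described in the abstract.

```python
# Sketch of automatic augmentation selection; everything here is a placeholder:
# the candidate augmentations, the feature extractor and the dependence estimator.
import numpy as np

rng = np.random.default_rng(0)

def features(batch):
    # Stand-in feature extractor (a real system might use averaged log-mel frames).
    return np.stack([np.histogram(x, bins=32, range=(-1, 1), density=True)[0]
                     for x in batch])

def dependence_score(clean_feats, target_feats):
    # Hypothetical discrepancy used in place of the paper's conditional-dependence
    # estimator: distance between average domain features (an MMD-like proxy).
    return float(np.linalg.norm(clean_feats.mean(0) - target_feats.mean(0)))

# Candidate augmentations applied to clean waveforms (illustrative only).
candidates = {
    "add_noise": lambda x: x + 0.05 * rng.normal(size=x.shape),
    "gain_drop": lambda x: 0.3 * x,
    "clip":      lambda x: np.clip(x, -0.2, 0.2),
}

clean = [rng.uniform(-1, 1, 16000) for _ in range(8)]         # toy "large clean" set
target = [0.4 * rng.uniform(-1, 1, 16000) for _ in range(8)]  # toy low-resource target
target_feats = features(target)

scores = {name: dependence_score(features([aug(x) for x in clean]), target_feats)
          for name, aug in candidates.items()}
selected = min(scores, key=scores.get)
print(scores, "-> selected:", selected)
```

The selected augmentations would then be applied to the clean corpus before the initial fine-tuning stage, as the abstract describes.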
Exploring new features for music classification
Automatic music classification aims at grouping unknown songs into predefined categories such as music genre or induced emotion. To obtain perceptually relevant results, appropriate features must be designed that carry important information for semantic inference. In this paper, we explore novel features and evaluate them on an automatic music tagging task. The proposed features span various aspects of the music: timbre, textual metadata, visual descriptors of cover art, and features characterizing the lyrics of sung music. The merit of these novel features is then evaluated using a classification system based on a boosting algorithm over binary decision trees. Their effectiveness for the task at hand is discussed with reference to the very common Mel-frequency cepstral coefficient (MFCC) features. We show that some of these features alone bring useful information, and that the classification system benefits greatly from a description covering such diverse aspects of songs.
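To illustrate the classification back-end mentioned in the abstract (boosting over binary decision trees), here is a minimal scikit-learn sketch. The synthetic arrays standing in for timbre, metadata, cover-art, and lyrics descriptors, as well as the toy binary tag target, are assumptions made only for the example; they are not the paper's features or data.

```python
# Sketch of a boosting classifier on binary decision stumps over concatenated
# multimodal song descriptors; features and tag labels are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_songs = 300
timbre    = rng.normal(size=(n_songs, 13))   # e.g. MFCC statistics
metadata  = rng.normal(size=(n_songs, 5))    # e.g. textual metadata descriptors
cover_art = rng.normal(size=(n_songs, 8))    # e.g. visual descriptors of the cover
lyrics    = rng.normal(size=(n_songs, 10))   # e.g. lyric-based descriptors

X = np.hstack([timbre, metadata, cover_art, lyrics])
y = (X[:, 0] + X[:, 15] > 0).astype(int)     # toy binary tag (has tag / does not)

# Boosted binary decision trees (depth-1 stumps), one binary classifier per tag.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))
```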
Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes
Speech enhancement promises higher efficiency in ad-hoc microphone arrays than in constrained microphone arrays, thanks to the wide spatial coverage of the devices in the acoustic scene. However, speech enhancement in ad-hoc microphone arrays still raises many challenges. In particular, the algorithms should be able to handle a variable number of microphones, as some devices in the array might appear or disappear. In this paper, we propose a solution that can efficiently process the spatial information captured by the different devices of the microphone array while being robust to link failures. To do this, we use an attention mechanism to put more weight on the relevant signals sent throughout the array and to neglect redundant or empty channels.
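The following sketch, which is not the authors' model, illustrates the channel-attention idea: per-channel features are scored by an attention layer and combined into a single fused stream, with a mask so that missing or dropped-out nodes simply receive zero weight. The dimensions and the dot-product attention form are assumptions made for illustration.

```python
# Sketch of attention pooling over a variable number of microphone channels.
# Dimensions and the dot-product attention form are illustrative assumptions.
import torch
import torch.nn as nn

class ChannelAttentionPool(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))  # learned query over channels
        self.proj = nn.Linear(dim, dim)

    def forward(self, chan_feats, chan_mask):
        # chan_feats: (batch, channels, time, dim); chan_mask: (batch, channels) bools
        keys = self.proj(chan_feats)                            # (B, C, T, D)
        scores = torch.einsum("bctd,d->bct", keys, self.query)  # score per channel/frame
        scores = scores.masked_fill(~chan_mask[:, :, None], float("-inf"))
        weights = torch.softmax(scores, dim=1)                  # weights over channels
        return (weights.unsqueeze(-1) * chan_feats).sum(dim=1)  # (B, T, D) fused stream

pool = ChannelAttentionPool()
feats = torch.randn(2, 5, 100, 64)      # 2 utterances, up to 5 nodes, 100 frames
mask = torch.tensor([[1, 1, 1, 0, 0],   # two devices dropped out of the array
                     [1, 1, 1, 1, 1]], dtype=torch.bool)
fused = pool(feats, mask)
print(fused.shape)                       # torch.Size([2, 100, 64])
```

Because absent channels are masked before the softmax, the same module handles arrays with any number of active nodes, which is the robustness-to-link-failure property highlighted in the abstract.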