116 research outputs found

Fusion of Multimodal Information in Music Content Analysis

Author: Essid Slim
Publication venue: Dagstuhl Follow-Ups. Multimodal Music Processing
Publication date: 01/01/2012

Music is often processed through its acoustic realization. This is restrictive in the sense that music is clearly a highly multimodal concept where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to apprehend music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a synthetic overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In light of these case studies, some perspectives on multimodality in music processing are finally suggested.

Dagstuhl Research Online Publication Server

Pretext Tasks selection for multitask self-supervised speech representation learning

Author: Essid Slim
Parcollet Titouan
Zaiem Salah
Publication date: 15/10/2021

By solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations that replace traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to self-supervised representations that are effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker recognition and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
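The abstract above mentions weighting the partial losses of several pretext tasks. The following is a minimal sketch of that general idea only, not the authors' estimation procedure: the task names, the relevance scores and the use of a softmax to normalize them into weights are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary relevance scores into positive weights that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def combined_loss(partial_losses, weights):
    """Weighted sum of the per-task (partial) losses."""
    return sum(weights[name] * loss for name, loss in partial_losses.items())

# Hypothetical pretext tasks and relevance scores (assumed, not from the paper).
relevance_scores = {"pitch": 2.0, "loudness": 1.0, "voicing": 0.0}
weights = softmax(relevance_scores)

# Hypothetical partial losses observed at one training step.
partial_losses = {"pitch": 0.8, "loudness": 1.2, "voicing": 0.5}
total = combined_loss(partial_losses, weights)
```

Because the weights are positive and sum to one, the combined loss is a convex combination of the partial losses, so a task with a low weight contributes little to the training signal.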

arXiv.org e-Print Archive

Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations

Author: Essid Slim
Parcollet Titouan
Zaiem Salah
Publication date: 01/06/2023

Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models even with small annotated datasets. Despite this, speech SSL representations may fail when facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method, designed for cases exhibiting such a mismatch in acoustic domains. It consists of applying properly calibrated data augmentations to a large clean dataset, bringing it closer to the target domain, and using it as part of an initial fine-tuning stage. Augmentations are automatically selected through the minimization of a conditional-dependence estimator, based on the target dataset. The approach is validated in an oracle experiment with controlled distortions and on two amateur-collected low-resource domains, outperforming the baselines in both cases. Comment: 6 pages, INTERSPEECH 202

arXiv.org e-Print Archive

Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision

Author: Duong Ngoc
Essid Slim
Ozerov Alexey
Parekh Sanjeel
Pérez Patrick
Richard Gaël
Publication date: 07/11/2018

We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments on a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.

arXiv.org e-Print Archive

HAL-Rennes 1

Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes

Author: Essid Slim
Furnon Nicolas
Illina Irina
Serizel Romain
Publication date: 15/06/2021

Speech enhancement promises higher efficiency in ad-hoc microphone arrays than in constrained microphone arrays thanks to the wide spatial coverage of the devices in the acoustic scene. However, speech enhancement in ad-hoc microphone arrays still raises many challenges. In particular, the algorithms should be able to handle a variable number of microphones, as some devices in the array might appear or disappear. In this paper, we propose a solution that can efficiently process the spatial information captured by the different devices of the microphone array, while being robust to a link failure. To do this, we use an attention mechanism to put more weight on the relevant signals sent throughout the array and to neglect the redundant or empty channels.
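To illustrate how an attention mechanism can cope with a varying number of channels, here is a generic scaled dot-product attention sketch with masking of missing nodes. It is not the architecture of the paper: the function name `attend`, the query vector and the toy feature vectors are assumptions, and a missing node is simply represented by `None`.

```python
import math

def attend(query, channels):
    """Fuse per-node feature vectors with softmax attention.

    `channels` holds one feature vector per node, or None when a node
    is missing (for example after a link failure). Missing nodes get
    a weight of exactly zero.
    """
    present = [(i, c) for i, c in enumerate(channels) if c is not None]
    if not present:
        raise ValueError("no channel available")
    scale = math.sqrt(len(query))
    # Scaled dot product between the query and each available channel.
    scores = [sum(q * x for q, x in zip(query, c)) / scale for _, c in present]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [0.0] * len(channels)
    for (i, _), e in zip(present, exps):
        weights[i] = e / norm
    # Weighted sum of the available channels, dimension by dimension.
    fused = [sum(weights[i] * c[d] for i, c in present) for d in range(len(query))]
    return weights, fused

# Three nodes, the second one unavailable (toy values, assumed).
weights, fused = attend([1.0, 0.0], [[2.0, 0.0], None, [0.0, 2.0]])
```

Since the softmax is computed only over the nodes that are present, the same function handles any number of nodes without retraining or reshaping.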

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Exploring new features for music classification

Author: Essid Slim
Foucard Rémi
Lagrange Mathieu
Richard Gaël
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 03/07/2013

Automatic music classification aims at grouping unknown songs into predefined categories such as music genre or induced emotion. To obtain perceptually relevant results, one needs to design appropriate features that carry important information for semantic inference. In this paper, we explore novel features and evaluate them in an automatic music tagging task. The proposed features span various aspects of the music: timbre, textual metadata, visual descriptors of cover art, and features characterizing the lyrics of sung music. The merit of these novel features is then evaluated using a classification system based on a boosting algorithm on binary decision trees. Their effectiveness for the task at hand is discussed with reference to the very common Mel Frequency Cepstral Coefficients features. We show that some of these features alone bring useful information, and that the classification system takes great advantage of a description covering such diverse aspects of songs.

MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music

Author: Cantisani Giorgia
Essid Slim
Richard Gaël
Trégoat Gabriel
Publication venue: HAL CCSD
Publication date: 14/09/2019

We present MAD-EEG, a new, freely available dataset for studying EEG-based auditory attention decoding considering the challenging case of subjects attending to a target instrument in polyphonic music. The dataset represents the first music-related EEG dataset of its kind, enabling, in particular, studies on single-trial EEG-based attention decoding, while also opening the path for research on other EEG-based music analysis tasks. MAD-EEG has so far collected 20-channel EEG signals recorded from 8 subjects listening to solo, duo and trio music excerpts and attending to one pre-specified instrument. The proposed experimental setting differs from the ones previously considered as the stimuli are polyphonic and are played to the subject using speakers instead of headphones. The stimuli were designed considering variations in terms of number and type of instruments in the mixture, spatial rendering, music genre and melody that is played. Preliminary results obtained with a state-of-the-art stimulus reconstruction algorithm commonly used for speech stimuli show that the audio representation reconstructed from the EEG response is more correlated with that of the attended source than with that of the unattended source, indicating that the dataset is suitable for this kind of study.
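The correlation-based comparison described in the abstract can be sketched as follows. This is only an illustration of the decision step, assuming a reconstructed audio representation is already available: the instrument names and toy envelopes are invented, and the reconstruction from EEG itself is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def decode_attention(reconstructed, sources):
    """Return the source whose representation correlates best with the reconstruction."""
    scores = {name: pearson(reconstructed, env) for name, env in sources.items()}
    return max(scores, key=scores.get), scores

# Toy envelopes (assumed values, not taken from MAD-EEG).
reconstructed = [0.1, 0.9, 0.2, 0.8, 0.1, 0.7]
sources = {
    "guitar": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "bass": [1.0, 0.5, 1.0, 0.4, 0.9, 0.6],
}
attended, scores = decode_attention(reconstructed, sources)
```

In this toy case the reconstruction follows the guitar envelope, so the guitar is selected as the attended instrument.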