11 research outputs found

    Nonnegative Feature Learning Methods for Acoustic Scene Classification

    Get PDF
    This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit a simple deep neural network architecture to classify both low-level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that jointly exploits unsupervised nonnegative matrix factorization activation features and low-level time-frequency representations as inputs. Finally, we present a fusion of the proposed systems in order to further improve performance. The resulting systems are our submission for Task 1 of the DCASE 2017 challenge.
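As a rough illustration of the unsupervised activation features mentioned above, the sketch below factorises a nonnegative matrix with plain multiplicative-update NMF. The random matrix standing in for a real time-frequency representation and the simple Euclidean cost are both assumptions; the paper's exact algorithm (including the adapted scaling) is not reproduced here. The per-frame activations in `H` are the kind of features a classifier would then consume.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0):
    """Euclidean NMF via multiplicative updates: V ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Stand-in for a nonnegative time-frequency representation (e.g. a mel
# spectrogram): 64 frequency bins, 100 time frames.
V = np.random.default_rng(1).random((64, 100))
W, H = nmf(V, k=8)   # columns of W: spectral templates; rows of H: activations
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```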

    Supervised Group Nonnegative Matrix Factorisation With Similarity Constraints And Applications To Speaker Identification

    Get PDF
    This paper presents supervised feature learning approaches for speaker identification that rely on nonnegative matrix factorisation. Recent studies have shown that group nonnegative matrix factorisation and task-driven supervised dictionary learning can help perform effective feature learning for audio classification problems. This paper proposes to integrate a recent method that relies on group nonnegative matrix factorisation into a task-driven supervised framework for speaker identification. The goal is to capture both the speaker variability and the session variability while exploiting the discriminative learning aspect of the task-driven approach. Results on a subset of the ESTER corpus show that the proposed approach can be competitive with i-vectors. Index Terms: nonnegative matrix factorisation, feature learning, dictionary learning, online learning, speaker identification.
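The group-NMF model with similarity constraints is beyond a short sketch, but the underlying idea of class-specific nonnegative dictionaries can be illustrated as follows: learn one NMF dictionary per speaker and identify a held-out recording by its reconstruction error under each dictionary. Everything below (the synthetic "speakers", the plain Euclidean updates, the error-based decision) is a simplified stand-in, not the method of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_dictionary(V, k, n_iter=150):
    """Plain Euclidean NMF, V ~ W @ H; returns only the dictionary W."""
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W

def project(V, W, n_iter=150):
    """Nonnegative activations of V on a fixed dictionary W (H-update only)."""
    H = np.full((W.shape[1], V.shape[1]), 0.5)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    return H

# Two synthetic "speakers", each defined by its own set of spectral templates.
F, k = 32, 4
T0, T1 = rng.random((F, k)), rng.random((F, k))
make = lambda T, n: T @ rng.random((k, n))   # nonnegative mixtures of templates
W0 = fit_dictionary(make(T0, 80), k)
W1 = fit_dictionary(make(T1, 80), k)

# Identify a held-out recording from speaker 0 by reconstruction error.
test = make(T0, 40)
err = lambda V, W: float(np.linalg.norm(V - W @ project(V, W)))
predicted = 0 if err(test, W0) < err(test, W1) else 1
```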

    Machine listening techniques as a complement to video image analysis in forensics

    Get PDF
    Video is now one of the major sources of information for forensics. However, video documents can originate from various recording devices (CCTV, mobile devices, etc.) with inconsistent quality and can sometimes be recorded in challenging light or motion conditions. Therefore, the amount of information that can be extracted from video images alone can vary to a great extent. Most videos, however, also include an audio recording. Machine listening can then become a valuable complement to video image analysis in challenging scenarios. In this paper, the authors present a brief overview of some machine listening techniques and their application to the analysis of video documents for forensics. The applicability of these techniques to forensics problems is then discussed in the light of machine listening system performance.

    Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification

    Get PDF
    In this paper, we study the usefulness of various matrix factorization methods for learning features for the specific problem of acoustic scene classification (ASC). A common way of addressing ASC has been to engineer features capable of capturing the specificities of acoustic environments. Instead, we show that better representations of the scenes can be automatically learned from time-frequency representations using matrix factorization techniques. We mainly focus on extensions of Principal Component Analysis and Nonnegative Matrix Factorization, including sparse, kernel-based and convolutive variants, as well as a novel supervised dictionary learning variant. An experimental evaluation is performed on two of the largest available ASC datasets in order to compare and discuss the usefulness of these methods for the task. We show that the unsupervised learning methods provide better representations of acoustic scenes than the best conventional hand-crafted features on both datasets. Furthermore, the introduction of a novel nonnegative supervised matrix factorization model and of deep neural networks trained on spectrograms allows us to reach further improvements.
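Among the variants discussed, the sparse extension is easy to sketch: adding an L1 penalty on the activations to the Euclidean NMF cost only adds a constant to the denominator of the multiplicative update for H. The code below is a generic illustration under that assumption, with column renormalisation of W so the penalty cannot be dodged by rescaling; it is not the exact formulation evaluated in the paper.

```python
import numpy as np

def sparse_nmf(V, k, lam, n_iter=200, seed=0):
    """Euclidean NMF with an L1 penalty on the activations H: the penalty
    `lam` is added to the denominator of the multiplicative H-update, and
    columns of W are renormalised each step (product W @ H is preserved)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + lam + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
        norms = np.linalg.norm(W, axis=0) + 1e-9
        W /= norms
        H *= norms[:, None]          # rescale H so W @ H is unchanged
    return W, H

V = np.random.default_rng(1).random((64, 100))   # stand-in spectrogram
_, H_dense = sparse_nmf(V, k=8, lam=0.0)
_, H_sparse = sparse_nmf(V, k=8, lam=0.5)
frac_small = lambda H: float((H < 1e-3).mean())  # fraction of near-zero activations
```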

    Leveraging deep neural networks with nonnegative representations for improved environmental sound classification

    Get PDF
    This paper introduces the use of representations based on nonnegative matrix factorization (NMF) to train deep neural networks, with applications to environmental sound classification. Deep learning systems for sound classification usually rely on the network to learn meaningful representations from spectrograms or hand-crafted features. Instead, we introduce an NMF-based feature learning stage before training deep networks, whose usefulness is highlighted in this paper, especially for multi-source acoustic environments such as sound scenes. We rely on two established unsupervised and supervised NMF techniques to learn better input representations for deep neural networks. This allows us, with simple architectures, to reach performance competitive with more complex systems such as convolutional networks for acoustic scene classification. The proposed systems outperform neural networks trained on time-frequency representations on two acoustic scene classification datasets, as well as the best systems from the 2016 DCASE challenge.
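To illustrate the "simple architectures" part of the pipeline, the sketch below trains a minimal one-hidden-layer network (pure numpy, full-batch gradient descent) on synthetic feature vectors standing in for pooled NMF activations. The data, network size and hyperparameters are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature vectors standing in for pooled NMF activations of two
# scene classes: 8-dimensional points with class-dependent means.
n, k = 100, 8
X = np.vstack([rng.normal(0.0, 1.0, (n, k)),
               rng.normal(1.5, 1.0, (n, k))])
y = np.repeat([0, 1], n)

# Minimal one-hidden-layer network: ReLU hidden layer, sigmoid output,
# trained with full-batch gradient descent on the cross-entropy loss.
W1 = rng.normal(0.0, 0.1, (k, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.1, 16);      b2 = 0.0
lr = 0.1
for _ in range(300):
    h = np.maximum(X @ W1 + b1, 0.0)             # forward: hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # forward: class probability
    g = (p - y) / len(y)                         # d(loss)/d(logit)
    gh = np.outer(g, W2) * (h > 0)               # backprop through the ReLU
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum()
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(0)

h = np.maximum(X @ W1 + b1, 0.0)
p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
accuracy = float(((p > 0.5) == y).mean())        # training accuracy on toy data
```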

    Acoustic Features for Environmental Sound Analysis

    Get PDF
    Most of the time it is nearly impossible to differentiate between particular types of sound events from a waveform alone. Therefore, frequency-domain and time-frequency-domain representations have been used for years, providing representations of sound signals that are more in line with human perception. However, these representations are usually too generic and often fail to describe specific content that is present in a sound recording. A lot of work has been devoted to designing features that allow extracting such specific information, leading to a wide variety of hand-crafted features. In the past years, owing to the increasing availability of medium-scale and large-scale sound datasets, an alternative approach to feature extraction has become popular: so-called feature learning. Finally, processing the amount of data that is at hand nowadays can quickly become overwhelming, so it is of paramount importance to be able to reduce the size of the dataset in the feature space. This chapter describes the general processing chain used to convert a sound signal into a feature vector that can be efficiently exploited by a classifier, as well as the relation to features used for speech and music processing.
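The first stage of the processing chain described here (waveform, to time-frequency representation, to fixed-size feature vector) can be sketched in a few lines of numpy; the frame length, hop size and frame-averaged spectrum below are arbitrary illustrative choices.

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T   # (freq, time)

# Toy input: one second of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

S = spectrogram(x)                 # time-frequency representation
feature = S.mean(axis=1)           # crude fixed-size feature: frame-averaged spectrum
peak_hz = int(np.argmax(feature)) * sr / 256   # frequency of the dominant bin
```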

    Representation learning for acoustic scene analysis

    No full text
    This thesis focuses on the computational analysis of environmental sound scenes and events. The objective of such tasks is to automatically extract information about the context in which a sound has been recorded. Interest in this area of research has increased rapidly in the last few years, leading to constant growth in the number of works and proposed approaches. We explore and contribute to the main families of approaches to sound scene and event analysis, going from feature engineering to deep learning. Our work is centered on representation learning techniques based on nonnegative matrix factorization (NMF), which are particularly suited to the analysis of multi-source environments such as acoustic scenes. As a first approach, we propose a combination of image processing features with the goal of confirming that spectrograms contain enough information to discriminate sound scenes and events. From there, we leave the world of feature engineering to move towards automatically learning the features. The first step we take in that direction is to study the usefulness of matrix factorization for unsupervised feature learning, especially by relying on variants of NMF. Several of the compared approaches indeed allow us to outperform feature engineering approaches to such tasks. Next, we propose to improve the learned representations by introducing the TNMF model, a supervised variant of NMF. The proposed TNMF models and algorithms are based on jointly learning nonnegative dictionaries and classifiers by minimising a target classification cost. The last part of our work highlights the links and the compatibility between NMF and certain deep neural network systems by proposing and adapting neural network architectures to the use of NMF as an input representation. The proposed models allow us to reach state-of-the-art performance on scene classification and overlapping event detection tasks. Finally, we explore the possibility of jointly learning NMF and neural network parameters, grouping the different stages of our systems into one optimisation problem.

    Apprentissage de représentations pour l’analyse de scènes sonores (Representation learning for acoustic scene analysis)

    No full text
    This thesis focuses on the computational analysis of environmental sound scenes and events. The objective of such tasks is to automatically extract information about the context in which a sound has been recorded. Interest in this area of research has increased rapidly in the last few years, leading to constant growth in the number of works and proposed approaches. We explore and contribute to the main families of approaches to sound scene and event analysis, going from feature engineering to deep learning. Our work is centered on representation learning techniques based on nonnegative matrix factorization (NMF), which are particularly suited to the analysis of multi-source environments such as acoustic scenes. As a first approach, we propose a combination of image processing features with the goal of confirming that spectrograms contain enough information to discriminate sound scenes and events. From there, we leave the world of feature engineering to move towards automatically learning the features. The first step we take in that direction is to study the usefulness of matrix factorization for unsupervised feature learning, especially by relying on variants of NMF. Several of the compared approaches indeed allow us to outperform feature engineering approaches to such tasks. Next, we propose to improve the learned representations by introducing the TNMF model, a supervised variant of NMF. The proposed TNMF models and algorithms are based on jointly learning nonnegative dictionaries and classifiers by minimising a target classification cost. The last part of our work highlights the links and the compatibility between NMF and certain deep neural network systems by proposing and adapting neural network architectures to the use of NMF as an input representation. The proposed models allow us to reach state-of-the-art performance on scene classification and overlapping event detection tasks. Finally, we explore the possibility of jointly learning NMF and neural network parameters, grouping the different stages of our systems into one optimisation problem.

    Improving music structure segmentation using lag-priors

    No full text

    HOG and subband power distribution image features for acoustic scene classification

    No full text
    Acoustic scene classification is a difficult problem, mostly due to the high density of events concurrently occurring in audio scenes. In order to capture the occurrences of these events, we propose to use the Subband Power Distribution (SPD) as a feature. We extract it by computing the histogram of amplitude values in each frequency band of a spectrogram image. The SPD allows us to model the density of events in each frequency band. Our method is evaluated on a large acoustic scene dataset using support vector machines. We outperform the previous methods when using the SPD in conjunction with the histogram of gradients. To reach further improvement, we also consider the use of an approximation of the earth mover's distance kernel to compare histograms in a more suitable way. Using the so-called Sinkhorn kernel improves the results on most of the feature configurations. The best performance reaches a 92.8% F1 score.
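A minimal sketch of the SPD feature as described (one amplitude histogram per frequency band of a spectrogram image, concatenated into a feature vector); the bin count, amplitude range and random stand-in spectrogram are illustrative assumptions.

```python
import numpy as np

def subband_power_distribution(S, n_bins=20, lo=0.0, hi=1.0):
    """Histogram of amplitude values in each frequency band (row) of a
    spectrogram image, concatenated into one feature vector and
    normalised by the number of frames."""
    edges = np.linspace(lo, hi, n_bins + 1)
    hists = [np.histogram(band, bins=edges)[0] for band in S]
    return np.concatenate(hists) / S.shape[1]

# Random stand-in for a spectrogram image with values in [0, 1).
S = np.random.default_rng(0).random((64, 200))
spd = subband_power_distribution(S)    # 64 bands x 20 bins = 1280 values
```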