
34 research outputs found

Training Sound Event Detection On A Heterogeneous Dataset

Authors: Serizel Romain, Turpault Nicolas
Publication date: 08/07/2020

Training a sound event detection algorithm on a heterogeneous dataset, including both recorded and synthetic soundscapes that can have various labeling granularity, is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of the DCASE 2020 Task 4 sound event detection baseline with regard to several aspects, such as the type of data used for training, the parameters of the mean-teacher, or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as defaults are shown to be sub-optimal.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server
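The abstract above refers to "the parameters of the mean-teacher" without spelling them out. As a rough illustration only, the sketch below shows the two ingredients usually associated with mean-teacher training: a teacher whose weights are an exponential moving average (EMA) of the student's weights, and a consistency loss between the two models' predictions. The function names, the flat-list weight representation and the default `alpha` value are assumptions made for this example, not values taken from the paper or the baseline.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               alpha: float = 0.999) -> List[float]:
    """Return teacher weights moved toward the student weights.

    alpha is the EMA decay (hypothetical default): values close to 1
    make the teacher change slowly.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same size")
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs: List[float],
                     teacher_probs: List[float]) -> float:
    """Mean squared error between student and teacher predictions."""
    if len(student_probs) != len(teacher_probs):
        raise ValueError("prediction vectors must have the same size")
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n


# One illustrative step: the teacher drifts halfway toward the student.
teacher_w = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.5)
```

In this sketch, `alpha` and the weight given to the consistency term in the total loss are the kind of defaults the abstract says are often inherited without being re-examined.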

Analyse des problèmatiques liées à la reconnaissance de sons ambiants en environnement réel

Author: Turpault Nicolas
Publication venue: HAL CCSD
Publication date: 31/05/2021

We are constantly surrounded by ambient sounds. From a passing car to birdsong, from running water in the shower to the clatter of a keyboard, ambient sounds are everywhere. Humans without hearing loss unconsciously recognize these sounds and use the information they provide to make many everyday decisions (reacting to a crying baby or to an alarm, for example). In recent years, research interest in automatic ambient sound analysis has grown rapidly. Ambient sound analysis is a difficult problem because of the complexity of sound scenes and their lack of apparent structure. The sound events that make up these scenes are highly varied, and multiple events can occur simultaneously. To recognize sound events automatically, machine learning methods are usually used, in particular deep learning methods, owing to their good performance on a variety of tasks including ambient sound analysis. These methods require a training dataset containing the sound events to be recognized. Ideally, the dataset contains labels indicating the type of each event and its time position in the audio clip (strong labels). In recent years, some strongly annotated datasets designed for ambient sound analysis have appeared, but they usually contain only a small amount of data and are rarely recorded in real conditions. Strong annotations are expensive to collect, making it difficult to acquire a large-scale strongly labeled dataset. However, collecting data without labels, or with partial labels indicating the presence of some events without their time information (weak labels), is easier. This thesis fits in this context. We propose to address the problem of sound event recognition in domestic environments using unlabeled and weakly labeled data. Our goal is to analyze the different problems that can appear in a real-world scenario of sound event recognition in a domestic environment, with applications to assisted living and smart homes. To analyze this problem, we organized a domestic sound event detection task in an international ambient sound analysis challenge. We defined this task in such a way that it allows us to analyze the different problems appearing in a real-world scenario. We collected, annotated and shared a dataset designed for this analysis. From 2018 to 2020, we organized three evaluation campaigns to allow for a detailed analysis of the systems submitted by participants and continuous improvement of the task definition. Then, we focus on the problem of learning systems using both labeled and unlabeled training data (semi-supervised learning). The analysis concentrates on learning a representation that could be useful for a variety of tasks in sound event detection or tagging. Finally, we analyze the impact of weak labels in the training dataset of a sound event recognition system, to understand whether this is the main problem of such a system and to provide advice for the labeling of real-world data.

INRIA a CCSD electronic archive server

Training Sound Event Detection On A Heterogeneous Dataset

Authors: Serizel Romain, Turpault Nicolas
Publication venue: HAL CCSD
Publication date: 01/11/2020

Training a sound event detection algorithm on a heterogeneous dataset, including both recorded and synthetic soundscapes that can have various labeling granularity, is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of the DCASE 2020 Task 4 sound event detection baseline with regard to several aspects, such as the type of data used for training, the parameters of the mean-teacher, or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as defaults are shown to be sub-optimal.

INRIA a CCSD electronic archive server

Sound Event Detection from Partially Annotated Data: Trends and Challenges

Authors: Serizel Romain, Turpault Nicolas
Publication venue: HAL CCSD
Publication date: 03/06/2019

This paper proposes an overview of the latest advances and challenges in sound event detection and classification with systems trained on partially annotated data. The paper focuses on the scientific aspects highlighted by Task 4 of the DCASE 2018 challenge: large-scale weakly labeled semi-supervised sound event detection in domestic environments. Given a small training set composed of weakly labeled audio clips (without timestamps) and a larger training set composed of unlabeled audio clips, the target of the task is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio clip. This paper proposes a detailed analysis of the impact of the time segmentation, the event classification and the methods used to exploit unlabeled data on the final performance of sound event detection systems.

INRIA a CCSD electronic archive server
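To make the distinction in the abstract above concrete, the sketch below contrasts a strong annotation (event class plus onset and offset times) with the weak annotation that remains once the timestamps are dropped. The tuple layout, the function names and the example class names are assumptions made for illustration; they are not the challenge's annotation file format.

```python
from typing import List, Set, Tuple

# (event class, onset in seconds, offset in seconds) -- hypothetical layout
StrongLabel = Tuple[str, float, float]


def strong_to_weak(events: List[StrongLabel]) -> Set[str]:
    """Drop the timestamps: keep only which classes occur in the clip."""
    return {label for label, _onset, _offset in events}


def active_classes(events: List[StrongLabel], time_s: float) -> Set[str]:
    """Classes active at a given instant, which only strong labels can tell."""
    return {label for label, onset, offset in events if onset <= time_s < offset}


# Hypothetical 10-second clip with overlapping events.
clip = [("Speech", 0.5, 3.2), ("Dog", 2.0, 2.8), ("Speech", 6.0, 9.5)]
weak = strong_to_weak(clip)
```

A system trained only on `weak` knows that both classes occur somewhere in the clip, yet the task described above still asks it to recover the time boundaries that `active_classes` relies on.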

Analyse des annotations faibles pour l'étiquetage d'événements sonores.

Authors: Serizel Romain, Turpault Nicolas, Vincent Emmanuel
Publication venue: HAL CCSD
Publication date: 21/04/2021

Weak labels are a recurring problem in the context of ambient sound analysis. While multiple methods using neural networks have been proposed to address it, limited attention has been given to analyzing the problem in order to understand it better. Many of these methods seem to improve detection or tagging performance, but they have been evaluated in scenarios where other problems such as unreliable labels, overlapping sound events, or class imbalance also occur. Therefore, it is difficult to conclude whether the observed improvement is due to solving the problem of weak labels or not. In this article, we provide for the first time a detailed analysis of the impact of weak labels, independently of other problems, on a sound event tagging system. We show that, in order to limit the negative impact of weak labels on the performance, the training clips must be at least as long as the test clips and longer training clip durations have a minor impact. We also show that good temporal aggregation can help to reduce this impact at test time and provide insight on the annotation granularity needed depending on the targeted scenario.

INRIA a CCSD electronic archive server

HAL-Rennes 1
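The abstract above mentions temporal aggregation at test time without naming a particular method. As a hedged illustration, the sketch below shows three common ways of turning frame-level probabilities for one class into a single clip-level score: max pooling, mean pooling and linear softmax pooling. These are generic pooling functions chosen for this example, and nothing here should be read as the aggregation actually evaluated in the article.

```python
from typing import List


def max_pool(frame_probs: List[float]) -> float:
    """Clip score = most confident frame (sensitive to short events)."""
    return max(frame_probs)


def mean_pool(frame_probs: List[float]) -> float:
    """Clip score = average over frames (penalizes short events)."""
    return sum(frame_probs) / len(frame_probs)


def linear_softmax_pool(frame_probs: List[float]) -> float:
    """Clip score = sum(p^2) / sum(p): frames weighted by their own score."""
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0
    return sum(p * p for p in frame_probs) / total


# Hypothetical frame-level probabilities for one class in one clip.
frames = [0.1, 0.9, 0.2]
scores = (mean_pool(frames), linear_softmax_pool(frames), max_pool(frames))
```

For probabilities in [0, 1], the linear softmax score lies between the mean and the max, which is why it is often described as a compromise between the two.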

Semi-supervised triplet loss based learning of ambient audio embeddings

Authors: Serizel Romain, Turpault Nicolas, Vincent Emmanuel
Publication venue: HAL CCSD
Publication date: 12/05/2019

Deep neural networks are particularly useful to learn relevant representations from data. Recent studies have demonstrated the potential of unsupervised representation learning for ambient sound analysis using various flavors of the triplet loss. They have compared this approach to supervised learning. However, in real situations, it is common to have a small labeled dataset and a large unlabeled one. In this paper, we combine unsupervised and supervised triplet loss based learning into a semi-supervised representation learning approach. We propose two flavors of this approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor, or by selecting the nearest sample in the training set. We compare our approach to supervised and unsupervised representation learning as well as the ratio between the amount of labeled and unlabeled data. We evaluate all the above approaches on an audio tagging task using the DCASE 2018 Task 4 dataset, and we show the impact of this ratio on the tagging performance.

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1
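The abstract above describes triplets whose unlabeled anchors get their positive sample either from a transformation of the anchor or from the nearest sample in the training set. The sketch below illustrates the general shape of a margin-based triplet loss and the nearest-sample selection on plain embedding vectors. The squared Euclidean distance, the default margin and all function names are assumptions made for this example; the paper's actual distance, margin and sampling details are not given in the abstract.

```python
from typing import List, Sequence


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge loss: the positive should be closer than the negative by `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def nearest_positive(anchor: Sequence[float],
                     candidates: List[Sequence[float]]) -> Sequence[float]:
    """Pick the closest candidate as the positive for an unlabeled anchor.

    `candidates` is assumed not to contain the anchor itself.
    """
    return min(candidates, key=lambda c: sq_dist(anchor, c))


# Hypothetical 2-D embeddings.
anchor = [0.0, 0.0]
pool = [[2.0, 2.0], [0.5, 0.0], [1.0, 1.0]]
positive = nearest_positive(anchor, pool)
loss = triplet_loss(anchor, positive, negative=[3.0, 4.0])
```

The other flavor mentioned in the abstract would replace `nearest_positive` with a transformed copy of the anchor; the loss itself stays the same.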

Limitations of weak labels for embedding and tagging

Authors: Serizel Romain, Turpault Nicolas, Vincent Emmanuel
Publication venue: HAL CCSD
Publication date: 04/05/2020

Many datasets and approaches in ambient sound analysis use weakly labeled data. Weak labels are employed because annotating every data sample with a strong label is too expensive. Yet, their impact on the performance in comparison to strong labels remains unclear. Indeed, weak labels must often be dealt with at the same time as other challenges, namely multiple labels per sample, unbalanced classes and/or overlapping events. In this paper, we formulate a supervised learning problem which involves weak labels. We create a dataset that focuses on the difference between strong and weak labels as opposed to other challenges. We investigate the impact of weak labels when training an embedding or an end-to-end classifier. Different experimental scenarios are discussed to provide insights into which applications are most sensitive to weakly labeled data.

INRIA a CCSD electronic archive server

Performance above all? Energy consumption vs. performance for machine listening, a study on DCASE Task 4 baseline

Authors: Cornell Samuele, Serizel Romain, Turpault Nicolas
Publication venue: HAL CCSD
Publication date: 14/11/2022

In machine listening there is a tendency to resort to models with a growing number of parameters, which raises concerns about their practical viability because of their energy consumption. Reporting the energy consumption of the models could be a first step to raise awareness on this matter. Yet, estimating the energy consumption across different conditions (hyper-parameters, GPU types, etc.) poses some challenges in terms of biases and fairness of the comparison between different models and works. In this paper we perform an extensive study using the DCASE Task 4 baseline system and monitor energy consumption and training time for different GPU types and batch sizes. The goal is to identify which aspects can have an impact on the estimation of the energy consumption and should be normalized for a fair comparison across systems. Additionally, we propose an analysis of the relationship between the energy consumption and the sound event detection performance that calls into question our current way of evaluating systems.

INRIA a CCSD electronic archive server

HAL-Rennes 1