Learning Audio Features with Metadata and Contrastive Learning
Methods based on supervised learning with annotations in an end-to-end fashion have been the state of the art for classification problems. However, they may be limited in their generalization capability, especially in the low-data regime. In this study, we address this issue using supervised contrastive learning combined with available metadata to solve multiple pretext tasks that learn a good representation of the data. We apply our approach to ICBHI, a respiratory sound classification dataset suited for this setting. We show that learning representations using only metadata, without class labels, achieves performance similar to using cross-entropy with those labels alone. In addition, we obtain a state-of-the-art score when combining class labels with metadata using multiple supervised contrastive learning. This work suggests the potential of using multiple metadata sources in supervised contrastive settings, in particular in settings with class imbalance and few data. Our code is released at https://github.com/ilyassmoummad/scl_icbhi201
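A minimal sketch of how metadata can serve as labels in a supervised contrastive objective, assuming metadata fields such as sex and age group are available as integer codes per recording. The `supcon_loss` below is a standard one-view supervised contrastive formulation, and the multi-task combination in the comment is illustrative rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss over L2-normalized embeddings (one view per sample).
    Positives for each anchor are the other samples in the batch sharing its label."""
    features = F.normalize(features, dim=1)
    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    sim = features @ features.T / temperature
    sim = sim.masked_fill(self_mask, float('-inf'))          # exclude self from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)          # avoid -inf * 0 below
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                                    # anchors with at least one positive
    loss = -(log_prob * pos_mask).sum(1)[valid] / pos_counts[valid]
    return loss.mean()

# Hypothetical multi-task use: one contrastive term per label source (class + metadata).
# embeddings: (batch, dim) projector outputs; class_y, sex_y, age_group_y: (batch,) int labels.
# total = supcon_loss(embeddings, class_y) \
#       + supcon_loss(embeddings, sex_y) \
#       + supcon_loss(embeddings, age_group_y)
```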
Regularized Contrastive Pre-training for Few-shot Bioacoustic Sound Detection
Bioacoustic sound event detection allows for a better understanding of animal behavior and better monitoring of biodiversity using audio. Deep learning systems can help achieve this goal; however, it is difficult to acquire sufficient annotated data to train these systems from scratch. To address this limitation, the Detection and Classification of Acoustic Scenes and Events (DCASE) community has recast the problem within the framework of few-shot learning and organizes an annual challenge for learning to detect animal sounds from only five annotated examples. In this work, we regularize supervised contrastive pre-training to learn features that transfer well to new target tasks with animal sounds unseen during training, achieving a high F-score of 61.52% (0.48) when no feature adaptation is applied, and an F-score of 68.19% (0.75) when we further adapt the learned features for each new target task. This work aims to lower the entry bar to few-shot bioacoustic sound event detection by proposing a simple yet effective framework for this task and by providing open-source code.
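The abstract does not spell out the adaptation step; one common instantiation is to briefly fine-tune a copy of the pretrained encoder on the few annotated shots of each new recording. The sketch below assumes a hypothetical `encoder` module exposing an `out_dim` attribute and is only an illustration of per-task adaptation, not the paper's exact procedure.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_to_task(encoder, shots, shot_labels, steps=20, lr=1e-4):
    """Hypothetical per-task adaptation: fine-tune a copy of the pretrained encoder
    plus a small classification head on the annotated shots of a new recording."""
    model = copy.deepcopy(encoder)                            # keep the pretrained weights intact
    head = torch.nn.Linear(model.out_dim, int(shot_labels.max()) + 1)
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=lr)
    model.train()
    for _ in range(steps):
        loss = F.cross_entropy(head(model(shots)), shot_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return model, head
```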
Pretraining Representations for Bioacoustic Few-shot Detection using Supervised Contrastive Learning
Deep learning has been widely used recently for sound event detection and classification. Its success is linked to the availability of sufficiently large datasets, possibly with corresponding annotations when supervised learning is considered. In bioacoustic applications, most tasks come with few labelled training data, because annotating long recordings is time-consuming and costly. Therefore, supervised learning is not the best-suited approach to solve bioacoustic tasks. The bioacoustic community recast the problem of sound event detection within the framework of few-shot learning, i.e. training a system with only a few labeled examples. The few-shot bioacoustic sound event detection task in the DCASE challenge focuses on detecting events in long audio recordings given only five annotated examples for each class of interest. In this paper, we show that learning a rich feature extractor from scratch can be achieved by leveraging data augmentation within a supervised contrastive learning framework. We highlight the ability of this framework to transfer well to five-shot event detection on classes unseen during training. We obtain an F-score of 63.46% on the validation set and 42.7% on the test set, ranking second in the DCASE challenge. We provide an ablation study for the critical choices of data augmentation techniques as well as for the learning strategy applied to the training set.
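As an illustration of the augmentation-driven view creation described above, the sketch below applies SpecAugment-style frequency and time masking to a batch of mel-spectrograms and feeds two views per clip to the supervised contrastive loss sketched earlier; the masking parameters and the choice of augmentations are assumptions, not necessarily the paper's.

```python
import torch

def spec_augment(spec, max_f=16, max_t=32):
    """Illustrative view creation: random frequency and time masking applied to a
    batch of mel-spectrograms of shape (batch, freq, time)."""
    spec = spec.clone()
    batch, n_freq, n_time = spec.shape
    for i in range(batch):
        f_width = torch.randint(1, max_f + 1, (1,)).item()
        t_width = torch.randint(1, max_t + 1, (1,)).item()
        f0 = torch.randint(0, n_freq - f_width, (1,)).item()
        t0 = torch.randint(0, n_time - t_width, (1,)).item()
        spec[i, f0:f0 + f_width, :] = 0.0      # frequency mask
        spec[i, :, t0:t0 + t_width] = 0.0      # time mask
    return spec

# Two augmented views per clip; views of the same clip keep the clip's class label,
# and the SupCon-style loss from the first sketch is applied to the concatenated batch:
# v1, v2 = spec_augment(batch_specs), spec_augment(batch_specs)
# z = projector(encoder(torch.cat([v1, v2])))
# loss = supcon_loss(z, torch.cat([labels, labels]))
```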
PRETRAINING RESPIRATORY SOUND REPRESENTATIONS USING METADATA AND CONTRASTIVE LEARNING
Methods based on supervised learning with annotations in an end-to-end fashion have been the state of the art for classification problems. However, they may be limited in their generalization capability, especially in the low-data regime. In this study, we address this issue using supervised contrastive learning combined with available metadata to solve multiple pretext tasks that learn a good representation of data. We apply our approach to respiratory sound classification. This task is suited to this setting because demographic information such as sex and age is correlated with the presence of lung diseases, and a system that implicitly encodes this information may better detect anomalies. Supervised contrastive learning is a paradigm that learns similar representations for samples sharing the same class label and dissimilar representations for samples with different class labels. The feature extractor learned with this paradigm extracts useful features from the data, and we show that it outperforms cross-entropy in classifying respiratory anomalies on two different datasets. We also show that learning representations using only metadata, without class labels, achieves performance similar to using cross-entropy with those labels alone. In addition, when class labels are combined with metadata using multiple supervised contrastive learning, an extension of supervised contrastive learning that solves an additional task of grouping patients within the same sex and age group, more informative features are learned. This work suggests the potential of using multiple metadata sources in supervised contrastive settings, in particular in settings with class imbalance and few data.
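For reference, the supervised contrastive (SupCon) objective described in words above is usually written as follows (Khosla et al., 2020), where z are the normalized projections, τ the temperature, A(i) the other samples in the batch, and P(i) the positives of anchor i (samples sharing its label). In the multi-metadata setting described here, the same loss can plausibly be instantiated once per label source (class, sex and age group) and the terms summed.

```latex
\mathcal{L}^{\mathrm{sup}} \;=\; \sum_{i \in I} \frac{-1}{|P(i)|}
\sum_{p \in P(i)} \log
\frac{\exp\!\left(z_i \cdot z_p / \tau\right)}
     {\sum_{a \in A(i)} \exp\!\left(z_i \cdot z_a / \tau\right)}
```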
Contrastive Learning using Random Walk Laplacian Matrix
In recent years, Self-Supervised Learning (SSL) has gained popularity due to the availability of unlabeled data. SSL consists of training a neural network encoder to represent data efficiently in a low-dimensional space, i.e., to extract useful features from data (the data it is trained on as well as new data). This problem of learning good representations is generally called representation learning. To this end, SimCLR [1] is a popular contrastive method that optimizes an encoder to output similar representations for different views of the same data (positive pairs) and different representations for views of different data (negative pairs). In this work, we treat the representation learning problem as a random walk on a graph whose vertices are data representations in the latent space and whose edges are their similarities. This can be formulated as an optimization problem that maximizes the transition probability between views of the same data and minimizes the transition probability between views of different data. This problem has been approximated in [2], which proposes a solution minimizing the sum of Euclidean distances between positive pairs and adds a decorrelation term to avoid representation collapse (i.e., convergence to a trivial solution in which representations are constant). In this work, we propose a simpler loss function that leverages the random walk Laplacian matrix directly. We benchmark our approach on the CIFAR10 dataset using standard data augmentations from the literature to create different views of data [3], and compare our results to SimCLR.
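The abstract does not reproduce the loss itself, but the graph quantities it refers to are standard. For a similarity matrix W over the batch of representations, the transition matrix and the random walk Laplacian are:

```latex
W_{ij} = \mathrm{sim}(z_i, z_j), \qquad
D = \operatorname{diag}\!\Big(\textstyle\sum_j W_{ij}\Big), \qquad
P = D^{-1} W, \qquad
L_{\mathrm{rw}} = I - D^{-1} W .
```

Since P = I − L_rw, maximizing the transition probability P_ij between views of the same datum and minimizing it between views of different data, as stated above, can be expressed directly in terms of the off-diagonal entries of L_rw.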
SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINING BIOACOUSTIC FEW-SHOT SYSTEMS
We show in this work that learning a rich feature extractor from scratch using only the official training data is feasible. We achieve this by learning representations with a supervised contrastive learning framework. We then transfer the learned feature extractor to the validation and test sets for few-shot evaluation. For few-shot validation, we simply train a linear classifier on the negative and positive shots and obtain an F-score of 63.46%, outperforming the baseline by a large margin. We do not use any external data or pretrained model. Our approach does not require choosing a threshold for prediction or any post-processing technique. Our code is publicly available on GitHub: https://github.com/ilyassmoummad/dcase23_task5_sc
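A minimal sketch of the evaluation protocol described above, assuming a pretrained `encoder` that maps a batch of audio features to fixed-size embeddings; how negative shots are sampled and how frame-level predictions are turned into events is left out.

```python
import torch
import torch.nn.functional as F

def evaluate_few_shot(encoder, pos_shots, neg_shots, query, steps=100, lr=1e-3):
    """Freeze the pretrained encoder, fit a linear classifier on the positive and
    negative shots, then score the remaining frames of the recording (query)."""
    encoder.eval()
    with torch.no_grad():
        x = encoder(torch.cat([pos_shots, neg_shots]))        # (n_shots, dim) features
        q = encoder(query)                                     # (n_query, dim) features
    y = torch.cat([torch.ones(len(pos_shots)), torch.zeros(len(neg_shots))]).long()
    clf = torch.nn.Linear(x.size(1), 2)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(clf(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return clf(q).argmax(dim=1)                                # 1 = event present
```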
Domain-Invariant Representation Learning of Bird Sounds
Passive acoustic monitoring (PAM) is crucial for bioacoustic research, enabling non-invasive species tracking and biodiversity monitoring. Citizen science platforms like Xeno-Canto provide large annotated datasets from focal recordings, where the target species is intentionally recorded. However, PAM requires monitoring in passive soundscapes, creating a domain shift between focal and passive recordings that challenges deep learning models trained on focal recordings. To address this, we leverage supervised contrastive learning to improve domain generalization in bird sound classification, enforcing domain invariance across same-class examples from different domains. We also propose ProtoCLR (Prototypical Contrastive Learning of Representations), which reduces the computational complexity of the SupCon loss by comparing examples to class prototypes instead of performing pairwise comparisons. Additionally, we present a new few-shot classification benchmark based on BirdSet, a large-scale bird sound dataset, and demonstrate the effectiveness of our approach in achieving strong transfer performance.
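One plausible reading of the prototype-based contrastive objective mentioned above: embeddings are compared to per-class prototypes (batch means of same-class embeddings) rather than to every other example, reducing the number of comparisons from quadratic to linear in the batch size. The function below is an illustration, not necessarily the exact ProtoCLR formulation.

```python
import torch
import torch.nn.functional as F

def proto_contrastive_loss(features, labels, temperature=0.07):
    """Pull each L2-normalized embedding toward its class prototype (mean of
    same-class embeddings in the batch) and push it from the other prototypes."""
    features = F.normalize(features, dim=1)
    classes, inverse = torch.unique(labels, return_inverse=True)   # remap labels to 0..C-1
    protos = torch.zeros(len(classes), features.size(1), device=features.device)
    protos.index_add_(0, inverse, features)                         # sum embeddings per class
    counts = torch.bincount(inverse, minlength=len(classes)).unsqueeze(1)
    protos = F.normalize(protos / counts, dim=1)                    # normalized class prototypes
    logits = features @ protos.T / temperature                      # (batch, C) similarities
    return F.cross_entropy(logits, inverse)
```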