Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection.
The design of new methods and models when only weakly-labeled data are available is of paramount importance in order to reduce the costs of manual annotation and the considerable human effort associated with it. In this work, we address Sound Event Detection in the case where a weakly annotated dataset is available for training. The weak annotations provide tags of audio events but do not provide temporal boundaries. The objective is twofold: 1) audio tagging, i.e. multi-label classification at recording level, and 2) sound event detection, i.e. localization of the event boundaries within the recordings. This work focuses mainly on the second objective. We explore an approach inspired by Multiple Instance Learning, in which we train a convolutional recurrent neural network to give predictions at frame level, using a custom loss function based on the weak labels and the statistics of the frame-based predictions. Since some sound classes cannot be distinguished with this approach, we improve the method by penalizing similarity between the predictions of the positive classes during training. On the test set used in the DCASE 2018 challenge, consisting of 288 recordings and 10 sound classes, the addition of the penalty resulted in a localization F-score of 34.75%, a 10% relative improvement over not using the penalty. Our best model achieved a 26.20% F-score on the DCASE 2018 official Eval subset, close to the 10-system ensemble approach that ranked second in the challenge with a 29.9% F-score.
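The similarity penalty described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name and the exact penalty form (mean pairwise cosine similarity between the frame-level prediction tracks of the clip's positive classes) are assumptions consistent with the description above.

```python
import numpy as np

def cosine_similarity_penalty(frame_preds, positive_classes):
    """Mean pairwise cosine similarity between the frame-level
    prediction tracks of the classes tagged positive in a clip.

    frame_preds: (T, C) array of per-frame class probabilities.
    positive_classes: indices of the classes present in the weak label.

    A high value means two positive classes are predicted active on
    the same frames, i.e. the model fails to separate them in time;
    adding this term to the loss penalizes that overlap.
    """
    sims = []
    for i in range(len(positive_classes)):
        for j in range(i + 1, len(positive_classes)):
            a = frame_preds[:, positive_classes[i]]
            b = frame_preds[:, positive_classes[j]]
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            if denom > 0:
                sims.append(float(a @ b / denom))
    return float(np.mean(sims)) if sims else 0.0
```

In training, this scalar would be scaled by a weight and added to the weak-label classification loss.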
Simple to Complex Cross-modal Learning to Rank
The heterogeneity-gap between different modalities brings a significant
challenge to multimedia information retrieval. Some studies formalize
cross-modal retrieval as a ranking problem and learn a shared multi-modal
embedding space to measure the cross-modality similarity. However, previous
methods often establish the shared embedding space based on linear mapping
functions which might not be sophisticated enough to reveal more complicated
inter-modal correspondences. Additionally, current studies assume that the
rankings are of equal importance, and thus all rankings are used
simultaneously, or a small number of rankings are selected randomly to train
the embedding space at each iteration. Such strategies, however, suffer
from outliers as well as reduced generalization, because they lack an
insightful model of the easy-to-hard progression of human cognition. In this paper, we
involve the self-paced learning theory with diversity into the cross-modal
learning to rank and learn an optimal multi-modal embedding space based on
non-linear mapping functions. This strategy enhances the model's robustness to
outliers and achieves better generalization via training the model gradually
from easy rankings by diverse queries to more complex ones. An efficient
alternative algorithm is exploited to solve the proposed challenging problem
with fast convergence in practice. Extensive experimental results on several
benchmark datasets indicate that the proposed method achieves significant
improvements over the state of the art.
Comment: 14 pages; Accepted by Computer Vision and Image Understanding
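The "easy rankings first" selection strategy in the abstract resembles self-paced learning with diversity: at each iteration, only rankings whose loss falls below a growing threshold are used, with the threshold relaxed within each query group so that selections are spread across diverse queries. The sketch below is an assumption based on the standard SPLD closed-form selection rule, not the paper's exact scheme; `spld_select` and its arguments are illustrative names.

```python
import numpy as np

def spld_select(losses, groups, lam, gamma):
    """Self-paced selection with diversity (illustrative sketch).

    Within each group (e.g. the rankings produced by one query),
    samples are sorted by loss; the i-th easiest is selected if its
    loss is below lam + gamma / (sqrt(i) + sqrt(i - 1)). The decaying
    bonus favours taking a few easy samples from every group over
    exhausting a single easy group. Returns a 0/1 weight per sample.
    """
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups)
    w = np.zeros_like(losses)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        order = idx[np.argsort(losses[idx])]
        for rank, i in enumerate(order, start=1):
            threshold = lam + gamma / (np.sqrt(rank) + np.sqrt(rank - 1))
            if losses[i] < threshold:
                w[i] = 1.0
    return w
```

Growing `lam` across iterations admits progressively harder rankings, which is the "simple to complex" curriculum the title refers to.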
Advanced informatics for event detection and temporal localization
PhD Thesis
The primary objective of a Sound Event Detection (SED) system is to detect the presence
of an acoustic event (i.e., audio tagging) and to return the onset and offset of the identified acoustic event within an audio clip (i.e., temporal localization). Such a system
can be promising in wildlife and biodiversity monitoring, surveillance, and smart-home
applications.
However, developing a system to be adept at both subtasks is not a trivial task. It can
be hindered by the need for a large amount of strongly labeled data, where the event tags
and the corresponding onsets and offsets are known with certainty. This is a limiting factor
as strongly labeled data is challenging to collect and is prone to annotation errors due to
the ambiguity in the perception of onsets and offsets.
In this thesis, we propose to address the lack of strongly labeled data by using pseudo
strongly labeled data, where the event tags are known with certainty while the corresponding onsets and offsets are estimated. Although Nonnegative Matrix Factorization can be
used directly for SED, its accuracy in that role is limited; we show that it is nonetheless a useful tool
for pseudo labeling. We further show that pseudo strongly labeled data estimated using
our proposed methods can improve the accuracy of a SED system developed using deep
learning approaches.
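The pseudo-labeling idea above can be sketched in a few lines: factorize a magnitude spectrogram with NMF and threshold a component's activation track to estimate onset and offset frames. This is a minimal illustration under assumed details (Euclidean multiplicative updates, a fixed threshold), not the thesis's proposed method.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Minimal NMF via multiplicative updates (Lee-Seung, Euclidean
    objective): V (F, T) is approximated by W (F, rank) @ H (rank, T)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def pseudo_onsets_offsets(activation, threshold):
    """Turn one activation row of H into (onset, offset) frame pairs:
    each pair delimits a contiguous run of frames whose activation
    exceeds the threshold, serving as an estimated event boundary."""
    active = activation > threshold
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            events.append((start, t))
            start = None
    if start is not None:
        events.append((start, len(active)))
    return events
```

Pairing the known (weak) event tag with boundaries estimated this way yields pseudo strongly labeled frames for training a frame-level classifier.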
Subsequent work then focused on improving a SED system as a whole rather than a
single subtask. This leads to the proposal of a novel student-teacher training framework
that incorporates a noise-robust loss function, a new cyclic training scheme, an improved
depthwise separable convolution, a triple instance-level temporal pooling approach, and an
improved Transformer encoding layer. Together with synthetic strongly labeled data and a
large corpus of unlabeled data, we show that a SED system developed using our proposed
method is capable of producing state-of-the-art performance.
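Student-teacher training of the kind mentioned above commonly maintains the teacher as an exponential moving average of the student, as in the mean-teacher scheme widely used for semi-supervised SED. The snippet below sketches only that EMA step; it is a generic illustration, not the cyclic training scheme proposed in the thesis.

```python
def ema_update(teacher_params, student_params, alpha=0.999):
    """Mean-teacher style update: each teacher weight becomes an
    exponential moving average of the corresponding student weight.
    The smoothed teacher then produces targets on unlabeled clips."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]
```

After every student optimization step, `ema_update` is applied so the teacher lags behind the student smoothly and gives more stable pseudo-targets.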
Machine Learning Methods with Noisy, Incomplete or Small Datasets
In many machine learning applications, the available datasets are incomplete, noisy or affected by artifacts. In supervised scenarios, the label information may be of low quality, owing to unbalanced training sets, noisy labels and other problems. Moreover, in practice, it is very common that the available data samples are too few to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to contribute to the dissemination of new ideas to solve this challenging problem, and to provide clear examples of application in real scenarios.