
    Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection.

    The design of new methods and models when only weakly-labeled data are available is of paramount importance in order to reduce the cost of manual annotation and the considerable human effort associated with it. In this work, we address Sound Event Detection in the case where a weakly annotated dataset is available for training. The weak annotations provide tags of audio events but no temporal boundaries. The objective is twofold: 1) audio tagging, i.e. multi-label classification at recording level, and 2) sound event detection, i.e. localization of the event boundaries within the recordings. This work focuses mainly on the second objective. We explore an approach inspired by Multiple Instance Learning, in which we train a convolutional recurrent neural network to give frame-level predictions, using a custom loss function based on the weak labels and the statistics of the frame-based predictions. Since some sound classes cannot be distinguished with this approach, we improve the method by penalizing similarity between the predictions of the positive classes during training. On the test set used in the DCASE 2018 challenge, consisting of 288 recordings and 10 sound classes, adding the penalty yielded a localization F-score of 34.75%, a 10% relative improvement over not using the penalty. Our best model achieved a 26.20% F-score on the DCASE 2018 official Eval subset, close to the 10-system ensemble approach that ranked second in the challenge with a 29.9% F-score.
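
    A minimal sketch (not the authors' exact loss) of how a cosine-similarity penalty between the frame-level prediction tracks of the tagged classes could be combined with a weak-label pooling loss. The pooling statistic (mean over time), the weighting factor `alpha`, and the function name are assumptions for illustration only.

```python
# Hypothetical sketch: weak-label (MIL-style) loss plus a cosine-similarity
# penalty that pushes the temporal prediction curves of co-occurring
# (positive) classes apart. Illustrative only; not the paper's exact loss.
import torch
import torch.nn.functional as F

def weak_label_loss_with_cosine_penalty(frame_probs, clip_labels, alpha=0.1):
    """
    frame_probs: (batch, time, classes) frame-level probabilities from a CRNN
    clip_labels: (batch, classes) weak (clip-level) binary labels as floats
    alpha: weight of the similarity penalty (assumed hyperparameter)
    """
    # Clip-level prediction by pooling frame probabilities over time
    # (mean pooling here; the choice of pooling statistic is an assumption).
    clip_probs = frame_probs.mean(dim=1)
    tagging_loss = F.binary_cross_entropy(clip_probs, clip_labels)

    # Penalise pairwise cosine similarity between the temporal activation
    # curves of the classes tagged as present in each clip.
    penalty = frame_probs.new_zeros(())
    n_pairs = 0
    for b in range(frame_probs.size(0)):
        pos = torch.nonzero(clip_labels[b] > 0.5, as_tuple=False).squeeze(1)
        for i in range(len(pos)):
            for j in range(i + 1, len(pos)):
                penalty = penalty + F.cosine_similarity(
                    frame_probs[b, :, pos[i]], frame_probs[b, :, pos[j]], dim=0)
                n_pairs += 1
    if n_pairs > 0:
        penalty = penalty / n_pairs
    return tagging_loss + alpha * penalty
```

    The penalty only involves classes tagged in the weak labels, so it discourages the network from producing near-identical temporal activations for confusable co-occurring classes without affecting classes absent from the clip.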

    Simple to Complex Cross-modal Learning to Rank

    The heterogeneity gap between different modalities poses a significant challenge for multimedia information retrieval. Some studies formalize cross-modal retrieval as a ranking problem and learn a shared multi-modal embedding space in which to measure cross-modality similarity. However, previous methods often build the shared embedding space from linear mapping functions, which may not be sophisticated enough to reveal more complicated inter-modal correspondences. Additionally, current studies assume that all rankings are of equal importance, so either all rankings are used simultaneously or a small number of rankings are selected randomly to train the embedding space at each iteration. Such strategies suffer from outliers and reduced generalization capability because they lack an insightful understanding of the procedure of human cognition. In this paper, we bring self-paced learning with diversity into cross-modal learning to rank and learn an optimal multi-modal embedding space based on non-linear mapping functions. This strategy enhances the model's robustness to outliers and achieves better generalization by training the model gradually, from easy rankings issued by diverse queries to more complex ones. An efficient alternative algorithm is exploited to solve the proposed challenging problem, with fast convergence in practice. Extensive experimental results on several benchmark datasets indicate that the proposed method achieves significant improvements over the state of the art. Comment: 14 pages; accepted by Computer Vision and Image Understanding.
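
    An illustrative sketch of self-paced sample selection with diversity, as used in the easy-to-hard curriculum described above: easy rankings (low loss) are admitted first, and a group-wise term encourages selecting rankings from diverse queries. The closed-form selection rule follows the standard self-paced-learning-with-diversity formulation; the grouping by query, the threshold schedule, and all names are assumptions, not details taken from the paper.

```python
# Sketch of self-paced selection with diversity: within each query group,
# rankings are sorted by loss and admitted while the loss stays below a
# rank-dependent threshold. The rank-dependent term favours spreading the
# selection across many queries instead of exhausting a single easy query.
import numpy as np

def spld_select(losses, groups, lam, gamma):
    """
    losses: (n,) current ranking losses of the training samples
    groups: (n,) query id of each ranking (diversity groups)
    lam:    "age" threshold, grown over training to admit harder samples
    gamma:  diversity weight, also grown over training
    Returns a boolean mask of the rankings used at this iteration.
    """
    selected = np.zeros(len(losses), dtype=bool)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        order = idx[np.argsort(losses[idx])]        # easiest first within the query
        for rank, i in enumerate(order, start=1):   # rank is 1-based
            threshold = lam + gamma / (np.sqrt(rank) + np.sqrt(rank - 1))
            if losses[i] < threshold:
                selected[i] = True
    return selected

# Typical curriculum: train the embedding on the selected subset, then grow
# lam and gamma (e.g. multiply by a factor > 1) so harder rankings enter later.
```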

    Advanced informatics for event detection and temporal localization

    PhD Thesis. The primary objective of a Sound Event Detection (SED) system is to detect the presence of an acoustic event (i.e., audio tagging) and to return the onset and offset of the identified acoustic event within an audio clip (i.e., temporal localization). Such a system is promising for wildlife and biodiversity monitoring, surveillance, and smart-home applications. However, developing a system that is adept at both subtasks is not trivial. It is hindered by the need for a large amount of strongly labeled data, in which the event tags and the corresponding onsets and offsets are known with certainty. This is a limiting factor, as strongly labeled data are challenging to collect and prone to annotation errors due to the ambiguity in the perception of onsets and offsets. In this thesis, we propose to address the lack of strongly labeled data by using pseudo strongly labeled data, where the event tags are known with certainty while the corresponding onsets and offsets are estimated. While Non-negative Matrix Factorization (NMF) can be used directly for SED, albeit with limited accuracy, we show that it is a useful tool for pseudo labeling. We further show that pseudo strongly labeled data estimated using our proposed methods improve the accuracy of an SED system developed using deep learning approaches. Subsequent work focuses on improving the SED system as a whole rather than on a single subtask. This leads to a novel student-teacher training framework that incorporates a noise-robust loss function, a new cyclic training scheme, an improved depthwise separable convolution, a triple instance-level temporal pooling approach, and an improved Transformer encoding layer. Together with synthetic strongly labeled data and a large corpus of unlabeled data, we show that an SED system developed using our proposed method achieves state-of-the-art performance.
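
    A rough sketch of how NMF activations could be turned into pseudo strong labels for a weakly tagged clip: decompose the spectrogram, then threshold the temporal activation of a component associated with the tagged event to estimate onsets and offsets. The component-to-class assignment, the threshold, the hop size, and the function name are assumptions, not the thesis's exact procedure.

```python
# Hypothetical pseudo strong labeling via NMF: the event tag is trusted
# (weak label), while onsets/offsets are estimated from the activation of
# an NMF component assumed to track that event.
import numpy as np
from sklearn.decomposition import NMF

def pseudo_strong_labels(spectrogram, n_components=8, event_component=0,
                         activation_threshold=0.5, hop_seconds=0.02):
    """
    spectrogram: (freq_bins, frames) nonnegative magnitude or mel spectrogram
    event_component: index of the component assumed to track the tagged event
    Returns a list of (onset_s, offset_s) segments in seconds.
    """
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = model.fit_transform(spectrogram)   # (freq_bins, components): spectral templates
    H = model.components_                  # (components, frames): temporal activations

    act = H[event_component]
    act = act / (act.max() + 1e-9)         # normalise activation to [0, 1]
    active = act > activation_threshold    # frame-level pseudo labels

    # Convert contiguous runs of active frames into (onset, offset) pairs.
    segments, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            segments.append((start * hop_seconds, t * hop_seconds))
            start = None
    if start is not None:
        segments.append((start * hop_seconds, len(active) * hop_seconds))
    return segments
```

    The resulting segments, paired with the trusted clip-level tag, can then serve as pseudo strongly labeled training targets for a frame-level deep model.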

    Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)


    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In many machine learning applications, the available datasets are sometimes incomplete, noisy, or affected by artifacts. In supervised scenarios, the label information may be of low quality, which includes unbalanced training sets, noisy labels, and other problems. Moreover, in practice it is very common that the available data samples are not sufficient to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to disseminate new ideas for solving this challenging problem and to provide clear examples of application in real scenarios.