Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
Sound event localization and detection is an emerging area of research that
combines analysis of the acoustic scene in terms of both the spatial and
temporal activity of sounds of interest. This paper presents an
overview of the first international evaluation on sound event localization and
detection, organized as a task of the DCASE 2019 Challenge. A large-scale
realistic dataset of spatialized sound events was generated for the challenge,
to be used for training learning-based approaches and for evaluating
submissions on an unlabeled subset. The overview presents in detail how the
systems were evaluated and ranked, and describes the characteristics of the
best-performing systems. Common strategies in terms of input features, model
architectures, training approaches, exploitation of prior knowledge, and data
augmentation are discussed. Since ranking in the challenge was based on
individually evaluating localization and event classification performance, part
of the overview focuses on presenting metrics for the joint measurement of the
two, together with a reevaluation of submissions using these new metrics. The
new analysis reveals submissions that performed better on the joint task of
detecting the correct type of event close to its original location than some of
the submissions that were ranked higher in the challenge. Consequently, the
ranking of submissions that performed strongly when evaluated separately on
detection or localization, but not jointly on both, was negatively affected.
FSD50K: an Open Dataset of Human-Labeled Sound Events
Most existing datasets for sound event recognition (SER) are relatively small
and/or domain-specific, with the exception of AudioSet, based on a massive
amount of audio tracks from YouTube videos and encompassing over 500 classes of
everyday sounds. However, AudioSet is not an open dataset---its release
consists of pre-computed audio features (instead of waveforms), which limits
the adoption of some SER methods. Downloading the original audio tracks is also
problematic due to constituent YouTube videos gradually disappearing and usage
rights issues, which casts doubt on the suitability of this resource for
benchmarking systems. To provide an alternative benchmark dataset and thus
foster SER research, we introduce FSD50K, an open dataset containing over 51k
audio clips totalling over 100h of audio manually labeled using 200 classes
drawn from the AudioSet Ontology. The audio clips are licensed under Creative
Commons licenses, making the dataset freely distributable (including
waveforms). We provide a detailed description of the FSD50K creation process,
tailored to the particularities of Freesound data, including challenges
encountered and solutions adopted. We include a comprehensive dataset
characterization along with discussion of limitations and key factors to allow
its audio-informed usage. Finally, we conduct sound event classification
experiments to provide baseline systems as well as insight into the main
factors to consider when splitting Freesound audio data for SER. Our goal is
to develop a dataset that will be widely adopted by the community as a new
open benchmark for SER research.
A Hybrid Parametric-Deep Learning Approach for Sound Event Localization and Detection
This work describes and discusses an algorithm submitted to the Sound Event Localization and Detection task of the DCASE 2019 Challenge. The proposed methodology relies on parametric spatial audio analysis for source localization and detection, combined with a deep learning-based monophonic event classifier. The evaluation of the proposed algorithm yields overall results comparable to the baseline system. The main highlight is a reduction of the localization error on the evaluation dataset by a factor of 2.6 compared with the baseline performance.