Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection involves not only detecting which
sound events are happening but also localizing the corresponding sound sources. This
series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound
event localization and detection task introduced additional challenges with
moving sound sources and overlapping events, including two events of
the same type with two different direction-of-arrival (DoA) angles. In this
paper, a novel event-independent network for polyphonic sound event
localization and detection is proposed. Unlike the two-stage method we proposed
in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the
network are first-order Ambisonics (FOA) time-domain signals, which are then
fed into a 1-D convolutional layer to extract acoustic features. The network is
then split into two parallel branches. The first branch is for sound event
detection (SED), and the second branch is for DoA estimation. There are three
types of predictions from the network: SED predictions, DoA predictions, and
event activity detection (EAD) predictions, which are used to combine the SED and
DoA features for onset and offset estimation. All of these predictions have
the format of two tracks indicating that there are at most two overlapping
events. Within each track, there could be at most one event happening. This
architecture introduces a problem of track permutation. To address this
problem, a frame-level permutation invariant training method is used.
Experimental results show that the proposed method can detect polyphonic sound
events and their corresponding DoAs. Its performance on the Task 3 dataset is
greatly improved compared with that of the baseline method.
Comment: conference
An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection (SELD), which jointly
performs sound event detection (SED) and direction-of-arrival (DoA) estimation,
detects the type and occurrence time of sound events as well as their
corresponding DoA angles simultaneously. We study the SELD task from a
multi-task learning perspective. Two open problems are addressed in this paper.
Firstly, to detect overlapping sound events of the same type but with different
DoAs, we propose to use a trackwise output format and solve the accompanying
track permutation problem with permutation-invariant training. Multi-head
self-attention is further used to separate tracks. Secondly, a previous finding
is that, by using hard parameter-sharing, SELD suffers from a performance loss
compared with learning the subtasks separately. This is solved by a soft
parameter-sharing scheme. We term the proposed method Event Independent
Network V2 (EINV2), which is an improved version of our previously-proposed
method and an end-to-end network for SELD. We show that our proposed EINV2 for
joint SED and DoA estimation outperforms previous methods by a large margin,
and has performance comparable to state-of-the-art ensemble models.
Comment: 5 pages, 2021 IEEE International Conference on Acoustics, Speech and
Signal Processing
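One common way to realize soft parameter-sharing between the SED and DoA branches is a cross-stitch-style mixing of their feature maps. The sketch below illustrates that general idea with fixed, hand-picked mixing weights; it is not necessarily the exact EINV2 design, where such weights would be learned:

```python
import numpy as np

class CrossStitch:
    """Soft parameter-sharing between two task branches.

    Instead of forcing the SED and DoA branches to share layers (hard
    sharing), a small mixing matrix combines the two branches'
    features, letting each task borrow as much of the other as helps.
    The 2x2 weights here are illustrative constants, not learned.
    """
    def __init__(self, alpha=0.9, beta=0.1):
        # alpha: weight on a branch's own features; beta: on the other's
        self.m = np.array([[alpha, beta], [beta, alpha]])

    def __call__(self, feat_sed, feat_doa):
        mixed_sed = self.m[0, 0] * feat_sed + self.m[0, 1] * feat_doa
        mixed_doa = self.m[1, 0] * feat_sed + self.m[1, 1] * feat_doa
        return mixed_sed, mixed_doa
```

With beta = 0 the unit degenerates to fully separate branches; larger beta moves the model toward hard sharing, which is the trade-off soft sharing is meant to tune.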
Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
Sound event localization and detection is a novel area of research that
emerged from the combined interest in analyzing the acoustic scene in terms of
the spatial and temporal activity of sounds of interest. This paper presents an
overview of the first international evaluation on sound event localization and
detection, organized as a task of the DCASE 2019 Challenge. A large-scale
realistic dataset of spatialized sound events was generated for the challenge,
to be used for training of learning-based approaches, and for evaluation of the
submissions in an unlabeled subset. The overview presents in detail how the
systems were evaluated and ranked and the characteristics of the
best-performing systems. Common strategies in terms of input features, model
architectures, training approaches, exploitation of prior knowledge, and data
augmentation are discussed. Since ranking in the challenge was based on
individually evaluating localization and event classification performance, part
of the overview focuses on presenting metrics for the joint measurement of the
two, together with a reevaluation of submissions using these new metrics. The
new analysis reveals submissions that performed better on the joint task of
detecting the correct type of event close to its original location than some of
the submissions that were ranked higher in the challenge. Consequently, the
ranking of submissions that performed strongly when evaluated separately on
detection or localization, but not jointly on both, was negatively affected.
A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection
In this paper, we propose a novel four-stage data augmentation approach to
ResNet-Conformer based acoustic modeling for sound event localization and
detection (SELD). First, we explore two spatial augmentation techniques, namely
audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with
data sparsity in SELD. ACS and MCS focus on augmenting the limited training
data with expanding direction of arrival (DOA) representations such that the
acoustic models trained with the augmented data are robust to localization
variations of acoustic sources. Next, time-domain mixing (TDM) and
time-frequency masking (TFM) are also investigated to deal with overlapping
sound events and data diversity. Finally, ACS, MCS, TDM and TFM are combined in
a step-by-step manner to form an effective four-stage data augmentation scheme.
Tested on the Detection and Classification of Acoustic Scenes and Events
(DCASE) 2020 data sets, our proposed augmentation approach greatly improves the
system performance, ranking our submitted system in the first place in the SELD
task of DCASE 2020 Challenge. Furthermore, we employ a ResNet-Conformer
architecture to model both global and local context dependencies of an audio
sequence to yield further gains over those architectures used in the DCASE 2020
SELD evaluations.
Comment: 12 pages, 8 figures
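To make the channel-swapping idea concrete, here is a sketch of one such spatial augmentation for first-order Ambisonics. The channel ordering (W, Y, Z, X) and the specific reflection are assumed conventions chosen for illustration, not the exact transforms used in the paper:

```python
import numpy as np

def mirror_foa_left_right(foa, azi_deg, ele_deg):
    """One audio-channel-swapping (ACS) style augmentation for FOA.

    foa: array of shape (4, samples) in assumed channel order
    (W, Y, Z, X). Reflecting the sound field across the x-z plane
    negates the Y (left-right) channel, and the DoA label stays
    consistent if the azimuth flips sign while elevation is kept.
    This yields a new training example with a different DoA at no
    recording cost, which is the point of ACS-style augmentation.
    """
    out = foa.copy()
    out[1] = -out[1]                  # negate Y: mirror left/right
    return out, -azi_deg, ele_deg     # azimuth flips, elevation unchanged
```

Composing several such reflections and axis swaps is what expands the coverage of DOA representations in the training data.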
Joint Measurement of Localization and Detection of Sound Events
Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent sound event detection methods also approach the localization side, but they lack a consistent way of measuring the joint performance of the system; instead, they measure the abilities for detection and for localization separately. This paper proposes augmenting the localization metrics with a condition related to the detection and, conversely, using location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. Comparison with detection-only and localization-only performance shows that the proposed joint metrics operate in a consistent and logical manner and adequately characterize both aspects.
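The idea of a location-dependent detection metric can be sketched as follows. This is a deliberately simplified, azimuth-only illustration with greedy matching, not the exact metric definition from the paper:

```python
def location_dependent_f1(preds, refs, thresh_deg=20.0):
    """Joint SELD metric sketch: detection with a spatial condition.

    preds, refs: lists of (class_id, azimuth_deg) tuples for one frame.
    A prediction counts as a true positive only if its class matches a
    reference event AND its azimuth is within `thresh_deg` of it, so a
    correctly classified but badly localized event is penalized.
    """
    refs = list(refs)   # copy: each reference may be matched at most once
    tp = 0
    for cls, azi in preds:
        for i, (rcls, razi) in enumerate(refs):
            diff = abs((azi - razi + 180) % 360 - 180)  # wrap to [0, 180]
            if cls == rcls and diff <= thresh_deg:
                tp += 1
                refs.pop(i)
                break
    fp = len(preds) - tp    # predictions with no matching reference
    fn = len(refs)          # references left unmatched
    return 2 * tp / max(2 * tp + fp + fn, 1)
```

Under this scoring, a system that detects the right class 40 degrees away from the true source scores zero for that event, whereas a detection-only F-score would have counted it as correct; that gap is exactly what the joint metrics are designed to expose.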