Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy
Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
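The masking step described in the abstract above can be illustrated with a small sketch. It assumes a squared-error loss on (azimuth, elevation) pairs and a per-frame, per-class 0/1 activity mask; the abstract does not state the loss actually used, and the function name `masked_doa_mse` is hypothetical.

```python
def masked_doa_mse(doa_pred, doa_true, sed_mask):
    """Mean squared DOA error over the (frame, class) cells where an event is active.

    doa_pred, doa_true: [frames][classes] lists of (azimuth, elevation) pairs.
    sed_mask: [frames][classes] list of 0/1 ground-truth event activity.
    Assumption: squared error; the paper's real loss may differ.
    """
    total, active = 0.0, 0
    for pred_f, true_f, mask_f in zip(doa_pred, doa_true, sed_mask):
        for p, t, m in zip(pred_f, true_f, mask_f):
            if m:  # inactive cells contribute nothing to the loss
                total += (p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2
                active += 1
    # Avoid dividing by zero when no event is active in the clip.
    return total / active if active else 0.0


# One frame, two classes: only the first class is active.
pred = [[(1.0, 0.0), (9.0, 9.0)]]
true = [[(0.0, 0.0), (0.0, 0.0)]]
mask = [[1, 0]]
loss = masked_doa_mse(pred, true, mask)
```

In this toy case the large error on the inactive class is ignored, so the localization branch is only penalized where the ground truth says an event is present.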
Polyphonic Sound Event Detection by using Capsule Neural Networks
Artificial sound event detection (SED) aims to mimic the human ability
to perceive and understand what is happening in the surroundings. Nowadays,
Deep Learning offers valuable techniques for this goal such as Convolutional
Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture has
been recently introduced in the image processing field with the intent to
overcome some of the known limitations of CNNs, specifically regarding the
scarce robustness to affine transformations (i.e., perspective, size,
orientation) and the detection of overlapped images. This motivated the authors
to employ CapsNets to deal with the polyphonic-SED task, in which multiple
sound events occur simultaneously. Specifically, we propose to exploit the
capsule units to represent a set of distinctive properties for each individual
sound event. Capsule units are connected through a so-called "dynamic routing"
that encourages learning part-whole relationships and improves the detection
performance in a polyphonic context. This paper reports extensive evaluations
carried out on three publicly available datasets, showing how the CapsNet-based
algorithm not only outperforms standard CNNs but also achieves the
best results with respect to state-of-the-art algorithms.
Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection involves not only detecting
which sound events are happening but also localizing the corresponding sound
sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020,
the sound event localization and detection task introduced additional challenges in
moving sound sources and overlapping-event cases, which include two events of
the same type with two different direction-of-arrival (DoA) angles. In this
paper, a novel event-independent network for polyphonic sound event
localization and detection is proposed. Unlike the two-stage method we proposed
in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the
network are first-order Ambisonics (FOA) time-domain signals, which are then
fed into a 1-D convolutional layer to extract acoustic features. The network is
then split into two parallel branches. The first branch is for sound event
detection (SED), and the second branch is for DoA estimation. There are three
types of predictions from the network: SED predictions, DoA predictions, and
event activity detection (EAD) predictions, which are used to combine the SED and
DoA features for onset and offset estimation. All of these predictions have
the format of two tracks indicating that there are at most two overlapping
events. Within each track, there could be at most one event happening. This
architecture introduces a problem of track permutation. To address this
problem, a frame-level permutation invariant training method is used.
Experimental results show that the proposed method can detect polyphonic sound
events and their corresponding DoAs. Its performance on the Task 3 dataset is
greatly improved compared with that of the baseline method.
Comment: conferenc
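Frame-level permutation-invariant training, as mentioned in the abstract above, can be sketched as follows. The sketch assumes a squared-error loss and a small number of tracks; the function name `frame_level_pit_loss` and the data layout are illustrative assumptions, not taken from the paper.

```python
from itertools import permutations


def frame_level_pit_loss(pred, target):
    """Average over frames of the lowest loss across all track permutations.

    pred, target: [frames][tracks][dims] nested lists.
    Assumption: squared error per track; the paper's loss may differ.
    """
    def sq_err(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total = 0.0
    for pred_f, tgt_f in zip(pred, target):
        n_tracks = len(pred_f)
        # Try every assignment of predicted tracks to target tracks
        # and keep the cheapest one for this frame only.
        total += min(
            sum(sq_err(pred_f[i], tgt_f[j]) for i, j in enumerate(perm))
            for perm in permutations(range(n_tracks))
        )
    return total / len(pred)


# Two frames, two tracks: the first frame has its tracks swapped.
pred = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
target = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]
loss = frame_level_pit_loss(pred, target)
```

Because the best permutation is chosen independently in each frame, the swapped tracks in the first frame are not penalized. With at most two tracks there are only two permutations per frame, so the exhaustive search is cheap.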
A Sequence Matching Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event detection and direction-of-arrival estimation require
different input features from audio signals. While sound event detection mainly
relies on time-frequency patterns, direction-of-arrival estimation relies on
magnitude or phase differences between microphones. Previous approaches use the
same input features for sound event detection and direction-of-arrival
estimation, and train the two tasks jointly or in a two-stage transfer-learning
manner. We propose a two-step approach that decouples the learning of the sound
event detection and direction-of-arrival estimation systems. In the first
step, we detect the sound events and estimate the directions-of-arrival
separately to optimize the performance of each system. In the second step, we
train a deep neural network to match the two output sequences of the event
detector and the direction-of-arrival estimator. This modular and hierarchical
approach allows flexibility in the system design and increases the
performance of the whole sound event localization and detection system. The
experimental results using the DCASE 2019 sound event localization and
detection dataset show an improved performance compared to the previous
state-of-the-art solutions.
Comment: to be published in 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP
An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection (SELD), which jointly
performs sound event detection (SED) and direction-of-arrival (DoA) estimation,
detects the type and occurrence time of sound events as well as their
corresponding DoA angles simultaneously. We study the SELD task from a
multi-task learning perspective. Two open problems are addressed in this paper.
Firstly, to detect overlapping sound events of the same type but with different
DoAs, we propose to use a trackwise output format and solve the accompanying
track permutation problem with permutation-invariant training. Multi-head
self-attention is further used to separate tracks. Secondly, a previous finding
is that, by using hard parameter-sharing, SELD suffers from a performance loss
compared with learning the subtasks separately. This is solved by a soft
parameter-sharing scheme. We term the proposed method Event-Independent
Network V2 (EINV2), which is an improved version of our previously-proposed
method and an end-to-end network for SELD. We show that our proposed EINV2 for
joint SED and DoA estimation outperforms previous methods by a large margin,
and has comparable performance to state-of-the-art ensemble models.
Comment: 5 pages, 2021 IEEE International Conference on Acoustics, Speech and
Signal Processin
Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
Sound event localization and detection is a novel area of research that
emerged from the combined interest of analyzing the acoustic scene in terms of
the spatial and temporal activity of sounds of interest. This paper presents an
overview of the first international evaluation on sound event localization and
detection, organized as a task of the DCASE 2019 Challenge. A large-scale
realistic dataset of spatialized sound events was generated for the challenge,
to be used for training of learning-based approaches, and for evaluation of the
submissions in an unlabeled subset. The overview presents in detail how the
systems were evaluated and ranked and the characteristics of the
best-performing systems. Common strategies in terms of input features, model
architectures, training approaches, exploitation of prior knowledge, and data
augmentation are discussed. Since ranking in the challenge was based on
individually evaluating localization and event classification performance, part
of the overview focuses on presenting metrics for the joint measurement of the
two, together with a reevaluation of submissions using these new metrics. The
new analysis reveals submissions that performed better on the joint task of
detecting the correct type of event close to its original location than some of
the submissions that were ranked higher in the challenge. Consequently, ranking
of submissions which performed strongly when evaluated separately on detection
or localization, but not jointly on both, was affected negatively.
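The idea of joint measurement described in the abstract above, counting a detection as correct only when the event type matches and the estimate is close to the reference location, can be sketched as below. The 20-degree threshold, the function names, and the `(class, azimuth, elevation)` tuple layout are assumptions made for illustration; the abstract does not specify them.

```python
import math


def angular_distance_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions given in degrees."""
    a1, e1, a2, e2 = (math.radians(v) for v in (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def is_location_aware_tp(pred, ref, threshold_deg=20.0):
    """True if the class labels match and the DOA error is within the threshold.

    pred, ref: (class_label, azimuth_deg, elevation_deg) tuples.
    threshold_deg is an assumed value, chosen here only for illustration.
    """
    same_class = pred[0] == ref[0]
    close = angular_distance_deg(pred[1], pred[2], ref[1], ref[2]) <= threshold_deg
    return same_class and close


hit = is_location_aware_tp(("speech", 10.0, 0.0), ("speech", 20.0, 0.0))
miss_far = is_location_aware_tp(("speech", 10.0, 0.0), ("speech", 50.0, 0.0))
miss_class = is_location_aware_tp(("speech", 10.0, 0.0), ("dog", 10.0, 0.0))
```

Under such a criterion, a system with strong detection but poor localization (or the reverse) accumulates fewer true positives than it would when the two are scored separately.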