Regression and Classification for Direction-of-Arrival Estimation with Convolutional Recurrent Neural Networks
We present a novel learning-based approach to estimate the
direction-of-arrival (DOA) of a sound source using a convolutional recurrent
neural network (CRNN) trained via regression on synthetic data and Cartesian
labels. We also describe an improved method to generate synthetic data to train
the neural network using state-of-the-art sound propagation algorithms that
model specular as well as diffuse reflections of sound. We compare our model
against three other CRNNs trained using different formulations of the same
problem: classification on categorical labels and regression on spherical
coordinate labels. In practice, our model achieves up to a 43% decrease in
angular error over prior methods. Modeling diffuse reflections yields 34%
and 41% reductions in angular prediction error on the LOCATA and SOFA datasets,
respectively, over prior approaches based on the image-source method. Our method
achieves an additional 3% error reduction over prior schemes that use
classification-based networks, while using 36% fewer network parameters.
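As a minimal illustration of the regression formulation (our sketch, not the authors' code; the function names and conventions are assumptions), the snippet below converts spherical DOA labels into the Cartesian unit-vector targets used for regression and measures the angular error between a predicted and a reference direction:

```python
import numpy as np

def spherical_to_cartesian(azimuth_rad: float, elevation_rad: float) -> np.ndarray:
    """Map (azimuth, elevation) on the unit sphere to a Cartesian unit vector."""
    return np.array([
        np.cos(elevation_rad) * np.cos(azimuth_rad),
        np.cos(elevation_rad) * np.sin(azimuth_rad),
        np.sin(elevation_rad),
    ])

def angular_error_deg(pred: np.ndarray, ref: np.ndarray) -> float:
    """Great-circle angle in degrees between two direction vectors."""
    cos_angle = np.dot(pred, ref) / (np.linalg.norm(pred) * np.linalg.norm(ref))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Example: a prediction 5 degrees off in azimuth from the true direction.
ref = spherical_to_cartesian(np.radians(30.0), np.radians(10.0))
pred = spherical_to_cartesian(np.radians(35.0), np.radians(10.0))
print(angular_error_deg(pred, ref))  # ~4.92 degrees
```

A common motivation for Cartesian targets is that they sidestep the 360-degree wrap-around discontinuity that azimuth labels introduce for regression losses.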
A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection
In this paper, we propose a novel four-stage data augmentation approach to
ResNet-Conformer based acoustic modeling for sound event localization and
detection (SELD). First, we explore two spatial augmentation techniques, namely
audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with
data sparsity in SELD. ACS and MCS focus on augmenting the limited training
data by expanding the direction-of-arrival (DOA) representations so that the
acoustic models trained with the augmented data are robust to localization
variations of acoustic sources. Next, time-domain mixing (TDM) and
time-frequency masking (TFM) are also investigated to deal with overlapping
sound events and data diversity. Finally, ACS, MCS, TDM and TFM are combined in
a step-by-step manner to form an effective four-stage data augmentation scheme.
Tested on the Detection and Classification of Acoustic Scenes and Events
(DCASE) 2020 datasets, our proposed augmentation approach greatly improves
system performance, ranking our submitted system first in the SELD
task of the DCASE 2020 Challenge. Furthermore, we employ a ResNet-Conformer
architecture to model both global and local context dependencies of an audio
sequence to yield further gains over those architectures used in the DCASE 2020
SELD evaluations.
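To make the waveform- and spectrogram-level stages concrete, here is a minimal sketch of time-domain mixing (TDM) and a SpecAugment-style time-frequency mask (TFM); the shapes, gain range, and mask sizes are our assumptions for illustration, not the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def time_domain_mix(wave_a: np.ndarray, wave_b: np.ndarray) -> np.ndarray:
    """Mix two multichannel waveforms to synthesize overlapping events (TDM)."""
    gain = rng.uniform(0.3, 1.0)  # assumed gain range, illustration only
    return wave_a + gain * wave_b

def time_frequency_mask(spec: np.ndarray, max_f: int = 8, max_t: int = 20) -> np.ndarray:
    """Zero one random frequency stripe and one time stripe of a (freq, time) spectrogram (TFM)."""
    spec = spec.copy()
    f0 = rng.integers(0, spec.shape[0] - max_f)
    t0 = rng.integers(0, spec.shape[1] - max_t)
    spec[f0:f0 + rng.integers(1, max_f), :] = 0.0
    spec[:, t0:t0 + rng.integers(1, max_t)] = 0.0
    return spec

# Example shapes: 4-channel waveforms and a 64-bin mel spectrogram.
mixed = time_domain_mix(rng.standard_normal((4, 24000)), rng.standard_normal((4, 24000)))
masked = time_frequency_mask(rng.standard_normal((64, 300)))
```

Under TDM, the label set of the mixture would be the union of the two clips' event and DOA labels, which is what exposes the model to overlapping events.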
Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
Sound event localization and detection is a novel area of research that
emerged from the combined interest in analyzing the acoustic scene in terms of
the spatial and temporal activity of sounds of interest. This paper presents an
overview of the first international evaluation on sound event localization and
detection, organized as a task of the DCASE 2019 Challenge. A large-scale
realistic dataset of spatialized sound events was generated for the challenge,
to be used for training learning-based approaches and for evaluating the
submissions on an unlabeled subset. The overview presents in detail how the
systems were evaluated and ranked, along with the characteristics of the
best-performing systems. Common strategies in terms of input features, model
architectures, training approaches, exploitation of prior knowledge, and data
augmentation are discussed. Since ranking in the challenge was based on
individually evaluating localization and event classification performance, part
of the overview focuses on presenting metrics for the joint measurement of the
two, together with a reevaluation of submissions using these new metrics. The
new analysis reveals submissions that performed better on the joint task of
detecting the correct type of event close to its original location than some of
the submissions that were ranked higher in the challenge. Consequently, the
ranking of submissions that performed strongly when evaluated separately on
detection or localization, but not jointly on both, was affected negatively.
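A joint metric of the kind proposed can be sketched as follows; this is a simplified reading, not the exact challenge metric, and the 20-degree threshold is used purely as an example. A prediction counts as a true positive only when the event class matches and its DOA lies within the angular threshold of the reference:

```python
import numpy as np

def joint_f1(preds, refs, doa_threshold_deg: float = 20.0) -> float:
    """preds/refs: lists of (class_id, unit_direction) pairs for one frame."""
    tp, matched = 0, set()
    for cls_p, dir_p in preds:
        for i, (cls_r, dir_r) in enumerate(refs):
            if i in matched or cls_p != cls_r:
                continue  # already matched, or wrong event class
            angle = np.degrees(np.arccos(np.clip(np.dot(dir_p, dir_r), -1.0, 1.0)))
            if angle <= doa_threshold_deg:
                tp += 1
                matched.add(i)
                break
    fp, fn = len(preds) - tp, len(refs) - tp
    return 2 * tp / max(2 * tp + fp + fn, 1)  # F1 = 2TP / (2TP + FP + FN)
```

Scoring detection and localization jointly in this way penalizes systems that detect the correct classes in the wrong places, which is what drives the ranking shifts observed in the reevaluation.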
Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection involves not only detecting
what sound events are happening but also localizing the corresponding sound
sources. This series of tasks was first introduced in DCASE 2019 Task 3. In
2020, the sound event localization and detection task introduced additional
challenges: moving sound sources and overlapping-event cases, including two
events of the same type with two different direction-of-arrival (DoA) angles. In this
paper, a novel event-independent network for polyphonic sound event
localization and detection is proposed. Unlike the two-stage method we proposed
in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the
network are first-order Ambisonics (FOA) time-domain signals, which are then
fed into a 1-D convolutional layer to extract acoustic features. The network is
then split into two parallel branches. The first branch is for sound event
detection (SED), and the second branch is for DoA estimation. There are three
types of predictions from the network: SED predictions, DoA predictions, and
event activity detection (EAD) predictions, which are used to combine the SED
and DoA features for onset and offset estimation. All of these predictions have
the format of two tracks, indicating that there are at most two overlapping
events. Within each track, there could be at most one event happening. This
architecture introduces a problem of track permutation. To address this
problem, a frame-level permutation invariant training method is used.
Experimental results show that the proposed method can detect polyphonic sound
events and their corresponding DoAs. Its performance on the Task 3 dataset is
greatly improved compared with that of the baseline method.
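The frame-level permutation invariant training step can be sketched as follows; the tensor shapes and the MSE loss are our assumptions for illustration, not necessarily the paper's exact loss. The loss is evaluated under both possible track assignments, and the cheaper permutation is kept independently at every frame:

```python
import torch

def frame_level_pit_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (batch, time, tracks=2, dims) joint SED/DoA output vectors."""
    # Per-frame loss for the identity track assignment ...
    loss_id = torch.nn.functional.mse_loss(
        pred, target, reduction="none").mean(dim=-1).sum(dim=-1)
    # ... and for the swapped assignment (tracks reversed along dim 2).
    loss_sw = torch.nn.functional.mse_loss(
        pred, target.flip(dims=[2]), reduction="none").mean(dim=-1).sum(dim=-1)
    # Keep the cheaper permutation at every frame, then average over frames.
    return torch.minimum(loss_id, loss_sw).mean()
```

With only two tracks there are just 2! = 2 permutations, so the explicit minimum is cheap; more tracks would require enumerating all permutations or using a matching algorithm.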
L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing
The L3DAS21 Challenge is aimed at encouraging and fostering collaborative
research on machine learning for 3D audio signal processing, with particular
focus on 3D speech enhancement (SE) and 3D sound localization and detection
(SELD). Alongside the challenge, we release the L3DAS21 dataset, a 65-hour
3D audio corpus, accompanied by a Python API that facilitates data usage
and the results submission stage. Usually, machine learning approaches to 3D
audio tasks are based on single-perspective Ambisonics recordings or on arrays
of single-capsule microphones. We propose, instead, a novel multichannel audio
configuration based on multiple-source and multiple-perspective Ambisonics
recordings, performed with an array of two first-order Ambisonics microphones.
To the best of our knowledge, this is the first time that a dual-mic Ambisonics
configuration is used for these tasks. We provide baseline models and results
for both tasks, obtained with state-of-the-art architectures: FaSNet for SE and
SELDNet for SELD. This report is aimed at providing all needed information to
participate in the L3DAS21 Challenge, illustrating the details of the L3DAS21
dataset, the challenge tasks, and the baseline models.
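As a rough sketch of how the dual-microphone configuration could translate into a network input (the file names and channel layout here are hypothetical; the released Python API provides the actual loaders), the two 4-channel FOA recordings can be stacked into a single 8-channel array:

```python
import numpy as np
import soundfile as sf  # assumed dependency for reading multichannel WAV files

def load_dual_foa(path_mic_a: str, path_mic_b: str) -> np.ndarray:
    """Read two 4-channel FOA recordings and stack them to (8, samples)."""
    wave_a, sr_a = sf.read(path_mic_a, always_2d=True)  # (samples, 4)
    wave_b, sr_b = sf.read(path_mic_b, always_2d=True)
    assert sr_a == sr_b and wave_a.shape == wave_b.shape
    return np.concatenate([wave_a.T, wave_b.T], axis=0)

# x = load_dual_foa("recording_micA.wav", "recording_micB.wav")  # hypothetical paths
```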