Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions
Localizing a moving sound source in the real world involves determining its
direction-of-arrival (DOA) and distance relative to a microphone. Advancements
in DOA estimation have been facilitated by data-driven methods optimized on large open-source datasets of microphone array recordings in diverse
environments. In contrast, estimating a sound source's distance remains
understudied. Existing approaches assume recordings by non-coincident microphones in order to use methods that are susceptible to differences in room
reverberation. We present a CRNN able to estimate the distance of moving sound
sources across multiple datasets featuring diverse rooms, outperforming a
recently published approach. We also characterize our model's performance as a function of sound source distance and of different training losses. This analysis reveals that training works best with a loss that weighs model errors as an inverse function of the sound source's true distance. Our study is the first to demonstrate that sound source distance estimation can be performed across diverse acoustic conditions using deep learning.
Comment: Accepted in WASPAA 202
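The inverse-distance weighting lends itself to a compact sketch. The snippet below assumes an L1 base loss and a small stabilizing constant `eps`; both are our assumptions, and the paper's exact formulation may differ.

```python
import torch

def inverse_distance_l1(pred: torch.Tensor, target: torch.Tensor,
                        eps: float = 0.1) -> torch.Tensor:
    # Weigh each error by the inverse of the true source distance, so that
    # errors on nearby sources are penalized more than equal-sized errors
    # on distant ones. `eps` avoids division by zero for very close sources.
    weights = 1.0 / (target + eps)
    return torch.mean(weights * torch.abs(pred - target))
```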
L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing
The L3DAS21 Challenge is aimed at encouraging and fostering collaborative
research on machine learning for 3D audio signal processing, with particular
focus on 3D speech enhancement (SE) and 3D sound localization and detection
(SELD). Alongside the challenge, we release the L3DAS21 dataset, a 65-hour 3D audio corpus, accompanied by a Python API that facilitates data usage and the results submission stage. Usually, machine learning approaches to 3D
audio tasks are based on single-perspective Ambisonics recordings or on arrays
of single-capsule microphones. We propose, instead, a novel multichannel audio configuration based on multiple-source and multiple-perspective Ambisonics recordings, performed with an array of two first-order Ambisonics microphones.
To the best of our knowledge, it is the first time that a dual-mic Ambisonics
configuration is used for these tasks. We provide baseline models and results
for both tasks, obtained with state-of-the-art architectures: FaSNet for SE and
SELDNet for SELD. This report is aimed at providing all needed information to
participate in the L3DAS21 Challenge, illustrating the details of the L3DAS21
dataset, the challenge tasks and the baseline models.
Comment: Documentation paper for the L3DAS21 Challenge for IEEE MLSP 2021. Further information on www.l3das.com/mlsp2021
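For orientation, the dual-mic configuration amounts to treating the two first-order Ambisonics captures as one 8-channel input. A minimal sketch of such loading follows, with placeholder file paths and `soundfile` for I/O; the released Python API is the supported route, and this is not it.

```python
import numpy as np
import soundfile as sf

def load_dual_foa(path_mic_a: str, path_mic_b: str):
    # Each file holds a 4-channel first-order Ambisonics (W, X, Y, Z)
    # recording; stack the two microphones into an 8-channel array.
    a, sr_a = sf.read(path_mic_a)  # shape: (samples, 4)
    b, sr_b = sf.read(path_mic_b)
    assert sr_a == sr_b, "both microphones are expected to share a sample rate"
    n = min(len(a), len(b))
    return np.concatenate([a[:n], b[:n]], axis=1), sr_a  # (samples, 8)
```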
Joint Direction and Proximity Classification of Overlapping Sound Events from Binaural Audio
Sound source proximity and distance estimation are of great interest in many practical applications, since they provide significant information for acoustic scene analysis. As both tasks share complementary qualities, ensuring efficient interaction between the two is crucial for a complete picture of an aural environment. In this paper, we investigate several ways of performing joint proximity and direction estimation from binaural recordings, both defined as coarse classification problems based on Deep Neural Networks (DNNs). Considering the limitations of binaural audio, we propose two methods of splitting the sphere into angular areas in order to obtain a set of directional classes. For each method we study different model types to acquire information about the direction-of-arrival (DoA). Finally, we propose various ways of combining the proximity and direction estimation problems into a joint task providing temporal information about the onsets and offsets of the appearing sources. Experiments are performed on a synthetic reverberant binaural dataset consisting of up to two overlapping sound events.
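As an illustration of turning DoA into coarse classes, one plausible splitting (our assumption, not necessarily either of the paper's two methods) quantizes azimuth into equal sectors and elevation into bands:

```python
import numpy as np

def azimuth_sector(azimuth_deg: float, n_sectors: int = 8) -> int:
    # Quantize azimuth into n_sectors equal angular sectors.
    return int((azimuth_deg % 360.0) // (360.0 / n_sectors))

def direction_class(azimuth_deg: float, elevation_deg: float,
                    n_sectors: int = 8, n_bands: int = 3) -> int:
    # Combine an azimuth sector with an elevation band into one class index.
    band = int(np.clip((elevation_deg + 90.0) / 180.0 * n_bands,
                       0, n_bands - 1))
    return band * n_sectors + azimuth_sector(azimuth_deg, n_sectors)
```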
A Sequence Matching Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event detection and direction-of-arrival estimation require
different input features from audio signals. While sound event detection mainly
relies on time-frequency patterns, direction-of-arrival estimation relies on
magnitude or phase differences between microphones. Previous approaches use the
same input features for sound event detection and direction-of-arrival
estimation, and train the two tasks jointly or in a two-stage transfer-learning
manner. We propose a two-step approach that decouples the learning of the sound
event detection and direction-of-arrival estimation systems. In the first
step, we detect the sound events and estimate the directions-of-arrival
separately to optimize the performance of each system. In the second step, we
train a deep neural network to match the two output sequences of the event
detector and the direction-of-arrival estimator. This modular and hierarchical approach allows flexibility in the system design and increases the performance of the whole sound event localization and detection system. The
experimental results using the DCASE 2019 sound event localization and
detection dataset show an improved performance compared to the previous
state-of-the-art solutions.
Comment: To be published in the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
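To make the matching step concrete, here is a toy stand-in for the second-step network: a recurrent model that reads the SED and DOA output sequences side by side and emits per-frame association logits. Dimensions and architecture are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SequenceMatcher(nn.Module):
    """Toy sequence-matching network for SED and DOA output streams."""

    def __init__(self, n_classes: int = 11, n_doa_feats: int = 3,
                 hidden: int = 64):
        super().__init__()
        # Read both frame-wise output sequences jointly.
        self.gru = nn.GRU(n_classes + n_doa_feats, hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, sed_seq: torch.Tensor, doa_seq: torch.Tensor):
        # sed_seq: (batch, frames, n_classes) event-activity probabilities
        # doa_seq: (batch, frames, n_doa_feats) direction estimates
        x = torch.cat([sed_seq, doa_seq], dim=-1)
        out, _ = self.gru(x)
        return self.head(out)  # per-frame association logits
```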
Robust sound event detection in bioacoustic sensor networks
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs),
can record sounds of wildlife over long periods of time in scalable and
minimally invasive ways. Deriving per-species abundance estimates from these
sensors requires detection, classification, and quantification of animal
vocalizations as individual acoustic events. Yet, variability in ambient noise,
both over time and across sensors, hinders the reliability of current automated
systems for sound event detection (SED), such as convolutional neural networks
(CNNs) in the time-frequency domain. In this article, we develop, benchmark, and
combine several machine listening techniques to improve the generalizability of
SED models across heterogeneous acoustic environments. As a case study, we
consider the problem of detecting avian flight calls from a ten-hour recording
of nocturnal bird migration, recorded by a network of six ARUs in the presence
of heterogeneous background noise. Starting from a CNN yielding
state-of-the-art accuracy on this task, we introduce two noise adaptation
techniques, respectively integrating short-term (60 milliseconds) and long-term
(30 minutes) context. First, we apply per-channel energy normalization (PCEN)
in the time-frequency domain, which applies short-term automatic gain control
to every subband in the mel-frequency spectrogram. Second, we replace the last dense layer in the network with a context-adaptive neural network (CA-NN)
layer. Combining them yields state-of-the-art results that are unmatched by
artificial data augmentation alone. We release a pre-trained version of our
best performing system under the name of BirdVoxDetect, a ready-to-use detector
of avian flight calls in field recordings.
Comment: 32 pages, in English. Submitted to the PLOS ONE journal in February 2019; revised August 2019; published October 2019.
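PCEN is available in librosa, so the short-term adaptation step can be sketched directly. The file name and spectrogram parameters below are placeholders, with the time constant set near the 60 ms context the article mentions.

```python
import librosa

# Load a (placeholder) field recording and compute a mel spectrogram.
y, sr = librosa.load("field_recording.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512)

# PCEN applies short-term automatic gain control to every mel subband,
# flattening slowly varying background noise before the CNN sees it.
# The (2**31) scaling follows librosa's recommendation for the default
# gain/bias settings; time_constant ~ 60 ms follows the abstract.
pcen = librosa.pcen(mel * (2 ** 31), sr=sr, hop_length=512,
                    time_constant=0.06)
```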
Audio-based localization for ubiquitous sensor networks
Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2005. Includes bibliographical references (p. 97-101).

This research presents novel techniques for acoustic-source location for both actively triggered and passively detected signals, using pervasive, distributed networks of devices, and investigates the combination of existing resources available in personal electronics to build a digital sensing 'commons'. By connecting personal resources with those of the people nearby, tasks can be achieved, through distributed placement and statistical improvement, that a single device could not do alone. The utility and benefits of spatio-temporal acoustic sensing are presented in the context of ubiquitous computing and machine listening history.

An active audio self-localisation algorithm is described which is effective in distributed sensor networks even if only coarse temporal synchronisation can be established. Pseudo-noise 'chirps' are emitted and recorded at each of the nodes. Pair-wise distances are calculated by comparing the difference in the audio delays between the peaks measured in each recording. By removing dependence on fine-grained temporal synchronisation, it is hoped that this technique can be used concurrently across a wide range of devices to better leverage the existing audio sensing resources that surround us.

A passive acoustic source location estimation method is then derived which is suited to the microphone resources of network-connected heterogeneous devices containing asynchronous processors and uncalibrated sensors. Under these constraints, position coordinates must be simultaneously determined for pairs of sounds recorded at each microphone to form a chain of acoustic events. It is shown that an iterative, numerical least-squares estimator can be used. Initial position estimates of the source pair can first be found from the previous estimate in the chain and a closed-form least-squares approach, improving the convergence rate of the second step.

Implementations of these methods using the Smart Architectural Surfaces development platform are described and assessed. The viability of the active ranging technique is further demonstrated in a mixed-device ad-hoc sensor network case using existing off-the-shelf technology. Finally, drawing on human-centric onset detection as a means of discovering suitable sound features to be passed between nodes for comparison, the extension of the source location algorithm beyond the use of pseudo-noise test sounds, to enable the location of extraneous noises and acoustic streams, is discussed for further study.

Benjamin Christopher Dalton. S.M.
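The pair-wise ranging idea admits a compact sketch. Assuming each node records both its own chirp and the other node's, matched filtering locates the two peaks in each recording, and the difference of the inter-peak intervals cancels the unknown clock offset between the loosely synchronised nodes. Constants and the masking window below are our assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room temperature

def two_peak_interval(recording: np.ndarray, chirp: np.ndarray,
                      sr: int) -> int:
    """Sample interval between the two chirp peaks in one recording."""
    corr = np.abs(np.correlate(recording, chirp, mode="valid"))
    first = int(np.argmax(corr))
    # Mask ~100 ms around the first peak before finding the second one.
    corr[max(0, first - sr // 10):first + sr // 10] = 0.0
    second = int(np.argmax(corr))
    return abs(second - first)

def pairwise_distance(rec_a: np.ndarray, rec_b: np.ndarray,
                      chirp: np.ndarray, sr: int) -> float:
    # The difference of the two inter-peak intervals leaves twice the
    # one-way time of flight, independent of the nodes' clock offset.
    delta = two_peak_interval(rec_a, chirp, sr) - two_peak_interval(rec_b, chirp, sr)
    return abs(delta) / (2.0 * sr) * SPEED_OF_SOUND
```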
Tracking interacting targets in multi-modal sensors
PhD
Object tracking is one of the fundamental tasks in various applications such as surveillance,
sports, video conferencing and activity recognition. Factors such as occlusions,
illumination changes and the limited field of observation of the sensor make tracking a challenging task. To overcome these challenges, the focus of this thesis is on using multiple
modalities such as audio and video for multi-target, multi-modal tracking. Particularly,
this thesis presents contributions to four related research topics, namely, pre-processing of
input signals to reduce noise, multi-modal tracking, simultaneous detection and tracking,
and interaction recognition.
To improve the performance of detection algorithms, especially in the presence
of noise, this thesis investigates filtering of the input data through spatio-temporal feature analysis as well as through frequency-band analysis. The pre-processed data from multiple modalities is then fused within a Particle Filtering (PF) framework. To further minimise the discrepancy
between the real and the estimated positions, we propose a strategy that associates the
hypotheses and the measurements with a real target, using a Weighted Probabilistic Data
Association (WPDA). Since the filtering involved in the detection process reduces the
available information and is inapplicable to low signal-to-noise-ratio data, we investigate simultaneous detection and tracking approaches and propose a multi-target track-before-detect Particle Filter (MT-TBD-PF). The proposed MT-TBD-PF algorithm bypasses
the detection step and performs tracking in the raw signal. Finally, we apply the proposed
multi-modal tracking to recognise interactions between targets in regions within, as well
as outside the cameras’ fields of view.
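As a reference point for the fusion backbone, a single bootstrap particle-filter step might look as follows. This is a generic sketch with a Gaussian likelihood and random-walk motion model (both our assumptions), not the thesis's WPDA or MT-TBD-PF machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles: np.ndarray, weights: np.ndarray,
            measurement: np.ndarray,
            motion_std: float = 0.5, meas_std: float = 2.0):
    """One bootstrap particle-filter step for a 2-D position track."""
    # Predict: diffuse particles under a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: reweight by a Gaussian measurement likelihood; with several
    # modalities, each likelihood would multiply into the same update.
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    weights = weights * np.exp(-0.5 * d2 / meas_std ** 2)
    weights /= weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```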
The efficiency of the proposed approaches is demonstrated on large uni-modal, multi-modal and multi-sensor scenarios from real-world detection, tracking and event recognition datasets, and through participation in evaluation campaigns.