Feature Aggregation in Joint Sound Classification and Localization Neural Networks
This study addresses the application of deep learning techniques in joint
sound signal classification and localization networks. Current state-of-the-art
sound source localization (SSL) deep learning networks lack feature aggregation
within their architecture. Feature aggregation enhances model performance by
enabling the consolidation of information from different feature scales,
thereby improving feature robustness and invariance. This is particularly
important in SSL networks, which must differentiate direct and indirect
acoustic signals. To address this gap, we adapt feature aggregation techniques
from computer vision neural networks to signal detection neural networks.
Additionally, we propose the Scale Encoding Network (SEN) for feature
aggregation to encode features from various scales, compressing the network for
more computationally efficient aggregation. To evaluate the efficacy of feature
aggregation in SSL networks, we integrated the following computer vision
feature aggregation sub-architectures into an SSL control architecture: Path
Aggregation Network (PANet), Weighted Bi-directional Feature Pyramid Network
(BiFPN), and SEN. These sub-architectures were evaluated using two metrics for
signal classification and two metrics for direction-of-arrival regression.
PANet and BiFPN are established aggregators in computer vision models, while
the proposed SEN is a more compact aggregator. The results suggest that models
incorporating feature aggregations outperformed the control model, the Sound
Event Localization and Detection network (SELDnet), in both sound signal
classification and localization. The feature aggregation techniques enhance the
performance of sound detection neural networks, particularly in
direction-of-arrival regression.
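The weighted fusion at the heart of aggregators such as BiFPN can be sketched in a few lines. This is a minimal NumPy illustration of BiFPN-style "fast normalized fusion", not code from the paper; the function name and toy feature maps are our own assumptions:

```python
import numpy as np

def weighted_fusion(features, weights, eps=1e-4):
    """BiFPN-style fast normalized fusion: combine same-shape feature
    maps with non-negative weights normalized to sum to ~1."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (w.sum() + eps)                      # normalize weights
    return sum(wi * f for wi, f in zip(w, features))

# Two toy 4x4 "feature maps" from different scales, already resized to match
f1 = np.ones((4, 4))
f2 = 3.0 * np.ones((4, 4))
fused = weighted_fusion([f1, f2], [1.0, 1.0])    # ~elementwise mean here
```

In a real network the weights are learnable parameters and the features come from different pyramid levels after resampling to a common resolution.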
Source localization in reverberant rooms using Deep Learning and microphone arrays
Sound source localization (SSL) has been a subject of active research in multi-channel signal processing for many years, and can benefit from the emergence of data-driven approaches. In this paper, we present our recent developments on the use of a deep neural network, fed with raw multichannel audio, to achieve sound source localization in reverberant and noisy environments. This paradigm avoids the simplifying assumptions about source and propagation models that most traditional localization methods rely on. However, for an efficient training process, supervised machine learning algorithms require large, precisely labelled datasets. There is therefore a critical need to generate large amounts of audio data recorded by microphone arrays in various environments. Whether the dataset is built with numerical simulations or with experimental 3D sound-field synthesis, physical validity is also critical. We therefore present an efficient GPU-based tensor computation of synthetic room impulse responses using fractional delays for image-source models, and analyze the localization performance of the proposed neural network trained on this dataset, which yields a significant improvement in SSL accuracy over the traditional MUSIC and SRP-PHAT methods
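The image-source idea with fractional delays mentioned above can be sketched as follows. This is a toy NumPy illustration, not the paper's GPU implementation; the function name, filter length, and the two-path "room" are our own assumptions:

```python
import numpy as np

def frac_delay_filter(delay, length=64):
    """Windowed-sinc filter realizing a non-integer sample delay."""
    n = np.arange(length)
    return np.sinc(n - delay) * np.hanning(length)

fs, c = 16000.0, 343.0
rir = np.zeros(512)
# Direct path (1 m) plus one image source (2.5 m path, inverted reflection)
for dist, gain in [(1.0, 1.0), (2.5, -0.5)]:
    d = dist / c * fs                  # propagation delay in fractional samples
    i0 = int(d) - 32                   # place the 64-tap filter around the delay
    rir[i0:i0 + 64] += gain * frac_delay_filter(d - i0) / dist  # 1/r attenuation
```

A full image-source model sums thousands of such delayed, attenuated taps (one per mirror image), which is why a batched tensor formulation on GPU pays off.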
Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates
This paper presents a novel approach for indoor acoustic source localization
using microphone arrays and based on a Convolutional Neural Network (CNN). The
proposed solution is, to the best of our knowledge, the first published work in
which a CNN is designed to directly estimate the three-dimensional position of
an acoustic source from the raw audio signal, avoiding hand-crafted audio
features. Given the limited amount of available localization data, we propose a
two-step training strategy. We first train the network on semi-synthetic data
generated from close-talk speech recordings, in which we simulate the time
delays and distortion that the signal suffers as it propagates from the source
to the microphone array. We then fine-tune the network on a small amount of
real data. Our experimental results show that this strategy produces networks
that significantly outperform existing localization methods based on
\textit{SRP-PHAT} strategies. In addition, our experiments show that our CNN
method is more robust to the speaker's gender and to different window sizes
than the other methods.
Comment: 18 pages, 3 figures, 8 tables
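The delay-simulation step of the semi-synthetic data generation can be sketched as a frequency-domain fractional delay per microphone. This is a minimal free-field illustration with 1/r attenuation only; the function name and parameters are our own assumptions, not the paper's pipeline:

```python
import numpy as np

def simulate_array(signal, src, mics, fs=16000.0, c=343.0):
    """Delay and 1/r-attenuate a close-talk signal to each microphone
    via a frequency-domain fractional delay (free field, no reverberation)."""
    n = len(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec = np.fft.rfft(signal)
    chans = []
    for m in mics:
        r = np.linalg.norm(src - m)                   # source-to-mic distance
        shift = np.exp(-2j * np.pi * freqs * r / c)   # pure delay of r/c seconds
        chans.append(np.fft.irfft(spec * shift, n) / r)
    return np.stack(chans)

# Usage: an impulse reaching a single mic 10 samples later
sig = np.zeros(256); sig[0] = 1.0
src = np.array([343.0 * 10 / 16000.0, 0.0, 0.0])      # exactly 10 samples away
out = simulate_array(sig, src, np.array([[0.0, 0.0, 0.0]]))
```

The inter-microphone delay differences produced this way are precisely the cues a raw-waveform CNN must learn to exploit for position estimation.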
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
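As a concrete reference point for the log-mel representation highlighted in the review, here is a minimal NumPy sketch (HTK-style mel scale; all names and default parameters are our own choices, not from the article):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)       # HTK-style mel scale

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, fs=16000, n_fft=512, hop=256, n_mels=40):
    """Hann-windowed power STFT -> triangular mel filterbank -> log."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular filters with edges equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    return np.log(power @ fb.T + 1e-10)             # floor avoids log(0)
```

Libraries such as librosa provide tuned versions of this transform; the sketch only makes the time-frequency-mel pipeline explicit.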
Regression and Classification for Direction-of-Arrival Estimation with Convolutional Recurrent Neural Networks
We present a novel learning-based approach to estimate the
direction-of-arrival (DOA) of a sound source using a convolutional recurrent
neural network (CRNN) trained via regression on synthetic data and Cartesian
labels. We also describe an improved method to generate synthetic data to train
the neural network using state-of-the-art sound propagation algorithms that
model specular as well as diffuse reflections of sound. We compare our model
against three other CRNNs trained using different formulations of the same
problem: classification on categorical labels, and regression on spherical
coordinate labels. In practice, our model achieves up to 43% decrease in
angular error over prior methods. The use of diffuse reflection results in 34%
and 41% reduction in angular prediction errors on LOCATA and SOFA datasets,
respectively, over prior methods based on image-source methods. Our method
results in an additional 3% error reduction over prior schemes that use
classification-based networks, while using 36% fewer network parameters.