Broadband DOA estimation using Convolutional neural networks trained with noise signals
A convolutional neural network (CNN) based classification method for broadband
DOA estimation is proposed, where the phase component of the short-time Fourier
transform coefficients of the received microphone signals is directly fed into
the CNN and the features required for DOA estimation are learnt during
training. Since only the phase component of the input is used, the CNN can be
trained with synthesized noise signals, thereby making the preparation of the
training data set easier compared to using speech signals. Through experimental
evaluation, the ability of the proposed noise-trained CNN framework to
generalize to speech sources is demonstrated. In addition, the robustness of
the system to noise and to small perturbations in microphone positions, as well as
its ability to adapt to different acoustic conditions, is investigated using
experiments with simulated and real data.
Comment: Published in Proceedings of IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA) 201
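The input representation described in this abstract can be illustrated with a minimal sketch. It computes the STFT phase per microphone for a synthesized noise signal. The function name `stft_phase`, the frame length, the hop size, the Hann window, and the four-microphone setup are assumptions made for illustration; the abstract does not state the parameters the authors used.

```python
import numpy as np

def stft_phase(signals, frame_len=256, hop=128):
    """Return the STFT phase of each channel (hypothetical helper).

    signals: array of shape (n_mics, n_samples).
    Output: array of shape (n_frames, n_mics, frame_len // 2 + 1), in radians.
    """
    signals = np.asarray(signals, dtype=float)
    n_mics, n_samples = signals.shape
    window = np.hanning(frame_len)
    n_frames = 1 + (n_samples - frame_len) // hop
    phase = np.empty((n_frames, n_mics, frame_len // 2 + 1))
    for t in range(n_frames):
        # Window one frame on every channel, then keep only the phase.
        frame = signals[:, t * hop : t * hop + frame_len] * window
        phase[t] = np.angle(np.fft.rfft(frame, axis=-1))
    return phase

# Synthesized white noise stands in for a training signal (assumed setup:
# 4 microphones, 16000 samples).
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 16000))
features = stft_phase(noise)
```

Each time frame then yields a microphones-by-frequency phase map, which is the kind of input a classifier over candidate DOA angles could consume. Because the magnitude is discarded, the spectral shape of the training signal matters less, which is consistent with the abstract's argument for training on noise.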
Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization
Sound event localization frameworks based on deep neural networks have shown
increased robustness with respect to reverberation and noise in comparison to
classical parametric approaches. In particular, recurrent architectures that
incorporate temporal context into the estimation process seem to be well-suited
for this task. This paper proposes a novel approach to sound event localization
by utilizing an attention-based sequence-to-sequence model. These types of
models have been successfully applied to problems in natural language
processing and automatic speech recognition. In this work, a multi-channel
audio signal is encoded to a latent representation, which is subsequently
decoded to a sequence of estimated directions-of-arrival. Here, attention
allows the model to capture temporal dependencies in the audio signal by focusing on
specific frames that are relevant for estimating the activity and
direction-of-arrival of sound events at the current time-step. The framework is
evaluated on three publicly available datasets for sound event localization. It
yields superior localization performance compared to state-of-the-art methods
in both anechoic and reverberant conditions.
Comment: Published in Proceedings of the 28th European Signal Processing
Conference (EUSIPCO), 202
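The idea of focusing on specific encoded frames can be sketched with a single attention step. The example below uses scaled dot-product attention, which is an assumption: the abstract does not say which attention variant the authors use. The function name `attend` and all array sizes are hypothetical.

```python
import numpy as np

def attend(query, encoder_states):
    """One attention step over encoded audio frames (illustrative only).

    query: decoder state of shape (d,).
    encoder_states: encoded frames of shape (n_frames, d).
    Returns the context vector (d,) and the attention weights (n_frames,).
    """
    scores = encoder_states @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()      # stabilise the softmax numerically
    weights = np.exp(scores)
    weights = weights / weights.sum()   # weights are positive and sum to 1
    context = weights @ encoder_states  # weighted average of the frames
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((50, 8))  # 50 frames, 8 features (assumed)
query = rng.standard_normal(8)
context, weights = attend(query, encoder_states)
```

Frames with larger weights contribute more to the context vector from which a decoder would estimate the direction-of-arrival at the current time step.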
SELD-TCN: Sound Event Localization & Detection via Temporal Convolutional Networks
The understanding of the surrounding environment plays a critical role in
autonomous robotic systems, such as self-driving cars. Extensive research has
been carried out concerning visual perception. Yet, to obtain a more complete
perception of the environment, autonomous systems of the future should also
take acoustic information into account. Recent sound event localization and
detection (SELD) frameworks utilize convolutional recurrent neural networks
(CRNNs). However, considering the recurrent nature of CRNNs, it becomes
challenging to implement them efficiently on embedded hardware. Not only are
their computations difficult to parallelize, but they also require high memory
bandwidth and large memory buffers. In this work, we develop a novel, more robust and
hardware-friendly architecture based on a temporal convolutional
network (TCN). The proposed framework (SELD-TCN) outperforms the
state-of-the-art SELDnet on four different datasets. Moreover,
SELD-TCN achieves 4x faster training time per epoch and 40x faster inference
time on an ordinary graphics processing unit (GPU).
Comment: 5 pages, 3 tables, 2 figures. Submitted to EUSIPCO 202
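The building block of a temporal convolutional network is a dilated 1-D convolution, which can be sketched as follows. A causal variant is shown for simplicity; the abstract does not say whether SELD-TCN uses causal or non-causal convolutions, and the names `causal_dilated_conv1d` and `receptive_field` are hypothetical.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_i kernel[i] * x[t - i * dilation], zero-padded on the left."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for i, w in enumerate(kernel):
        # kernel[i] weights the sample i * dilation steps in the past;
        # every output time step is updated in one vectorised operation.
        start = pad - i * dilation
        y += w * padded[start : start + len(x)]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples seen by a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

y = causal_dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2)
rf = receptive_field(3, [1, 2, 4, 8])
```

Unlike a recurrent layer, no output depends on a previously computed output, so all time steps can be evaluated in parallel. Stacking layers with growing dilation widens the temporal context without recurrence.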
Source localization in reverberant rooms using Deep Learning and microphone arrays
Sound source localization (SSL) has been an active research topic in multi-channel signal processing for many years and could benefit from the emergence of data-driven approaches. In this paper, we present our recent developments on the use of a deep neural network, fed with raw multichannel audio, to achieve sound source localization in reverberant and noisy environments. This paradigm avoids the simplifying assumptions that most traditional localization methods incorporate through source models and propagation models. However, for an efficient training process, supervised machine learning algorithms rely on large and precisely labelled datasets. There is therefore a critical need to generate a large amount of audio data recorded by microphone arrays in various environments. When the dataset is built either with numerical simulations or with experimental 3D sound field synthesis, physical validity is also critical. We therefore present an efficient GPU-based tensor computation of synthetic room impulse responses using fractional delays for image source models, and analyze the localization performance of the proposed neural network fed with this dataset, which yields a significant improvement in SSL accuracy over the traditional MUSIC and SRP-PHAT methods.
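The role of fractional delays in a synthetic room impulse response can be sketched for a single arrival. The example places a Hann-windowed sinc at a non-integer sample position. This is a common way to realise a fractional delay, not necessarily the authors' method, and it runs on the CPU with NumPy rather than on a GPU. The function name, the filter half-width, the sampling rate, the speed of sound, and the path length are all assumptions.

```python
import numpy as np

def fractional_delay_impulse(delay, length, half_width=16):
    """Hann-windowed sinc impulse centred on a non-integer sample delay."""
    h = np.zeros(length)
    center = int(np.floor(delay))
    n = np.arange(max(center - half_width, 0),
                  min(center + half_width + 1, length))
    # The window is centred on the true (fractional) delay, so it tapers
    # the sinc symmetrically around the arrival time.
    window = 0.5 * (1.0 + np.cos(np.pi * (n - delay) / (half_width + 1)))
    h[n] = np.sinc(n - delay) * window
    return h

# One image source at an assumed path length of 3.0 m.
fs = 16000.0      # sampling rate in Hz (assumed)
c = 343.0         # speed of sound in m/s (assumed)
distance = 3.0    # source-to-microphone path length in m (assumed)
delay = distance / c * fs
arrival = fractional_delay_impulse(delay, length=512) / (4.0 * np.pi * distance)
```

Summing such arrivals over all image sources gives one impulse response. Rounding each delay to the nearest sample instead would distort the small inter-microphone time differences that localization relies on, which is one reason fractional delays matter for physical validity.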