Regression and Classification for Direction-of-Arrival Estimation with Convolutional Recurrent Neural Networks
We present a novel learning-based approach to estimate the
direction-of-arrival (DOA) of a sound source using a convolutional recurrent
neural network (CRNN) trained via regression on synthetic data and Cartesian
labels. We also describe an improved method to generate synthetic data to train
the neural network using state-of-the-art sound propagation algorithms that
model specular as well as diffuse reflections of sound. We compare our model
against three other CRNNs trained using different formulations of the same
problem: classification on categorical labels, and regression on spherical
coordinate labels. In practice, our model achieves up to 43% decrease in
angular error over prior methods. The use of diffuse reflection results in 34%
and 41% reduction in angular prediction errors on LOCATA and SOFA datasets,
respectively, over prior methods based on image-source methods. Our method
results in an additional 3% error reduction over prior schemes that use
classification-based networks, while using 36% fewer network parameters.
Direction of Arrival Estimation with Microphone Arrays Using SRP-PHAT and Neural Networks
The Steered Response Power with Phase Transform (SRP-PHAT) is one of the most widely used techniques for Direction of Arrival (DOA) estimation with microphone arrays, but its computational complexity grows as the search space increases. To address this issue, we propose the use of Neural Networks (NN) to obtain the DOA from low-resolution SRP-PHAT power maps.
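A minimal sketch of the SRP-PHAT power map such a network would consume; the function names and the integer-lag discretisation are our simplifications, not the paper's implementation:

```python
import numpy as np

def gcc_phat(sig_a, sig_b, n_fft=1024):
    """GCC-PHAT cross-correlation: whiten the cross-spectrum so that
    only phase (i.e. time-delay) information remains."""
    A = np.fft.rfft(sig_a, n_fft)
    B = np.fft.rfft(sig_b, n_fft)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12  # phase transform
    return np.fft.irfft(cross, n_fft)

def srp_phat_map(signals, mic_pairs, lags_per_direction, n_fft=1024):
    """Steered response power: for each candidate direction, sum the
    GCC-PHAT values at the sample lags that direction implies for each
    microphone pair.  Cost grows with the number of candidate
    directions, which is why a low-resolution map is attractive."""
    ccs = [gcc_phat(signals[i], signals[j], n_fft) for i, j in mic_pairs]
    return np.array([sum(cc[lag % n_fft] for cc, lag in zip(ccs, lags))
                     for lags in lags_per_direction])
```

The map's peak indicates the most likely direction; the paper's contribution is to let a network read the whole coarse map instead of searching a fine grid.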
Multi-scale aggregation of phase information for reducing computational cost of CNN based DOA estimation
In a recent work on direction-of-arrival (DOA) estimation of multiple
speakers with convolutional neural networks (CNNs), the phase component of
short-time Fourier transform (STFT) coefficients of the microphone signal is
given as input and small filters are used to learn the phase relations between
neighboring microphones. Due to this chosen filter size, the number of
convolution layers required to achieve the best performance grows with the
number of microphones M. For arrays with a large number of microphones, this
requirement leads to a high computational cost, making the method practically
infeasible. In
this work, we propose to use systematic dilations of the convolution filters in
each of the convolution layers of the previously proposed CNN for expansion of
the receptive field of the filters to reduce the computational cost of the
method. Different strategies for expansion of the receptive field of the
filters for a specific microphone array are explored. With experimental
analysis of the different strategies, it is shown that an aggressive expansion
strategy results in a considerable reduction in computational cost, while a
relatively gradual expansion of the receptive field exhibits the best DOA
estimation performance along with a reduction in computational cost.
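The trade-off between layer count and receptive field can be made concrete with a small receptive-field calculator; the dilation schedules below are illustrative, not the paper's exact configurations:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input positions, e.g. microphones) of a
    stack of stride-1 dilated convolutions, one dilation per layer."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Small 2-tap filters without dilation: 7 layers to span 8 microphones.
undilated = receptive_field(2, [1] * 7)   # -> 8

# Systematic dilation: the same span with only 3 layers, i.e. far
# fewer convolutions for large arrays.
dilated = receptive_field(2, [1, 2, 4])   # -> 8
```

This is the mechanism behind the cost reduction: dilations let the receptive field grow faster than linearly in the number of layers.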
SELD-TCN: Sound Event Localization & Detection via Temporal Convolutional Networks
The understanding of the surrounding environment plays a critical role in
autonomous robotic systems, such as self-driving cars. Extensive research has
been carried out concerning visual perception. Yet, to obtain a more complete
perception of the environment, autonomous systems of the future should also
take acoustic information into account. Recent sound event localization and
detection (SELD) frameworks utilize convolutional recurrent neural networks
(CRNNs). However, considering the recurrent nature of CRNNs, it becomes
challenging to implement them efficiently on embedded hardware. Not only are
their computations strenuous to parallelize, but they also require high memory
bandwidth and large memory buffers. In this work, we develop a novel, more
robust and hardware-friendly architecture based on a temporal convolutional
network (TCN). The proposed framework (SELD-TCN) outperforms the
state-of-the-art SELDnet performance on four different datasets. Moreover,
SELD-TCN achieves 4x faster training time per epoch and 40x faster inference
time on an ordinary graphics processing unit (GPU). Comment: 5 pages, 3 tables, 2 figures. Submitted to EUSIPCO 202
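The parallelization argument rests on causal dilated convolutions, the building block of a TCN: the output at every time step can be computed at once, whereas an RNN must process frames sequentially. A minimal NumPy sketch of that building block (not the full SELD-TCN architecture):

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation=1):
    """y[t] = sum_i weights[i] * x[t - i*dilation].  The output at time
    t never looks at future samples, and all t are computed in one
    vectorized pass -- the property that makes TCNs easy to
    parallelize, unlike recurrent layers."""
    k, n = len(weights), len(x)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # zero-pad the past
    return sum(w * xp[pad - i * dilation : pad - i * dilation + n]
               for i, w in enumerate(weights))
```

Stacking such layers with growing dilations gives a long temporal context while keeping memory access patterns regular, which is what makes the architecture friendly to embedded hardware.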
A Sequence Matching Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event detection and direction-of-arrival estimation require
different input features from audio signals. While sound event detection mainly
relies on time-frequency patterns, direction-of-arrival estimation relies on
magnitude or phase differences between microphones. Previous approaches use the
same input features for sound event detection and direction-of-arrival
estimation, and train the two tasks jointly or in a two-stage transfer-learning
manner. We propose a two-step approach that decouples the learning of the sound
event detection and direction-of-arrival estimation systems. In the first
step, we detect the sound events and estimate the directions-of-arrival
separately to optimize the performance of each system. In the second step, we
train a deep neural network to match the two output sequences of the event
detector and the direction-of-arrival estimator. This modular and hierarchical
approach allows flexibility in the system design and increases the performance
of the whole sound event localization and detection system. The
experimental results using the DCASE 2019 sound event localization and
detection dataset show an improved performance compared to the previous
state-of-the-art solutions. Comment: to be published in 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP)
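For intuition, here is a hand-written stand-in for the matching step; the paper trains a deep network to do this, so the rule below is only a naive baseline with made-up names:

```python
import numpy as np

def match_sequences(event_probs, doa_estimates, threshold=0.5):
    """Pair the event detector's per-frame activity with the DOA
    estimator's per-frame directions: keep a direction only in frames
    where the event is detected as active, NaN elsewhere."""
    active = np.asarray(event_probs) >= threshold
    doa = np.asarray(doa_estimates, dtype=float)
    return np.where(active[:, None], doa, np.nan)
```

A learned matcher can go well beyond this thresholding rule, e.g. tolerating misaligned onsets between the two sequences, which is where the reported improvement comes from.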
Detecting multiple, simultaneous talkers through localising speech recorded by ad-hoc microphone arrays
This paper proposes a novel approach to detecting multiple, simultaneous talkers in multi-party meetings using localisation of active speech sources recorded with an ad-hoc microphone array. Cues indicating the relative distance between sources and microphones are derived from speech signals and room impulse responses recorded by each of the microphones distributed at unknown locations within a room. Multiple active sources are localised by analysing a surface formed from these cues and derived at different locations within the room. The number of localised active sources per frame or utterance is then counted to estimate when multiple sources are active. The proposed approach does not require prior information about the number and locations of sources or microphones. Synchronisation between microphones is also not required. A meeting scenario with competing speakers is simulated, and results show that simultaneously active sources can be detected with an average accuracy of 75% and the number of active sources counted accurately 65% of the time.
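The per-frame counting step can be illustrated with a toy peak counter over a 1-D slice of such a localisation surface; the real method analyses a surface over room locations, so this simplification is ours:

```python
def count_active_sources(surface, threshold):
    """Count active sources in one frame as local maxima of the
    localisation surface that exceed a threshold; a count above one
    flags simultaneously active talkers."""
    peaks = [i for i in range(1, len(surface) - 1)
             if surface[i] > surface[i - 1]
             and surface[i] > surface[i + 1]
             and surface[i] >= threshold]
    return len(peaks)
```

Counting peaks per frame and then flagging frames where the count exceeds one mirrors the detection-by-counting logic described in the abstract.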