Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates
This paper presents a novel approach for indoor acoustic source localization
using microphone arrays and based on a Convolutional Neural Network (CNN). The
proposed solution is, to the best of our knowledge, the first published work in
which the CNN is designed to directly estimate the three-dimensional position
of an acoustic source, using the raw audio signal as the input and avoiding
the use of hand-crafted audio features. Given the limited amount of
available localization data, we propose a two-step training strategy. We
first train our network using semi-synthetic data, generated
from close-talk speech recordings, in which we simulate the time delays and
distortion suffered by the signal as it propagates from the source to the
microphone array. We then fine-tune this network using a small amount of real
data. Our experimental results show that this strategy produces networks that
significantly outperform existing localization methods based on SRP-PHAT
strategies. In addition, our experiments show that our CNN method is more
robust to varying speaker gender and to different window sizes than the other
methods.
Comment: 18 pages, 3 figures, 8 tables
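The two-step strategy above hinges on semi-synthetic data: close-talk recordings are delayed and attenuated per microphone to mimic propagation to the array. A minimal sketch of that simulation step is below; the function names and the simple free-field 1/distance attenuation model are my own illustrative assumptions, not the paper's exact simulation.

```python
import numpy as np

def fractional_delay(signal, delay_samples):
    """Delay a signal by a (possibly fractional) number of samples via a
    linear phase shift in the frequency domain (circular at the edges)."""
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n)            # cycles per sample
    spectrum *= np.exp(-2j * np.pi * freqs * delay_samples)
    return np.fft.irfft(spectrum, n)

def simulate_array_capture(dry_speech, src_pos, mic_positions, fs, c=343.0):
    """Semi-synthetic capture: delay and attenuate a close-talk recording
    for each microphone according to its distance from the source."""
    channels = []
    for mic in mic_positions:
        dist = np.linalg.norm(src_pos - mic)      # metres
        delay = dist / c * fs                     # samples
        ch = fractional_delay(dry_speech, delay) / max(dist, 1e-3)
        channels.append(ch)
    return np.stack(channels)                     # (n_mics, n_samples)
```

Real training data would also need the reverberation and distortion the paper mentions; this covers only the time-delay and attenuation part.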
Acoustic Source Localisation in constrained environments
Acoustic Source Localisation (ASL) is a problem with real-world applications
across multiple domains, from smart assistants to acoustic detection and tracking.
And yet, despite the attention the problem has received in recent years, a
technique for rapid and robust ASL remains elusive – not least in the
constrained environments in which such techniques are most likely to be
deployed.
In this work, we seek to address some of these current limitations by presenting
improvements to the ASL method for three commonly encountered constraints: the
number and configuration of sensors; the limited signal sampling potentially available;
and the nature and volume of training data required to accurately estimate Direction
of Arrival (DOA) when deploying a particular supervised machine learning technique.
In regard to the number and configuration of sensors, we find that accuracy can
be maintained at the level of the state-of-the-art Steered Response Power (SRP)
method while reducing computation sixfold, based on direct optimisation of
well-known ASL formulations.
Moreover, we find that the circular microphone configuration is the least desirable
as it yields the highest localisation error.
In regard to signal sampling, we demonstrate that the computer-vision-inspired
algorithm presented in this work, which extracts selected keypoints from the
signal spectrogram and uses them to select signal samples, outperforms an audio
fingerprinting baseline while maintaining a compression ratio of 40:1.
In regard to the training data employed in machine learning ASL techniques,
we show that the use of music training data yields an improvement of 19% against
a noise data baseline while maintaining accuracy using only 25% of the training
data, while training with speech as opposed to noise improves DOA estimation by
an average of 17%, outperforming the Generalised Cross-Correlation technique by
125% in scenarios in which the test and training acoustic environments are matched.
Heriot-Watt University James Watt Scholarship (JWS) in the School of Engineering & Physical Sciences
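The keypoint idea above (treating the spectrogram as an image and keeping only salient time-frequency peaks, so that only a small subset of samples needs to be retained) can be sketched roughly as follows. The STFT parameters and the 3x3 local-maximum neighbourhood are illustrative assumptions, not the thesis's actual settings.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a basic Hann-windowed STFT."""
    win = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * win
              for i in range(0, len(signal) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def select_keypoints(spec, n_keep):
    """Keep the n_keep strongest time-frequency bins that are also
    local maxima within their 3x3 neighbourhood."""
    f, t = spec.shape
    padded = np.pad(spec, 1, mode="constant")
    # stack all nine shifted views; a bin is a peak if it attains the max
    neigh = np.stack([padded[di:di + f, dj:dj + t]
                      for di in range(3) for dj in range(3)])
    is_peak = spec >= neigh.max(axis=0)
    peaks = np.argwhere(is_peak)
    order = np.argsort(spec[is_peak])[::-1]   # strongest first
    return peaks[order[:n_keep]]              # rows of (freq_bin, frame)
```

Retaining only the samples around such keypoints is what gives the compression; the 40:1 ratio quoted above would correspond to the fraction of the signal kept.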
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio
rendering for 3D environments. Given a 3D mesh of a real-world environment,
SoundSpaces can generate highly realistic acoustics for arbitrary sounds
captured from arbitrary microphone locations. Together with existing 3D visual
assets, it supports an array of audio-visual research tasks, such as
audio-visual navigation, mapping, source localization and separation, and
acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the
advantages of allowing continuous spatial sampling, generalization to novel
environments, and configurable microphone and material properties. To our
knowledge, this is the first geometry-based acoustic simulation that offers
high fidelity and realism while also being fast enough to use for embodied
learning. We showcase the simulator's properties and benchmark its performance
against real-world audio measurements. In addition, we demonstrate two
downstream tasks -- embodied navigation and far-field automatic speech
recognition -- and highlight sim2real performance for the latter. SoundSpaces
2.0 is publicly available to facilitate wider research for perceptual systems
that can both see and hear.
Comment: Camera-ready version. Website: https://soundspaces.org. Project page:
https://vision.cs.utexas.edu/projects/soundspaces
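What a geometry-based renderer of this kind ultimately delivers for a given source/listener pose is a room impulse response; rendering a sound at the listener then reduces to convolution. The toy illustration below uses invented RIR taps purely for illustration, not SoundSpaces output.

```python
import numpy as np

def render_at_listener(dry_sound, rir):
    """Render a dry source signal at a listener position by convolving
    it with the (simulated or measured) room impulse response."""
    return np.convolve(dry_sound, rir)

# Toy RIR at 16 kHz: direct path plus two attenuated, delayed reflections.
fs = 16000
rir = np.zeros(fs // 10)
rir[0] = 1.0          # direct sound
rir[400] = 0.5        # early reflection (~25 ms)
rir[1200] = 0.25      # later reflection (~75 ms)
```

In the actual platform the RIR is computed on the fly from the 3D mesh and the configured microphone and material properties; the convolution step is the same.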
Locating and extracting acoustic and neural signals
This dissertation presents innovative methodologies for locating, extracting, and separating multiple incoherent sound sources in three-dimensional (3D) space, and applications of the time reversal (TR) algorithm to pinpoint the hyperactive neural activity inside the brain auditory structure that is correlated to the tinnitus pathology. Specifically, an acoustic-modeling-based method is developed for locating arbitrary and incoherent sound sources in 3D space in real time by using a minimal number of microphones, and the Point Source Separation (PSS) method is developed for extracting target signals from directly measured mixed signals. Combining these two approaches leads to a novel technology known as Blind Sources Localization and Separation (BSLS) that enables one to locate multiple incoherent sound signals in 3D space and separate the original individual sources simultaneously, based on the directly measured mixed signals. These technologies have been validated through numerical simulations and experiments conducted in various non-ideal environments with non-negligible, unspecified sound reflections and reverberation, as well as interference from random background noise. Another innovation presented in this dissertation concerns applications of the TR algorithm to pinpoint the exact locations of hyperactive neurons in the brain auditory structure that are directly correlated to the tinnitus perception. Benchmark tests conducted on normal rats have confirmed the localization results provided by the TR algorithm. Results demonstrate that the spatial resolution of this source localization can be as high as the micrometer level. This high-precision localization may lead to a paradigm shift in tinnitus diagnosis, which may in turn produce a more cost-effective treatment for tinnitus than any of the existing ones.
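Model-based 3D localization with few microphones, as described above, ultimately reduces to matching measured inter-microphone time differences of arrival (TDOAs) against those each candidate source position would produce. The sketch below uses a plain grid search for that matching step; it is a hedged stand-in, not the dissertation's actual real-time method.

```python
import numpy as np

def tdoa_localize(mics, tdoas, grid, c=343.0):
    """Locate a source by matching measured TDOAs (taken relative to
    mic 0, in seconds) against those predicted at each grid point."""
    best, best_err = None, np.inf
    for p in grid:
        dists = np.linalg.norm(mics - p, axis=1)   # metres to each mic
        pred = (dists[1:] - dists[0]) / c          # predicted TDOAs
        err = np.sum((pred - tdoas) ** 2)
        if err < best_err:
            best, best_err = p, err
    return best
```

With four non-coplanar microphones the three TDOAs generically pin down a unique 3D position, which is why a minimal array suffices in principle.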
On the performance of multi-GPU-based expert systems for acoustic localization involving massive microphone arrays
Sound source localization is an important topic in expert systems involving microphone arrays, such as automatic camera steering systems, human-machine interaction, video gaming or audio surveillance. The Steered Response Power with Phase Transform (SRP-PHAT) algorithm is a well-known approach for sound source localization due to its robust performance in noisy and reverberant environments. This algorithm analyzes the sound power captured by an acoustic beamformer on a defined spatial grid, estimating the source location as the point that maximizes the output power. Since localization accuracy can be improved by using high-resolution spatial grids and a high number of microphones, accurate acoustic localization systems require high computational power. Graphics Processing Units (GPUs) are highly parallel programmable co-processors that provide massive computation when the needed operations are properly parallelized. Emerging GPUs offer multiple parallelism levels; however, properly managing their computational resources becomes a very challenging task. In fact, management issues become even more difficult when multiple GPUs are involved, adding one more level of parallelism. In this paper, the performance of an acoustic source localization system using distributed microphones is analyzed over a massive multichannel processing framework in a multi-GPU system. The paper evaluates and points out the influence that the number of microphones and the available computational resources have in the overall system performance. Several acoustic environments are considered to show the impact that noise and reverberation have in the localization accuracy and how the use of massive microphone systems combined with parallelized GPU algorithms can help to mitigate substantially adverse acoustic effects. In this context, the proposed implementation is able to work in real time with high-resolution spatial grids and using up to 48 microphones. 
These results confirm the advantages of suitable GPU architectures in the development of real-time massive acoustic signal processing systems.
This work has been partially funded by the Spanish Ministerio de Economia y Competitividad (TEC2009-13741, TEC2012-38142-C04-01, and TEC2012-37945-C02-02), Generalitat Valenciana PROMETEO 2009/2013, and Universitat Politecnica de Valencia through Programa de Apoyo a la Investigacion y Desarrollo (PAID-05-11 and PAID-05-12).
Belloch Rodríguez, JA.; Gonzalez, A.; Vidal Maciá, AM.; Cobos Serrano, M. (2015). On the performance of multi-GPU-based expert systems for acoustic localization involving massive microphone arrays. Expert Systems with Applications. 42(13):5607-5620. https://doi.org/10.1016/j.eswa.2015.02.056
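The SRP-PHAT pipeline the paper parallelizes has a simple serial core: compute a GCC-PHAT correlation per microphone pair, then, for each grid point, accumulate the correlation values at the pairwise lags that point predicts. The single-threaded NumPy sketch below is a CPU stand-in for the paper's multi-GPU implementation, shown only to make the algorithm concrete.

```python
import numpy as np

def gcc_phat(x, y):
    """GCC-PHAT: the cross-spectrum is whitened to unit magnitude so
    only phase (delay) information survives, which makes the peak
    robust to noise and reverberation."""
    n = len(x) + len(y)
    cross = np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n))
    cross /= np.abs(cross) + 1e-12          # phase transform
    return np.fft.irfft(cross, n)           # correlation over lags

def srp_phat(frames, mics, grid, fs, c=343.0):
    """Steered Response Power: sum, for each candidate grid point, the
    GCC-PHAT values at the pairwise lags it predicts; the point with
    the highest accumulated power is the location estimate."""
    n_mics = len(mics)
    pairs = [(i, j) for i in range(n_mics) for j in range(i + 1, n_mics)]
    ccs = {p: gcc_phat(frames[p[0]], frames[p[1]]) for p in pairs}
    power = np.zeros(len(grid))
    for g, point in enumerate(grid):
        dists = np.linalg.norm(mics - point, axis=1)
        for i, j in pairs:
            lag = int(round((dists[i] - dists[j]) / c * fs))
            power[g] += ccs[(i, j)][lag]    # negative lags wrap around
    return grid[np.argmax(power)]
```

The cost structure the paper exploits is visible here: the grid loop and the per-pair lookups are embarrassingly parallel, which is why high-resolution grids and many microphones map well onto GPUs.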
SoundCam: A Dataset for Finding Humans Using Room Acoustics
A room's acoustic properties are a product of the room's geometry, the
objects within the room, and their specific positions. A room's acoustic
properties can be characterized by its room impulse response (RIR) between a source
and listener location, or roughly inferred from recordings of natural signals
present in the room. Variations in the positions of objects in a room can
effect measurable changes in the room's acoustic properties, as characterized
by the RIR. Existing datasets of RIRs either do not systematically vary
positions of objects in an environment, or they consist of only simulated RIRs.
We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms
publicly released to date. It includes 5,000 10-channel real-world measurements
of room impulse responses and 2,000 10-channel recordings of music in three
different rooms, including a controlled acoustic lab, an in-the-wild living
room, and a conference room, with different humans in positions throughout each
room. We show that these measurements can be used for interesting tasks, such
as detecting and identifying humans, and tracking their positions.
Comment: In NeurIPS 2023 Datasets and Benchmarks Track. Project page:
https://masonlwang.com/soundcam/. Wang and Clarke contributed equally to this
work.
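A natural first baseline for the detection task this dataset supports is to compare a measured RIR against an empty-room reference: a person in the acoustic path absorbs and scatters energy, changing the RIR. The scoring function below is an illustrative assumption of mine, not the paper's baseline.

```python
import numpy as np

def rir_change_score(baseline_rir, measured_rir):
    """Score how much a measured RIR deviates from an empty-room
    baseline, as residual energy normalized by the baseline energy.
    A large score suggests a person (or object) entered the room."""
    n = min(len(baseline_rir), len(measured_rir))
    a, b = baseline_rir[:n], measured_rir[:n]
    diff = a - b
    return np.sum(diff ** 2) / (np.sum(a ** 2) + 1e-12)
```

Thresholding this score gives presence detection; identifying or tracking a person, as the dataset's harder tasks require, needs learned models over the 10-channel measurements.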
Proceedings of the EAA Spatial Audio Signal Processing symposium: SASP 2019
International audience