94 research outputs found

    Sound Source Localization and Modeling: Spherical Harmonics Domain Approaches

    Sound source localization has been an important research topic in the acoustic signal processing community because of its wide use in many acoustic applications, including speech separation, speech enhancement, sound event detection, automatic speech recognition, automated camera steering, and virtual reality. In the past decade, there has been growing interest in sound source localization using higher-order microphone arrays, which are capable of recording and analyzing the soundfield over a target spatial area. This thesis studies a novel source feature called the relative harmonic coefficient, which is easily estimated from the higher-order microphone measurements. This source feature has direct applications to sound source localization because it depends solely on the source position. This thesis proposes two novel sound source localization algorithms using the relative harmonic coefficients: (i) a low-complexity single-source localization approach that localizes the source's elevation and azimuth separately. This approach is also applicable to acoustic enhancement of higher-order microphone array recordings; (ii) a semi-supervised multi-source localization algorithm for noisy and reverberant environments. Although this approach uses a learning scheme, it still has strong potential to be implemented in practice because only a limited number of labeled measurements are required. However, this algorithm has an inherent limitation: it requires the availability of single-source components. Thus, it is unusable in scenarios where the original recordings contain few single-source components (e.g., multiple sources simultaneously active). To address this issue, we develop a novel MUSIC-based approach that directly uses simultaneous multi-source recordings.
This MUSIC approach uses robust measurements of relative sound pressure from the higher-order microphone and is shown to be more suitable in noisy environments than the traditional MUSIC method. While the proposed approaches address the source localization problem, in practice the broader problem of source localization involves further common challenges that have received less attention. One such challenge is the common assumption that sound sources are omnidirectional, which is hardly the case for a typical commercial loudspeaker. Therefore, this thesis also analyzes the directional characteristics of commercial loudspeakers by deriving equivalent theoretical acoustic models. Several acoustic models are investigated, including plane-wave decomposition, point-source decomposition, and mixed-source decomposition. Finally, we conduct extensive experimental examinations to see which acoustic model best matches the characteristics of commercial loudspeakers.
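The key property of the relative harmonic coefficient (dependence on source position only, not on the source signal) can be illustrated with a small far-field sketch: the sound-field coefficients of a plane wave are proportional to conjugated spherical harmonics at the source direction, so normalising by the zeroth-order coefficient cancels the source amplitude. This is a hedged toy illustration, not the thesis' estimator; the function names are ours.

```python
import numpy as np
from scipy.special import sph_harm

def plane_wave_sh_coeffs(order, azi, col, amplitude=1.0):
    # Sound-field coefficients of a far-field source at (azimuth, colatitude):
    # alpha_nm = A * 4*pi * i^n * conj(Y_nm), stacked in (n, m) order.
    return np.array([amplitude * 4 * np.pi * (1j ** n)
                     * np.conj(sph_harm(m, n, azi, col))
                     for n in range(order + 1) for m in range(-n, n + 1)])

def relative_harmonic_coeffs(alpha):
    # Normalising by the zeroth-order coefficient cancels the (unknown)
    # source amplitude, leaving a feature that depends only on position.
    return alpha / alpha[0]
```

Scaling the source amplitude leaves the relative coefficients unchanged, while moving the source changes them, which is what makes the feature usable for localization.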

    Parametric first-order ambisonic decoding for headphones utilising the cross-pattern coherence algorithm

    Regarding the reproduction of recorded or synthesised spatial sound scenes, perhaps the most convenient and flexible approach is to employ the Ambisonics framework. The Ambisonics framework allows for linear and non-parametric storage, manipulation and reproduction of sound-fields, described using spherical harmonics up to a given order of expansion. Binaural Ambisonic reproduction can be realised by matching the spherical harmonic patterns to a set of binaural filters, in a manner which is frequency-dependent, linear and time-invariant. However, the perceptual performance of this approach is largely dependent on the spatial resolution of the input format. When employing lower-order material as input, perceptual deficiencies may easily occur, such as poor localisation accuracy and colouration. This is especially problematic, as the vast majority of existing Ambisonic recordings are available as first-order only. The detrimental effects associated with lower-order Ambisonic reproduction have been well studied and documented. To improve the perceived spatial accuracy of the method, the simplest solution is to increase the spherical harmonic order at the recording stage. However, microphone arrays capable of capturing higher-order components are generally much more expensive than first-order arrays, while more affordable options tend to offer higher-order components only over limited frequency ranges. Additionally, an increase in spherical harmonic order also requires an increase in the number of channels and storage, and in the case of transmission, more bandwidth is needed. Furthermore, this solution does not aid the reproduction of existing lower-order recordings. It is for these reasons that this work focuses on alternative methods which improve the reproduction of first-order material for headphone playback.
For the task of binaural sound-field reproduction, an alternative is to employ a parametric approach, which divides the sound-field decoding into analysis and synthesis stages. Unlike Ambisonic reproduction, which operates via a linear combination of the input signals, parametric approaches operate in the time-frequency domain and rely on the extraction of spatial parameters during their analysis stage. These spatial parameters are then utilised to conduct a more informed reproduction in the synthesis stage. Parametric methods are capable of reproducing sounds at a spatial resolution that far exceeds that of their linear and time-invariant counterparts, as they are not bounded by the resolution of the input format. For example, they can elect to directly convolve the analysed source signals with Head-Related Transfer Functions (HRTFs) corresponding to their analysed directions. An infinite order of spherical harmonic components would be required to attain the same resolution with a binaural Ambisonic decoder. The most well-known and established parametric reproduction method is Directional Audio Coding (DirAC), which employs a sound-field model consisting of one plane-wave and one diffuseness estimate per time-frequency tile. These parameters are derived from the active-intensity vector in the case of first-order input. More recent formulations allow for multiple plane-wave and diffuseness estimates via spatially-localised active-intensity vectors, using higher-order input. Another parametric method is High Angular Resolution plane-wave Expansion (HARPEX), which extracts two plane-waves per frequency and is first-order only. The Sparse-Recovery method extracts a number of plane-waves corresponding to up to half the number of input channels, for input of arbitrary order.
The COding and Multi-Parameterisation of Ambisonic Sound Scenes (COMPASS) method also extracts source components up to half the number of input channels, but employs an additional residual stream that encapsulates the remaining diffuse and ambient components in the scene. In this paper, a new binaural parametric decoder for first-order input is proposed. The method employs a sound-field model of one plane-wave and one diffuseness estimate per frequency, much like the DirAC model. However, the source component directions are identified via a plane-wave decomposition using a dense scanning grid and peak-finding, which is shown to be more robust than the active-intensity vector for multiple narrow-band sources. The source and ambient components per time-frequency tile are then segregated, and their relative energetic contributions are established, using the Cross-Pattern Coherence (CroPaC) spatial filter. This approach is shown to be more robust than deriving this energy information from the active-intensity-based diffuseness estimates. A real-time audio plug-in implementation of the proposed approach is also described. A multiple-stimulus listening test was conducted to evaluate the perceived spatial accuracy and fidelity of the proposed method, alongside both first-order and third-order Ambisonics reproduction. The listening test results indicate that the proposed parametric decoder, using only first-order signals, is capable of delivering perceptual accuracy that matches or surpasses that of third-order Ambisonics decoding.
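The DirAC-style analysis mentioned above (one plane wave plus one diffuseness estimate per time-frequency tile, derived from the active intensity) can be sketched for first-order B-format input as follows. This is a generic textbook-style formulation under common conventions, not the paper's CroPaC-based decoder; signal scaling conventions vary between formulations and the function name is ours.

```python
import numpy as np

def dirac_analysis(W, X, Y, Z):
    """First-order DirAC-style analysis for one frequency bin.

    W, X, Y, Z: complex STFT frames, shape (T,), of the B-format signals.
    Returns a DoA unit vector and a diffuseness value in [0, 1].
    """
    v = np.stack([X, Y, Z], axis=-1)                  # velocity-like vector
    I = np.real(np.conj(W)[:, None] * v)              # active intensity / frame
    E = 0.5 * (np.abs(W) ** 2 + np.sum(np.abs(v) ** 2, axis=-1))  # energy
    I_avg, E_avg = I.mean(axis=0), E.mean()
    doa = -I_avg / (np.linalg.norm(I_avg) + 1e-12)    # DoA opposes energy flow
    psi = 1.0 - np.linalg.norm(I_avg) / (E_avg + 1e-12)  # diffuseness
    return doa, psi
```

A single coherent plane wave yields diffuseness near zero, while uncorrelated pressure and velocity channels push it toward one; the proposed decoder replaces this energy analysis with the CroPaC spatial filter.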

    Proceedings of the EAA Spatial Audio Signal Processing symposium: SASP 2019


    A robust sequential hypothesis testing method for brake squeal localisation

    This contribution deals with the in situ detection and localisation of brake squeal in an automobile. As brake squeal is emitted from regions known a priori, i.e., near the wheels, the localisation is treated as a hypothesis testing problem. Distributed microphone arrays, situated under the automobile, are used to capture the directional properties of the sound field generated by a squealing brake. The spatial characteristics of the sampled sound field are then used to formulate the hypothesis tests. However, in contrast to standard hypothesis testing approaches of this kind, the propagation environment is complex and time-varying. Coupled with inaccuracies in the knowledge of the sensor and source positions as well as sensor gain mismatches, modelling the sound field is difficult and standard approaches fail in this case. A previously proposed approach implicitly tried to account for such incomplete system knowledge and was based on ad hoc likelihood formulations. The current paper builds upon this approach and proposes a second approach, based on more solid theoretical foundations, that can systematically account for the model uncertainties. Results from tests in a real setting show that the proposed approach is more consistent than the prior state of the art. In both approaches, the tasks of detection and localisation are decoupled for complexity reasons. The localisation (hypothesis testing) is subject to a prior detection of brake squeal and identification of the squeal frequencies. The approaches used for the detection and identification of squeal frequencies are also presented. The paper further briefly addresses some practical issues related to array design and placement. © 2019 Author(s).
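The hypothesis-testing formulation can be sketched with a free-field toy model: each a-priori wheel position defines one hypothesis, and a simple test statistic is the steered-response power at the detected squeal frequency. The paper's actual likelihood formulations account for the model uncertainties this sketch ignores; the free-field propagation model and all names below are our assumptions.

```python
import numpy as np

def steering_vector(mic_pos, src_pos, freq, c=343.0):
    # Free-field response from a candidate squeal position to each mic.
    d = np.linalg.norm(mic_pos - src_pos, axis=1)
    return np.exp(-2j * np.pi * freq * d / c) / d

def localise_squeal(R, mic_pos, candidates, freq):
    """Pick the wheel hypothesis with the largest steered response power.

    R: (M, M) narrowband spatial covariance at the detected squeal frequency.
    candidates: list of a-priori wheel positions (the hypotheses).
    """
    powers = []
    for pos in candidates:
        a = steering_vector(mic_pos, np.asarray(pos), freq)
        a = a / np.linalg.norm(a)
        powers.append(np.real(a.conj() @ R @ a))  # conventional beamformer power
    return int(np.argmax(powers)), powers
```

As in the paper, this localisation step would only run after a separate detector has flagged squeal and identified its frequencies.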

    Wavefield modeling and signal processing for sensor arrays of arbitrary geometry

    Sensor arrays and related signal processing methods are key technologies in many areas of engineering including wireless communication systems, radar and sonar as well as in biomedical applications. Sensor arrays are a collection of sensors that are placed at distinct locations in order to sense physical phenomena or synthesize wavefields. Spatial processing from the multichannel output of the sensor array is a typical task. Such processing is useful in areas including wireless communications, radar, surveillance and indoor positioning. In this dissertation, fundamental theory and practical methods of wavefield modeling for radio-frequency array processing applications are developed. Also, computationally-efficient high-resolution and optimal signal processing methods for sensor arrays of arbitrary geometry are proposed. Methods for taking into account array nonidealities are introduced as well. Numerical results illustrating the performance of the proposed methods are given using real-world antenna arrays. Wavefield modeling and manifold separation for vector-fields such as completely polarized electromagnetic wavefields and polarization sensitive arrays are proposed. Wavefield modeling is used for writing the array output in terms of two independent parts, namely the sampling matrix depending on the employed array including nonidealities and the coefficient vector depending on the wavefield. The superexponentially decaying property of the sampling matrix for polarization sensitive arrays is established. Two estimators of the sampling matrix from calibration measurements are proposed and their statistical properties are established. The array processing methods developed in this dissertation concentrate on polarimetric beamforming as well as on high-resolution and optimal azimuth, elevation and polarization parameter estimation. 
The proposed methods take into account array nonidealities such as mutual coupling, cross-polarization effects and mounting platform reflections. Computationally efficient solutions based on polynomial rooting techniques and the fast Fourier transform are achieved without restricting the proposed methods to regular array geometries. A novel expression for the Cramér-Rao bound in array processing that is tight for real-world arrays with nonidealities in the asymptotic regime is also proposed. A relationship between spherical harmonics and the 2-D Fourier basis, called the equivalence matrix, is established. A novel fast spherical harmonic transform is proposed, and a one-to-one mapping between spherical harmonic and 2-D Fourier spectra is found. Improvements to the minimum number of samples on the sphere that are needed in order to avoid aliasing are also proposed.
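The wavefield-modeling idea above (array output factored into an array-dependent sampling matrix and a wavefield-dependent coefficient vector) can be sketched in the azimuth-only case, where the coefficient basis is a 2-D Fourier (Vandermonde) vector. This is a minimal manifold-separation sketch under idealised noise-free calibration; the dissertation's estimators, their statistics, and the vector-field extensions are far more general, and all names here are ours.

```python
import numpy as np

def fourier_basis(angles, n_modes):
    # Azimuthal Fourier basis d(theta): e^{j m theta}, |m| <= n_modes.
    m = np.arange(-n_modes, n_modes + 1)
    return np.exp(1j * np.outer(m, np.atleast_1d(angles)))  # (2B+1, K)

def estimate_sampling_matrix(A_cal, angles_cal, n_modes):
    """Least-squares estimate of G in a(theta) ~= G d(theta), from
    calibration responses A_cal (M, K) measured at angles_cal (K,)."""
    D = fourier_basis(angles_cal, n_modes)
    return A_cal @ np.linalg.pinv(D)

def interp_manifold(G, theta):
    # Steering vector at an arbitrary angle via the wavefield model.
    return G @ fourier_basis(theta, (G.shape[1] - 1) // 2)[:, 0]
```

Because the Fourier coefficients of a smooth array response decay rapidly (superexponentially for ideal apertures), a modest number of modes reproduces the manifold at angles never measured during calibration.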

    Design of Beam Steering Electronic Circuits for Medical Applications

    This thesis deals with the theory and design of a hemispherical antenna array circuit that is capable of operating in the intermediate zones. This allows the array to be used in hyperthermia treatment for brain cancer, in which the aim is to noninvasively focus the fields at microwave frequencies on the location of the tumor cells in the brain. Another possible application of the array is to offer an alternative means of sustaining deep brain stimulation other than the traditional (surgical) approach. The new noninvasive technique is accomplished by the use of a hemispherical antenna array placed on the human head. The array uses a new beamforming technique that achieves three-dimensional beamforming, or focusing of the magnetic field of the antennas at desired points in the brain, to achieve either cell death by temperature rise (hyperthermia) or brain stimulation that may alleviate the effects of Parkinson's disease (deep brain stimulation). The main obstacle in this design was that the far-field approximation usually used when designing antenna arrays does not apply in this case, since the hemispherical array is in close proximity to where the magnetic field is to be focused. The antenna array problem is approached as a boundary-value problem, with the human head modeled as a three-layered hemisphere. The exact expressions for the electromagnetic fields are derived. Health issues such as electric field exposure and specific absorption rate (SAR) are considered. After developing the main antenna and beamforming theory, a neural network is designed to accomplish the beamforming technique used. The radio-frequency (RF) transmitter was designed to transmit the fields at a frequency of 1.8 GHz. The antenna array can also be used as a receiver. The antenna and beamforming theory is presented.
A new reception technique is shown which enables the array to receive multiple magnetic field sources from within the hemispherical surface. The receiver is designed to operate at 500 kHz, with the RF receiver circuit designed to receive any signal from within the hemispherical surface at that frequency.
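The near-field obstacle described above (far-field approximations failing when the focus lies close to the array) is classically addressed by phase conjugation, also known as time reversal: weight each element with the conjugate of its propagation phase to the focus so all contributions add coherently there. The sketch below uses a homogeneous scalar Green's function and an assumed tissue propagation speed rather than the thesis' three-layer head model and neural-network beamformer; all names and parameter values are illustrative.

```python
import numpy as np

def greens(src_pos, obs_pt, k):
    # Scalar free-space Green's function exp(-j*k*R)/R, standing in for
    # the layered-hemisphere field expressions derived in the thesis.
    R = np.linalg.norm(src_pos - obs_pt, axis=-1)
    return np.exp(-1j * k * R) / R

def focusing_weights(elem_pos, focus_pt, k):
    # Phase conjugation: cancel the propagation phase of each path so
    # every element's contribution arrives in phase at the focus point.
    g = greens(elem_pos, focus_pt, k)
    return np.conj(g) / np.abs(g)

def field_at(elem_pos, w, obs_pt, k):
    # Total field radiated by the weighted array at an observation point.
    return np.sum(w * greens(elem_pos, obs_pt, k))
```

At the focus the weighted contributions sum with identical phase, so the field magnitude there exceeds that at nearby off-focus points, which is the basic mechanism behind the noninvasive heating and stimulation applications.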

    Signal compaction using polynomial EVD for spherical array processing with applications

    Multi-channel signals captured by spatially separated sensors often contain a high level of data redundancy. A compact signal representation enables more efficient storage and processing, which has been exploited for data compression, noise reduction, and speech and image coding. This paper focuses on the compact representation of speech signals acquired by spherical microphone arrays. A polynomial matrix eigenvalue decomposition (PEVD) can spatially decorrelate signals over a range of time lags and is known to achieve optimum multi-channel data compaction. However, the complexity of PEVD algorithms scales at best cubically with the number of channel signals, e.g., the number of microphones in a spherical array. In contrast, the spherical harmonic transform (SHT) provides a compact spatial representation of the three-dimensional sound field measured by spherical microphone arrays, referred to as eigenbeam signals, at a cost that rises only quadratically with the number of microphones. Yet, the SHT's spatially orthogonal basis functions cannot completely decorrelate sound field components over a range of time lags. In this work, we propose to exploit the compact representation offered by the SHT to reduce the number of channels used for subsequent PEVD processing. In the proposed framework for signal representation, we show that the diagonality factor improves by up to 7 dB over the microphone signal representation at a significantly lower computation cost. Moreover, when applying this framework to speech enhancement and source separation, the proposed method improves the short-time objective intelligibility (STOI) and source-to-distortion ratio (SDR) metrics by up to 0.2 and 20 dB, respectively.
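The channel-reduction step (mapping M microphone signals to (N+1)^2 eigenbeam signals before PEVD processing) can be sketched with a least-squares spherical harmonic transform. This assumes an idealised open-sphere sampling without the radial equalisation a real rigid-sphere array requires, and the names are ours, not the paper's implementation.

```python
import numpy as np
from scipy.special import sph_harm

def sh_basis(order, azi, col):
    # Rows: sampling directions; columns: (n, m) spherical harmonics.
    return np.array([[sph_harm(m, n, a, c) for n in range(order + 1)
                      for m in range(-n, n + 1)]
                     for a, c in zip(azi, col)])       # (M, (order+1)^2)

def sht_encoder(order, azi, col):
    # Least-squares SHT: maps M microphone signals to (order+1)^2
    # eigenbeam signals, the reduced channel set fed to the PEVD.
    return np.linalg.pinv(sh_basis(order, azi, col))   # ((order+1)^2, M)

def eigenbeams(mic_sigs, enc):
    return enc @ mic_sigs                              # fewer channels out
```

For a 32-microphone array and order 3, the PEVD then operates on 16 eigenbeam channels instead of 32 microphone channels, which is where the cubic-complexity saving comes from.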

    Efficient binaural rendering of virtual acoustic realities: technical and perceptual concepts

    Binaural rendering aims to immerse the listener in a virtual acoustic scene, making it an essential method for spatial audio reproduction in virtual or augmented reality (VR/AR) applications. The growing interest and research in VR/AR solutions yielded many different methods for the binaural rendering of virtual acoustic realities, yet all of them share the fundamental idea that the auditory experience of any sound field can be reproduced by reconstructing its sound pressure at the listener's eardrums. This thesis addresses various state-of-the-art methods for 3 or 6 degrees of freedom (DoF) binaural rendering, technical approaches applied in the context of headphone-based virtual acoustic realities, and recent technical and psychoacoustic research questions in the field of binaural technology. The publications collected in this dissertation focus on technical or perceptual concepts and methods for efficient binaural rendering, which has become increasingly important in research and development due to the rising popularity of mobile consumer VR/AR devices and applications. The thesis is organized into five research topics: Head-Related Transfer Function Processing and Interpolation, Parametric Spatial Audio, Auditory Distance Perception of Nearby Sound Sources, Binaural Rendering of Spherical Microphone Array Data, and Voice Directivity. 
The results of the studies included in this dissertation extend the current state of research in the respective research topics, answer specific psychoacoustic research questions, thereby yielding a better understanding of basic spatial hearing processes, and provide concepts, methods, and design parameters for the future implementation of technically and perceptually efficient binaural rendering. Funding: BMBF, 03FH014IX5, Natürliche raumbezogene Darbietung selbsterzeugter Schallereignisse in virtuellen auditiven Umgebungen (NarDasS).
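A recurring building block in the first research topic above (HRTF processing and interpolation) is interpolating a sparsely measured HRTF set to arbitrary directions via a spherical harmonic fit. The sketch below shows the plain least-squares variant for a single frequency bin; practical pipelines add time alignment, regularisation, and order-dependent preprocessing, and all names here are illustrative.

```python
import numpy as np
from scipy.special import sph_harm

def sh_basis(order, azi, col):
    # (K, (order+1)^2) matrix of spherical harmonics at K directions.
    return np.stack([sph_harm(m, n, azi, col)
                     for n in range(order + 1)
                     for m in range(-n, n + 1)], axis=-1)

def sh_interpolate(hrtfs, azi, col, order, azi_q, col_q):
    """Interpolate one frequency bin of measured HRTFs (shape (K,)) to a
    query direction by a least-squares spherical harmonic fit."""
    Y = sh_basis(order, azi, col)
    coeffs = np.linalg.lstsq(Y, hrtfs, rcond=None)[0]  # SH coefficients
    return sh_basis(order, np.atleast_1d(azi_q), np.atleast_1d(col_q)) @ coeffs
```

Run per frequency bin, this yields a continuous directional representation from a discrete measurement grid, which is what head-tracked binaural rendering needs.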

    A Geometric Deep Learning Approach to Sound Source Localization and Tracking

    The localization and tracking of sound sources using microphone arrays is a problem that, despite having attracted attention from the signal processing research community for decades, remains open. In recent years, deep learning models have surpassed the state-of-the-art that had been established by classic signal processing techniques, but these models still struggle with handling rooms with strong reverberation or tracking multiple sources that dynamically appear and disappear, especially when we cannot apply any criteria to classify or order them. In this thesis, we follow the ideas of the Geometric Deep Learning framework to propose new models and techniques that advance the state-of-the-art in the aforementioned scenarios. As the input of our models, we use acoustic power maps computed using the SRP-PHAT algorithm, a classic signal processing technique that allows us to estimate the acoustic energy received from any direction in space and, therefore, compute arbitrarily shaped power maps. In addition, we also propose a new technique to analytically cancel a source from the generalized cross-correlations used to compute the SRP-PHAT maps. Based on previous narrowband cancellation techniques, we prove that we can project the cross-correlation functions of the signals captured by a microphone array into a space orthogonal to a given direction by just computing a linear combination of time-shifted versions of the original cross-correlations.
The proposed cancellation technique can be used to design iterative multi-source localization systems where, after having found the strongest source in the generalized cross-correlation functions or in the SRP-PHAT maps, we can cancel it and find new sources that were previously masked by the first source. Before being able to train deep learning models we need data, which, in the case of following a supervised learning approach, means a dataset of multichannel recordings with the position of the sources accurately labeled. Although there exist some datasets like this, they are not large enough to train a neural network and the acoustic environments they include are not diverse enough. To overcome this lack of real data, we present a technique to simulate acoustic scenes with one or several moving sound sources and, to be able to perform these simulations as they are needed during training, we present what is, to the best of our knowledge, the first free and open-source room acoustics simulation library with GPU acceleration. As we prove in this thesis, the presented library is more than two orders of magnitude faster than other state-of-the-art CPU libraries. The main idea of the Geometric Deep Learning philosophy is that models should fit the symmetries (i.e., the invariances and equivariances) of the data and the problem we want to solve. For single-source direction-of-arrival estimation, the use of SRP-PHAT maps as inputs of our models makes the rotational equivariance of the problem clear and, after a first approach using 3D convolutional neural networks, we present a model using icosahedral convolutions that approximates the equivariance to the continuous group of spherical rotations by the equivariance to the discrete group of the 60 icosahedral symmetries.
We prove that the SRP-PHAT maps are a much more robust input feature than the spectrograms typically used in many state-of-the-art models and that the use of the icosahedral convolutions, combined with a new soft-argmax function that obtains a regression output from the output of the convolutional neural network by interpreting it as a probability distribution and computing its expected value, allows us to dramatically reduce the number of trainable parameters of the models without losing accuracy in their estimations. When we want to track multiple moving sources and cannot use any criteria to order or classify them, the problem becomes invariant to permutations of the estimates, so we cannot directly compare them with the ground-truth labels since we cannot expect them to be in the same order. Models of this kind have typically been trained using permutation-invariant training strategies, but these strategies usually do not penalize identity switches, and the models trained with them do not keep the identity of every source consistent during tracking. To solve this issue, we propose a new training strategy, which we call sliding permutation invariant training, that is able to optimize all the features that we could expect from a multi-source tracking system: the precision of the direction-of-arrival estimates, the accuracy of the source detections, and the consistency of the assigned identities. Finally, we propose a new kind of recursive neural network that, instead of using vectors as its input and its state, uses sets of vectors and is invariant to permutations of the elements of the input set and equivariant to permutations of the elements of the state set.
We show how this is the behavior that we should expect from a tracking model which takes as inputs the estimates of a multi-source localization model, and compare these permutation-invariant recursive neural networks with conventional gated recurrent units for sound source tracking applications.
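The SRP-PHAT maps used as the network input throughout the thesis can be sketched directly: whiten each microphone pair's cross-spectrum (GCC-PHAT), then accumulate the correlation values at the time differences implied by each candidate direction. This far-field, circular-shift version is a simplification of the thesis' implementation; the sign convention for the direction vectors is an assumption here.

```python
import numpy as np

def gcc_phat(x1, x2, eps=1e-12):
    # Phase transform: whitening leaves only the TDOA phase, making the
    # correlation peak sharp and robust to the source spectrum.
    X = np.fft.rfft(x1) * np.conj(np.fft.rfft(x2))
    return np.fft.irfft(X / (np.abs(X) + eps), n=len(x1))

def srp_phat_map(sigs, mic_pos, directions, fs, c=343.0):
    """sigs: (M, T) mic signals; directions: (D, 3) unit vectors.
    Returns the (D,) steered-response power map (far-field model)."""
    M, T = sigs.shape
    power = np.zeros(len(directions))
    for i in range(M):
        for j in range(i + 1, M):
            cc = gcc_phat(sigs[i], sigs[j])
            # TDOA of mic i relative to mic j for each candidate direction
            tau = (mic_pos[i] - mic_pos[j]) @ directions.T / c
            power += cc[np.round(tau * fs).astype(int) % T]
    return power
```

Evaluated over a dense spherical grid, the resulting map peaks at the source directions, which is exactly the feature the localization networks consume.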

    Array signal processing algorithms for localization and equalization in complex acoustic channels

    The reproduction of realistic soundscapes in consumer electronic applications has been a driving force behind the development of spatial audio signal processing techniques. In order to accurately reproduce or decompose a particular spatial sound field, being able to exploit or estimate the effects of the acoustic environment becomes essential. This requires both an understanding of the source of the complexity in the acoustic channel (the acoustic path between a source and a receiver) and the ability to characterize its spatial attributes. In this thesis, we explore how to exploit or overcome the effects of the acoustic channel for sound source localization and sound field reproduction. The behaviour of a typical acoustic channel can be visualized as a transformation of its free-field behaviour, due to scattering and reflections off the measurement apparatus and the surfaces in a room. These spatial effects can be modelled using the solutions to the acoustic wave equation, yet the physical nature of these scatterers typically results in complex behaviour with frequency. The first half of this thesis explores how to exploit this diversity in the frequency domain for sound source localization, a concept that has not been considered previously. We first extract down-converted subband signals from the broadband audio signal, and collate these signals such that the spatial diversity is retained. A signal model is then developed to exploit the channel's spatial information using a signal subspace approach. We show that this concept can be applied to multi-sensor arrays on complex-shaped rigid bodies as well as the special case of binaural localization. In both cases, an improvement in closely spaced source resolution is demonstrated over traditional techniques, through simulations and experiments using a KEMAR manikin.
The binaural analysis further indicates that human localization performance in certain spatial regions is limited by the lack of spatial diversity, as suggested by perceptual experiments in the literature. Finally, the possibility of exploiting known inter-subband correlated sources (e.g., speech) for localization in under-determined systems is demonstrated. The second half of this thesis considers reverberation control, where reverberation is modelled as a superposition of sound fields created by a number of spatially distributed sources. We consider the mode/wave-domain description of the sound field, and propose modelling the reverberant modes as linear transformations of the desired sound field modes. This is a novel concept, as we consider each mode transformation to be independent of the other modes. This model is then extended to sound field control, and used to derive the compensation signals required at the loudspeakers to equalize the reverberation. We show that estimating the reverberant channel and controlling the sound field now becomes a single adaptive filtering problem in the mode domain, where the modes can be adapted independently. The performance of the proposed method is compared with existing adaptive and non-adaptive sound field control techniques through simulations. Finally, it is shown that an order-of-magnitude reduction in computational complexity can be achieved, while maintaining comparable performance to existing adaptive control techniques.
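The mode-domain idea above (each reverberant mode modelled as an independent transformation of the desired mode, turning equalization into per-mode adaptive filtering) can be sketched with a scalar NLMS update per mode. This toy version assumes a memoryless per-mode gain; the name and the simplification are ours, not the thesis' exact algorithm.

```python
import numpy as np

def mode_domain_nlms(desired_modes, observed_modes, mu=0.5, eps=1e-8):
    """Per-mode NLMS: model each reverberant mode as an unknown scalar
    gain on the corresponding desired mode and adapt each independently.

    desired_modes, observed_modes: (T, M) frames of sound-field coefficients.
    Returns the estimated per-mode channel gains, shape (M,).
    """
    T, M = desired_modes.shape
    h = np.zeros(M, dtype=complex)
    for t in range(T):
        x, d = desired_modes[t], observed_modes[t]
        e = d - h * x                                   # per-mode error
        h += mu * e * np.conj(x) / (np.abs(x) ** 2 + eps)
        # loudspeaker compensation signals would be derived from h here
    return h
```

Because every mode adapts independently, the M-channel identification problem decouples into M scalar problems, which is the computational saving the thesis reports over joint adaptive control.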