39 research outputs found

    PSD Estimation of Multiple Sound Sources in a Reverberant Room Using a Spherical Microphone Array

    We propose an efficient method to estimate source power spectral densities (PSDs) in a multi-source reverberant environment using a spherical microphone array. The proposed method utilizes the spatial correlation between the spherical harmonics (SH) coefficients of a sound field to estimate source PSDs. The use of the spatial cross-correlation of the SH coefficients allows us to employ the method in an environment with a higher number of sources compared to conventional methods. Furthermore, the orthogonality property of the SH basis functions saves the effort of designing specific beampatterns of a conventional beamformer-based method. We evaluate the performance of the algorithm with different numbers of sources in practical reverberant and non-reverberant rooms. We also demonstrate an application of the method by separating source signals using a conventional beamformer and a Wiener post-filter designed from the estimated PSDs. Comment: Accepted for WASPAA 201
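The abstract mentions a Wiener post-filter designed from the estimated PSDs. As a rough illustration only, the sketch below computes a textbook single-channel Wiener gain per frequency bin (desired PSD divided by total PSD). The function name `wiener_gains`, the `floor` parameter, and the PSD values are assumptions made for this example and are not taken from the paper.

```python
def wiener_gains(source_psd, interference_psd, noise_psd, floor=1e-12):
    """Per-bin Wiener gain: desired PSD over the sum of all PSD components."""
    gains = []
    for s, i, n in zip(source_psd, interference_psd, noise_psd):
        total = s + i + n
        # Guard against division by zero in bins that carry no energy.
        gains.append(s / max(total, floor))
    return gains


# Hypothetical PSD estimates for three frequency bins (arbitrary units).
target = [4.0, 1.0, 0.0]
others = [1.0, 1.0, 2.0]
noise = [0.0, 2.0, 2.0]
gains = wiener_gains(target, others, noise)
```

Each gain lies between 0 and 1 and would typically be multiplied with the beamformer output in the corresponding bin.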

    PSD Estimation and Source Separation in a Noisy Reverberant Environment using a Spherical Microphone Array

    In this paper, we propose an efficient technique for estimating individual power spectral density (PSD) components, i.e., the PSD of each desired sound source as well as of noise and reverberation, in a multi-source reverberant sound scene with coherent background noise. We formulate the problem in the spherical harmonics domain to take advantage of the inherent orthogonality of the spherical harmonics basis functions and extract the PSD components from the cross-correlation between the different sound field modes. We also investigate an implementation issue that occurs at the nulls of the Bessel functions and offer an engineering solution. The performance evaluation takes place in a practical environment with a commercial microphone array in order to measure the robustness of the proposed algorithm against all the deviations incurred in practice. We also exhibit an application of the proposed PSD estimator through a source separation algorithm and compare the performance with a contemporary method in terms of different objective measures.

    Spatial dissection of a soundfield using spherical harmonic decomposition

    A real-world soundfield often contains contributions from multiple desired and undesired sound sources. The performance of many acoustic systems, such as automatic speech recognition, audio surveillance, and teleconferencing, relies on their ability to extract the desired sound components in such a mixed environment. The existing solutions to this problem are constrained by various fundamental limitations and require enforcing different priors depending on the acoustic conditions, such as reverberation and the spatial distribution of sound sources. With the growing emphasis on and integration of audio applications in diverse technologies such as smart home and virtual reality appliances, it is imperative to advance source separation technology in order to overcome the limitations of the traditional approaches. To that end, we exploit the harmonic decomposition model to dissect a mixed soundfield into its underlying desired and undesired components based on source and signal characteristics. By analysing the spatial projection of a soundfield, we achieve multiple outcomes such as (i) soundfield separation with respect to distinct source regions, (ii) source separation in a mixed soundfield using a modal coherence model, and (iii) direction of arrival (DOA) estimation of multiple overlapping sound sources through pattern recognition of the modal coherence of a soundfield. We first employ an array of higher order microphones for soundfield separation in order to reduce hardware requirements and implementation complexity. Subsequently, we develop novel mathematical models for the modal coherence of noisy and reverberant soundfields that facilitate convenient ways of estimating DOA and power spectral densities, leading to robust source separation algorithms. The modal domain approach to soundfield/source separation allows us to circumvent several practical limitations of the existing techniques and enhance the performance and robustness of the system.
The proposed methods are presented with several practical applications and performance evaluations using simulated and real-life datasets.

    A Geometric Deep Learning Approach to Sound Source Localization and Tracking

    The localization and tracking of sound sources using microphone arrays is a problem that, although it has attracted attention from the signal processing research community for decades, remains open. In recent years, deep learning models have surpassed the state of the art that had been established by classic signal processing techniques, but these models still struggle with handling rooms with strong reverberation or tracking multiple sources that dynamically appear and disappear, especially when we cannot apply any criteria to classify or order them. In this thesis, we follow the ideas of the Geometric Deep Learning framework to propose new models and techniques that advance the state of the art in the aforementioned scenarios. As the input of our models, we use acoustic power maps computed using the SRP-PHAT algorithm, a classic signal processing technique that allows us to estimate the acoustic energy received from any direction of the space and, therefore, compute arbitrary-shaped power maps. In addition, we also propose a new technique to analytically cancel a source from the generalized cross-correlations used to compute the SRP-PHAT maps. Based on previous narrowband cancellation techniques, we prove that we can project the cross-correlation functions of the signals captured by a microphone array into a space orthogonal to a given direction by just computing a linear combination of time-shifted versions of the original cross-correlations.
The proposed cancellation technique can be used to design iterative multi-source localization systems where, after having found the strongest source in the generalized cross-correlation functions or in the SRP-PHAT maps, we can cancel it and find new sources that were previously masked by the first source. Before being able to train deep learning models we need data, which, in the case of following a supervised learning approach, means a dataset of multichannel recordings with the position of the sources accurately labeled. Although some datasets like this exist, they are not large enough to train a neural network and the acoustic environments they include are not diverse enough. To overcome this lack of real data, we present a technique to simulate acoustic scenes with one or several moving sound sources and, to be able to perform these simulations as they are needed during the training, we present what is, to the best of our knowledge, the first free and open source room acoustics simulation library with GPU acceleration. As we prove in this thesis, the presented library is more than two orders of magnitude faster than other state-of-the-art CPU libraries. The main idea of the Geometric Deep Learning philosophy is that the models should fit the symmetries (i.e. the invariances and equivariances) of the data and the problem we want to solve. For single-source direction of arrival estimation, the use of SRP-PHAT maps as inputs of our models makes the rotational equivariance of the problem clear and, after a first approach using 3D convolutional neural networks, we present a model using icosahedral convolutions that approximate the equivariance to the continuous group of spherical rotations by the equivariance to the discrete group of the 60 icosahedral symmetries.
We prove that the SRP-PHAT maps are a much more robust input feature than the spectrograms typically used in many state-of-the-art models and that the use of the icosahedral convolutions, combined with a new soft-argmax function that obtains a regression output from the output of the convolutional neural network by interpreting it as a probability distribution and computing its expected value, allows us to dramatically reduce the number of trainable parameters of the models without losing accuracy in their estimations. When we want to track multiple moving sources and we cannot use any criteria to order or classify them, the problem becomes invariant to the permutations of the estimates, so we cannot directly compare them with the ground truth labels since we cannot expect them to be in the same order. Models of this kind have typically been trained using permutation invariant training strategies, but these strategies usually do not penalize identity switches, so the models trained with them do not keep the identity of every source consistent during the tracking. To solve this issue, we propose a new training strategy, which we call sliding permutation invariant training (sPIT), that is able to optimize all the features that we could expect from a multi-source tracking system: the precision of the direction of arrival estimates, the accuracy of the source detections, and the consistency of the assigned identities. Finally, we propose a new kind of recursive neural network that, instead of using vectors as its input and its state, uses sets of vectors and is invariant to the permutations of the elements of the input set and equivariant to the permutations of the elements of the state set.
We show how this is the behavior that we should expect from a tracking model which takes as inputs the estimates of a multi-source localization model and compare these permutation-invariant recursive neural networks with conventional gated recurrent units for sound source tracking applications.
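The abstract describes a soft-argmax that interprets a network output as a probability distribution and takes its expected value. The sketch below is a simplified one-dimensional illustration of that idea, not the thesis implementation: the function name `soft_argmax`, the `beta` sharpness parameter, and the sample values are assumptions, and angular wrap-around and the spherical geometry of the real maps are ignored.

```python
import math


def soft_argmax(scores, coordinates, beta=1.0):
    """Softmax the scores, then return the expected value of the coordinates."""
    peak = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - peak)) for s in scores]
    total = sum(weights)
    return sum((w / total) * c for w, c in zip(weights, coordinates))


# Hypothetical map values sampled on a small azimuth grid (degrees).
azimuths = [0.0, 10.0, 20.0, 30.0, 40.0]
scores = [0.0, 2.0, 6.0, 2.0, 0.0]
estimate = soft_argmax(scores, azimuths)
```

Unlike a hard argmax, the result is differentiable with respect to the scores and can fall between grid points, which is what makes it usable as a regression output.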

    A room acoustics measurement system using non-invasive microphone arrays

    This thesis summarises research into adaptive room correction for small rooms and pre-recorded material, for example music or films. A measurement system to predict the sound at a remote location within a room, without a microphone at that location, was investigated. This would allow the sound within a room to be adaptively manipulated to ensure that all listeners receive optimum sound, thereby increasing their enjoyment. The solution presented uses small microphone arrays mounted on the room's walls. A unique geometry and processing system was designed, incorporating three processing stages: temporal, spatial and spectral. The temporal processing identifies individual reflection arrival times from the recorded data. Spatial processing estimates the angles of arrival of the reflections so that the three-dimensional coordinates of the reflections' origin can be calculated. The spectral processing then estimates the frequency response of the reflection. These estimates allow a mathematical model of the room to be calculated, based on the acoustic measurements made in the actual room. The model can then be used to predict the sound at different locations within the room. A simulated model of a room was produced to allow fast development of algorithms. Measurements in real rooms were then conducted and analysed to verify the theoretical models developed and to aid further development of the system. Results from these measurements and simulations, for each processing stage, are presented.
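The step from arrival time and angles of arrival to three-dimensional coordinates can be illustrated with a spherical-to-Cartesian conversion. This is only a sketch under stated assumptions: the array sits at the origin, the delay is measured from the moment of emission, the computed point is the apparent origin of the reflection along a straight path, and the speed of sound is taken as 343 m/s. The function name `reflection_origin` is hypothetical and does not come from the thesis.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value


def reflection_origin(delay_s, azimuth_deg, elevation_deg, c=SPEED_OF_SOUND):
    """Convert a propagation delay and arrival angles to Cartesian coordinates."""
    distance = c * delay_s  # path length travelled during the delay
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.cos(az)
    y = distance * math.cos(el) * math.sin(az)
    z = distance * math.sin(el)
    return (x, y, z)


# Hypothetical reflection arriving 10 ms after emission from azimuth 90 degrees.
origin = reflection_origin(0.01, 90.0, 0.0)
```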

    The design and implementation of an acoustic phased array transmitter for the demonstration of MIMO techniques

    MIMO radar algorithms are the latest generation of techniques that can be applied to array radars. They offer the potential to improve the radar resolution, increase the number of targets that can be identified and give added flexibility in beampattern design. However, little experimental data demonstrating MIMO radar is available because radar arrays are already expensive systems and MIMO extends the complexity and cost further. An acoustic array, which works on the same principles as a radio frequency radar array, can be built at a fraction of the cost of a real radar system. The novel contribution of this project was the demonstration of MIMO radar techniques on an acoustic array, which was designed and built for this purpose. To achieve the project objectives, the theory of traditional phased array radar techniques and MIMO techniques was researched. The phased array and MIMO techniques were also simulated under narrowband and wideband conditions, and the strengths and weaknesses of each were highlighted. This was followed by the design and implementation of a low cost audible acoustic transmitter array to be used with an existing receiver array to demonstrate the investigated array radar techniques. Finally, the techniques were tested on the hardware platform. The simulation and hardware test results were used to evaluate and compare the performance of phased array and MIMO radar techniques. The beampattern design flexibility that is offered by MIMO radar was demonstrated with the transmission and measurement of omnidirectional, single-lobed and multi-lobed MIMO beampatterns. Also, parameter estimation experiments were performed where phased array and MIMO radar signals were transmitted. Phased array techniques were shown to be simple, effective and robust.
The MIMO Capon, APES and GLRT parameter estimation techniques were shown to be sensitive to the type of signals transmitted, and in most cases, the added complexity of these techniques did not lead to improved target parameter estimation results. However, the MIMO technique of transmitter beamforming on reception gave high resolution target range and angle estimates, living up to the expectations placed on MIMO radar.

    Proceedings of the EAA Spatial Audio Signal Processing symposium: SASP 2019

    International audience

    Mixture of beamformers for speech separation and extraction

    In many audio applications, the signal of interest is corrupted by acoustic background noise, interference, and reverberation. The presence of these contaminations can significantly degrade the quality and intelligibility of the audio signal. This makes it important to develop signal processing methods that can separate the competing sources and extract a source of interest. The estimated signals may then be either directly listened to, transmitted, or further processed, giving rise to a wide range of applications such as hearing aids, noise-cancelling headphones, human-computer interaction, surveillance, and hands-free telephony. Many of the existing approaches to speech separation/extraction rely on beamforming techniques. These techniques approach the problem from a spatial point of view; a microphone array is used to form a spatial filter which can extract a signal from a specific direction and reduce the contamination of signals from other directions. However, when there are fewer microphones than sources (the underdetermined case), perfect attenuation of all interferers becomes impossible and only partial interference attenuation is possible. In this thesis, we present a framework which extends the use of beamforming techniques to underdetermined speech mixtures. We describe a frequency-domain non-linear mixture of beamformers that can extract a speech source from a known direction. Our approach models the data in each frequency bin via Gaussian mixture distributions, which can be learned using the expectation maximization algorithm. The model learning is performed using the observed mixture signals only, and no prior training is required. The signal estimator comprises a set of minimum mean square error (MMSE), minimum variance distortionless response (MVDR), or minimum power distortionless response (MPDR) beamformers.
In order to estimate the signal, all beamformers are concurrently applied to the observed signal, and the weighted sum of the beamformers' outputs is used as the signal estimator, where the weights are the estimated posterior probabilities of the Gaussian mixture states. These weights are specific to each time-frequency point. The resulting non-linear beamformers do not need to know or estimate the number of sources, and can be applied to microphone arrays with two or more microphones with arbitrary array configuration. We test and evaluate the described methods on underdetermined speech mixtures. Experimental results for the non-linear beamformers in underdetermined mixtures with room reverberation confirm their capability to successfully extract speech sources.
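The weighted-sum estimator described above can be sketched for a single time-frequency point as follows. This is a minimal illustration, not the thesis code: the beamformer weights and posterior probabilities are given as hypothetical fixed numbers, whereas in the described method they would come from the MMSE/MVDR/MPDR designs and from the learned Gaussian mixture model. The function names are assumptions.

```python
def apply_beamformer(weights, observation):
    """Beamformer output w^H x for one time-frequency point."""
    return sum(w.conjugate() * x for w, x in zip(weights, observation))


def mixture_estimate(beamformers, posteriors, observation):
    """Posterior-weighted sum of the outputs of all beamformers."""
    return sum(
        p * apply_beamformer(w, observation)
        for w, p in zip(beamformers, posteriors)
    )


# Hypothetical two-microphone observation at one time-frequency point.
x = [1 + 1j, 1 - 1j]
# Hypothetical weights of two beamformers, one per mixture state.
beamformers = [[0.5 + 0j, 0.5 + 0j], [1 + 0j, 0j]]
# Hypothetical posterior probabilities of the two states (they sum to 1).
posteriors = [0.75, 0.25]
estimate = mixture_estimate(beamformers, posteriors, x)
```

Because the posteriors change from one time-frequency point to the next, the overall estimator is non-linear in the observations even though each individual beamformer is linear.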

    Applications of compressive sensing to direction of arrival estimation

    Direction of Arrival (DOA) estimation of plane waves impinging on an array of sensors is one of the most important tasks in array signal processing and has attracted tremendous research interest over the past several decades. The estimated DOAs are used in various applications such as localization of transmitting sources, massive MIMO and 5G networks, tracking and surveillance in radar, and many others. The major objective in DOA estimation is to develop approaches that reduce the hardware complexity in terms of receiver costs and power consumption, while providing a desired level of estimation accuracy and robustness in the presence of multiple sources and/or multiple paths. Compressive sensing (CS) is a novel sampling methodology merging signal acquisition and compression. It allows for sampling a signal with a rate below the conventional Nyquist bound. In essence, it has been shown that signals can be acquired at sub-Nyquist sampling rates without loss of information provided they possess a sufficiently sparse representation in some domain and that the measurement strategy is suitably chosen. CS has recently been applied to DOA estimation, leveraging the fact that a superposition of planar wavefronts corresponds to a sparse angular power spectrum. This dissertation investigates the application of compressive sensing to the DOA estimation problem with the goal of reducing the hardware complexity and/or achieving a high resolution and a high level of robustness. Many CS-based DOA estimation algorithms have been proposed in recent years, showing tremendous advantages with respect to the complexity of the numerical solution while being insensitive to source correlation and allowing arbitrary array geometries.
Moreover, CS has also been suggested for use in the spatial domain with the main goal of reducing the complexity of the measurement process by using fewer RF chains and storing less measured data without the loss of any significant information. In the first part of the work we investigate the model mismatch problem for CS-based DOA estimation algorithms off the grid. To apply the CS framework, a very common approach is to construct a finite dictionary by sampling the angular domain with a predefined sampling grid. However, the target locations are almost surely not located exactly on a subset of these grid points. This leads to a model mismatch which deteriorates the performance of the estimators. We take an analytical approach to investigate the effect of such grid offsets on the recovered spectra, showing that each off-grid source can be well approximated by the two neighboring points on the grid. We propose a simple and efficient scheme to estimate the grid offset for a single source or multiple well-separated sources. We also discuss a numerical procedure for the joint estimation of the grid offsets of closer sources. In the second part of the thesis we study the design of compressive antenna arrays for DOA estimation that aim to provide a larger aperture with a reduced hardware complexity and reconfigurability, by linearly combining the antenna outputs into a lower number of receiver channels. We present a basic receiver architecture of such a compressive array and introduce a generic system model that includes different options for the hardware implementation. We then discuss the design of the analog combining network that performs the receiver channel reduction. Our numerical simulations demonstrate the superiority of the proposed optimized compressive arrays compared to sparse arrays of the same complexity and to compressive arrays with randomly chosen combining kernels.
Finally, we consider two other applications of the sparse recovery and compressive arrays. The first application is CS-based time delay estimation and the other one is compressive channel sounding. We show that the proposed approaches for sparse recovery off the grid and compressive arrays yield significant improvements in the considered applications compared to conventional methods.
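The grid-mismatch idea can be illustrated with a small numerical experiment. The sketch below is not the dissertation's method: it assumes a uniform linear array with half-wavelength spacing, eight sensors, a 10-degree angular grid, and a single noiseless source at 12 degrees, all chosen only for this example. It builds the finite dictionary of steering vectors and checks which grid atoms correlate most strongly with the off-grid source.

```python
import cmath
import math


def steering_vector(angle_deg, num_sensors):
    """Steering vector of a half-wavelength-spaced uniform linear array."""
    phase = math.pi * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase * m) for m in range(num_sensors)]


def correlation(a, b):
    """Magnitude of the inner product a^H b."""
    return abs(sum(x.conjugate() * y for x, y in zip(a, b)))


NUM_SENSORS = 8  # assumed array size
grid = list(range(-90, 91, 10))  # assumed sampling grid in degrees
dictionary = {angle: steering_vector(angle, NUM_SENSORS) for angle in grid}

true_angle = 12.0  # off-grid source, between the 10 and 20 degree atoms
measurement = steering_vector(true_angle, NUM_SENSORS)

# Rank grid points by how strongly their atoms correlate with the measurement.
ranked = sorted(
    grid,
    key=lambda g: correlation(dictionary[g], measurement),
    reverse=True,
)
best_two = sorted(ranked[:2])
```

In this toy setting the two strongest atoms are the grid neighbors of the true angle, which is consistent with the abstract's statement that an off-grid source is well approximated by its two neighboring grid points; estimating the actual offset would require the scheme proposed in the dissertation.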

    Design of large polyphase filters in the Quadratic Residue Number System
