Enabling technologies for audio augmented reality systems
Audio augmented reality (AAR) refers to technology that embeds computer-generated auditory content into a user's real acoustic environment. An AAR system has specific requirements that set it apart from regular human-computer interfaces: an audio playback system to allow the simultaneous perception of real and virtual sounds; motion tracking to enable interactivity and location-awareness; the design and implementation of auditory display to deliver AAR content; and spatial rendering to display spatialised AAR content. This thesis presents a series of studies on enabling technologies to meet these requirements.
A binaural headset with integrated microphones is assumed as the audio playback system, as it allows mobility and precise control over the ear input signals. Here, user position and orientation tracking methods are proposed that rely on speech signals recorded at the binaural headset microphones. To evaluate the proposed methods, the head orientations and positions of three conferees engaged in a discussion were tracked. The binaural microphones improved tracking performance substantially. The proposed methods are applicable to acoustic tracking with other forms of user-worn microphones.
Results from a listening test investigating the effect of auditory display parameters on user performance are reported. The parameters studied were derived from the design choices to be made when implementing auditory display. The results indicate that users are able to detect a sound sample among distractors and estimate sample numerosity accurately with both speech and non-speech audio, if the samples are presented with adequate temporal separation. Whether or not samples were separated spatially had no effect on user performance. However, with spatially separated samples, users were able to detect a sample among distractors and simultaneously localise it. The results of this study are applicable to a variety of AAR applications that require conveying sample presence or numerosity.
Spatial rendering is commonly implemented by convolving virtual sounds with head-related transfer functions (HRTFs). Here, a framework is proposed that interpolates HRTFs measured at arbitrary directions and distances. The framework employs Delaunay triangulation to group HRTFs into subsets suitable for interpolation and barycentric coordinates as interpolation weights. The proposed interpolation framework allows the real-time rendering of virtual sources in the near field via HRTFs measured at various distances.
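The barycentric weighting mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a 2D simplification in which each measurement direction is an (azimuth, elevation) point and the enclosing triangle has already been found (for example by a Delaunay triangulation). The function names `barycentric_weights` and `interpolate_hrir` and the toy impulse responses are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (azimuth, elevation) in degrees; a 2D simplification


def barycentric_weights(p: Point, a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Return (wa, wb, wc) with wa + wb + wc == 1 and p == wa*a + wb*b + wc*c."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle: the three vertices are collinear")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def interpolate_hrir(weights: Sequence[float], hrirs: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of equally long impulse responses measured at the triangle vertices."""
    n = len(hrirs[0])
    return [sum(w * h[i] for w, h in zip(weights, hrirs)) for i in range(n)]


# Toy example: three measured directions and one target direction inside the triangle.
w = barycentric_weights((2.5, 2.5), (0.0, 0.0), (10.0, 0.0), (0.0, 10.0))
hrir = interpolate_hrir(w, [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
```

A point inside the triangle yields three non-negative weights; a negative weight signals that the target lies outside and a different triangle should be chosen. Handling distance as well as direction would need a 3D generalisation with tetrahedra and four weights.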
Use of Pattern Classification Algorithms to Interpret Passive and Active Data Streams from a Walking-Speed Robotic Sensor Platform
In order to perform useful tasks for us, robots must have the ability to notice, recognize, and respond to objects and events in their environment. This requires the acquisition and synthesis of information from a variety of sensors. Here we investigate the performance of a number of sensor modalities in an unstructured outdoor environment, including the Microsoft Kinect, thermal infrared camera, and coffee can radar. Special attention is given to acoustic echolocation measurements of approaching vehicles, where an acoustic parametric array propagates an audible signal to the oncoming target and the Kinect microphone array records the reflected backscattered signal. Although useful information about the target is hidden inside the noisy time-domain measurements, the Dynamic Wavelet Fingerprint process (DWFP) is used to create a time-frequency representation of the data. A small-dimensional feature vector is created for each measurement using an intelligent feature selection process for use in statistical pattern classification routines. Using our experimentally measured data from real vehicles at 50 m, this process is able to correctly classify vehicles into one of five classes with 94% accuracy. Fully three-dimensional simulations allow us to study the nonlinear beam propagation and interaction with real-world targets to improve classification results.
A Geometric Deep Learning Approach to Sound Source Localization and Tracking
The localization and tracking of sound sources using microphone arrays is a problem that, although it has attracted attention from the signal processing research community for decades, remains open. In recent years, deep learning models have surpassed the state of the art that had been established by classic signal processing techniques, but these models still struggle with rooms with strong reverberation or with tracking multiple sources that dynamically appear and disappear, especially when we cannot apply any criteria to classify or order them. In this thesis, we follow the ideas of the Geometric Deep Learning framework to propose new models and techniques that advance the state of the art in the aforementioned scenarios. As the input of our models, we use acoustic power maps computed using the SRP-PHAT algorithm, a classic signal processing technique that allows us to estimate the acoustic energy received from any direction of the space and, therefore, compute arbitrary-shaped power maps. In addition, we also propose a new technique to analytically cancel a source from the generalized cross-correlations used to compute the SRP-PHAT maps. Based on previous narrowband cancellation techniques, we prove that we can project the cross-correlation functions of the signals captured by a microphone array into a space orthogonal to a given direction by just computing a linear combination of time-shifted versions of the original cross-correlations.
The proposed cancellation technique can be used to design iterative multi-source localization systems where, after having found the strongest source in the generalized cross-correlation functions or in the SRP-PHAT maps, we can cancel it and find new sources that were previously masked by the first source. Before being able to train deep learning models we need data, which, in the case of following a supervised learning approach, means a dataset of multichannel recordings with the position of the sources accurately labeled. Although some datasets like this exist, they are not large enough to train a neural network and the acoustic environments they include are not diverse enough. To overcome this lack of real data, we present a technique to simulate acoustic scenes with one or several moving sound sources and, to be able to perform these simulations as they are needed during the training, we present what is, to the best of our knowledge, the first free and open source room acoustics simulation library with GPU acceleration. As we prove in this thesis, the presented library is more than two orders of magnitude faster than other state-of-the-art CPU libraries. The main idea of the Geometric Deep Learning philosophy is that the models should fit the symmetries (i.e. the invariances and equivariances) of the data and the problem we want to solve. For single-source direction of arrival estimation, the use of SRP-PHAT maps as inputs of our models makes the rotational equivariance of the problem undeniably clear and, after a first approach using 3D convolutional neural networks, we present a model using icosahedral convolutions that approximate the equivariance to the continuous group of spherical rotations by the discrete group of the 60 icosahedral symmetries.
We prove that the SRP-PHAT maps are a much more robust input feature than the spectrograms typically used in many state-of-the-art models and that the use of the icosahedral convolutions, combined with a new soft-argmax function that obtains a regression output from the output of the convolutional neural network by interpreting it as a probability distribution and computing its expected value, allows us to dramatically reduce the number of trainable parameters of the models without losing accuracy in their estimations. When we want to track multiple moving sources and we cannot use any criteria to order or classify them, the problem becomes invariant to the permutations of the estimates, so we cannot directly compare them with the ground truth labels since we cannot expect them to be in the same order. Models of this kind have typically been trained using permutation invariant training strategies, but these strategies usually do not penalize identity switches, so the models trained with them do not keep the identity of every source consistent during the tracking. To solve this issue, we propose a new training strategy, which we call sliding permutation invariant training, that is able to optimize all the features that we could expect from a multi-source tracking system: the precision of the direction of arrival estimates, the accuracy of the source detections, and the consistency of the assigned identities. Finally, we propose a new kind of recursive neural network that, instead of using vectors as its input and its state, uses sets of vectors and is invariant to the permutation of the elements of the input set and equivariant to the permutations of the elements of the state set.
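The soft-argmax idea described above (read the network output as a probability distribution and take its expected value) can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis's formulation: the function name `soft_argmax`, the sharpness parameter `beta`, and the use of plain Cartesian coordinates for the map cells are all hypothetical.

```python
import math
from typing import List, Sequence


def soft_argmax(scores: Sequence[float],
                coords: Sequence[Sequence[float]],
                beta: float = 1.0) -> List[float]:
    """Differentiable stand-in for argmax.

    scores: one raw network output per map cell.
    coords: the coordinate vector associated with each cell.
    beta:   assumed sharpness factor; larger values approach a hard argmax.
    """
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax: scores read as a distribution
    dim = len(coords[0])
    # The expected value of the coordinates under that distribution is the output.
    return [sum(p * c[d] for p, c in zip(probs, coords)) for d in range(dim)]


# Four cells on a 2x2 grid with equal scores: the estimate is the grid centre.
centre = soft_argmax([0.0, 0.0, 0.0, 0.0], [(0, 0), (2, 0), (0, 2), (2, 2)])
```

Because the output is a weighted average, it can fall between grid cells, which is what turns a classification-style map into a regression output. For directions on a sphere, averaging angles directly would wrap incorrectly, so averaging unit vectors and normalising the result is one possible workaround.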
We show how this is the behavior that we should expect from a tracking model which takes as inputs the estimates of a multi-source localization model and compare these permutation-invariant recursive neural networks with the conventional gated recurrent units for sound source tracking applications.
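For readers unfamiliar with permutation invariant training, the following sketch shows the conventional frame-wise variant that the abstract contrasts with its sliding version. It is an assumption-laden illustration, not the proposed sPIT: the function names and the mean squared error criterion are hypothetical choices.

```python
from itertools import permutations
from typing import Sequence, Tuple


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pit_loss(estimates: Sequence[Sequence[float]],
             targets: Sequence[Sequence[float]]) -> Tuple[float, Tuple[int, ...]]:
    """Return the lowest mean squared error over all estimate-to-target
    assignments, together with the permutation that achieves it."""
    n = len(targets)
    best_loss = float("inf")
    best_perm: Tuple[int, ...] = tuple(range(n))
    for perm in permutations(range(n)):
        loss = sum(squared_error(estimates[i], targets[j])
                   for i, j in enumerate(perm)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# The estimates are correct but listed in the opposite order to the labels.
loss, perm = pit_loss([(1.0, 0.0), (0.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)])
```

When this loss is evaluated independently for each frame, the best permutation may change from one frame to the next at no cost, which is the identity-switch problem the abstract describes. The exhaustive search grows factorially with the number of sources, so it is only practical for a handful of them.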
Investigation of spacecraft cluster autonomy through an acoustic imaging interferometric testbed
Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1999. Includes bibliographical references (p. 169-173). The development and use of a novel testbed architecture is presented. Separated spacecraft interferometers have been proposed for applications in sparse aperture radar or astronomical observations. Modeled after these systems, an integrated hardware and software interferometry testbed is developed. Utilizing acoustic sources and sensors as a simplified analog to radio or optical systems, the Acoustic Imaging Testbed's simplest function is that of a Michelson interferometer. Robot arms control the motion of microphones. Through successive measurements an acoustic image can be formed. On top of this functionality, a layered software architecture is developed. This software creates a virtual environment that mimics the command, control and communications functions appropriate to a space interferometer. Autonomous spacecraft agents interact within this environment as the logical equivalent of distributed satellites. Optimal imaging configurations are validated. A scalable approach to cluster autonomy is discussed. By John Enright. S.M.
Application of sound source separation methods to advanced spatial audio systems
This thesis is related to the field of Sound Source Separation (SSS). It addresses the development
and evaluation of these techniques for their application in the resynthesis of high-realism sound scenes by
means of Wave Field Synthesis (WFS). Because the vast majority of audio recordings are preserved in two-channel
stereo format, special up-converters are required to use advanced spatial audio reproduction formats,
such as WFS. This is due to the fact that WFS needs the original source signals to be available, in order to
accurately synthesize the acoustic field inside an extended listening area. Thus, an object-based mixing is
required.
Source separation problems in digital signal processing are those in which several signals have been mixed
together and the objective is to find out what the original signals were. Therefore, SSS algorithms can be applied
to existing two-channel mixtures to extract the different objects that compose the stereo scene. Unfortunately,
most stereo mixtures are underdetermined, i.e., there are more sound sources than audio channels. This
condition makes the SSS problem especially difficult, and stronger assumptions have to be made, often related to
the sparsity of the sources under some signal transformation.
This thesis is focused on the application of SSS techniques to the spatial sound reproduction field. As a result,
its contributions can be categorized within these two areas. First, two underdetermined SSS methods are
proposed to deal efficiently with the separation of stereo sound mixtures. These techniques are based on a
multi-level thresholding segmentation approach, which enables fast and unsupervised separation of
sound sources in the time-frequency domain. Although both techniques rely on the same clustering type, the
features considered by each of them are related to different localization cues that enable the separation
of either instantaneous or real mixtures. Additionally, two post-processing techniques aimed at
improving the isolation of the separated sources are proposed. The performance achieved by
several SSS methods in the resynthesis of WFS sound scenes is afterwards evaluated by means of
listening tests, paying special attention to the change observed in the perceived spatial attributes.
Although the estimated sources are distorted versions of the original ones, the masking effects
involved in their spatial remixing make artifacts less perceptible, which improves the overall
assessed quality. Finally, some novel developments related to the application of time-frequency
processing to source localization and enhanced sound reproduction are presented.
Cobos Serrano, M. (2009). Application of sound source separation methods to advanced spatial audio systems [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8969
Understanding sorting algorithms using music and spatial distribution
This thesis is concerned with the communication of information using auditory
techniques. In particular, a music-based interface has been used to communicate the
operation of a number of sorting algorithms to users. This auditory interface has been
further enhanced by the creation of an auditory scene including a sound wall, which
enables the auditory interface to utilise music parameters in conjunction with 2D/3D
spatial distribution to communicate the essential processes in the algorithms.
The sound wall has been constructed from a grid of measurements using a human head to
create a spatial distribution. The algorithm designer can therefore communicate events
using pitch, rhythm and timbre and associate these with particular positions in space. A
number of experiments have been carried out to investigate the usefulness of music and
the sound wall in communicating information relevant to the algorithms. Further, user
understanding of the six algorithms has been tested. In all experiments the effects of
previous musical experience have been allowed for. The results show that users can utilise musical parameters in understanding algorithms
and that in all cases improvements have been observed using the sound wall. Different
user performance was observed with different algorithms and it is concluded that certain
types of information lend themselves more readily to communication through auditory
interfaces than others.
As a result of the experimental analysis, recommendations are given on how to improve
the sound wall and user understanding by improved choice of the musical mappings.
Tracking interacting targets in multi-modal sensors
PhD. Object tracking is one of the fundamental tasks in various applications such as surveillance,
sports, video conferencing and activity recognition. Factors such as occlusions,
illumination changes and limited field of observance of the sensor make tracking a challenging
task. To overcome these challenges the focus of this thesis is on using multiple
modalities such as audio and video for multi-target, multi-modal tracking. Particularly,
this thesis presents contributions to four related research topics, namely, pre-processing of
input signals to reduce noise, multi-modal tracking, simultaneous detection and tracking,
and interaction recognition.
To improve the performance of detection algorithms, especially in the presence
of noise, this thesis investigates filtering of the input data through spatio-temporal feature
analysis as well as through frequency band analysis. The pre-processed data from multiple
modalities is then fused within Particle filtering (PF). To further minimise the discrepancy
between the real and the estimated positions, we propose a strategy that associates the
hypotheses and the measurements with a real target, using a Weighted Probabilistic Data
Association (WPDA). Since the filtering involved in the detection process reduces the
available information and is inapplicable on low signal-to-noise ratio data, we investigate
simultaneous detection and tracking approaches and propose a multi-target track-before-detect
Particle filtering (MT-TBD-PF). The proposed MT-TBD-PF algorithm bypasses
the detection step and performs tracking in the raw signal. Finally, we apply the proposed
multi-modal tracking to recognise interactions between targets in regions within, as well
as outside the cameras' fields of view.
The efficiency of the proposed approaches is demonstrated on large uni-modal,
multi-modal and multi-sensor scenarios from real-world detection, tracking and event
recognition datasets and through participation in evaluation campaigns.
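The particle filtering that underlies the tracker can be illustrated with one measurement update and one resampling step. This is a generic one-dimensional sketch, not the thesis's WPDA or MT-TBD-PF algorithms: the Gaussian likelihood, the function names, and the fixed resampling offset (normally drawn uniformly from [0, 1)) are assumptions made to keep the example small and deterministic.

```python
import math
from typing import List, Sequence


def update_weights(particles: Sequence[float], weights: Sequence[float],
                   measurement: float, sigma: float) -> List[float]:
    """Reweight each particle by an assumed Gaussian measurement likelihood."""
    lik = [math.exp(-0.5 * ((measurement - x) / sigma) ** 2) for x in particles]
    unnorm = [w * l for w, l in zip(weights, lik)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("all particle weights collapsed to zero")
    return [u / total for u in unnorm]


def systematic_resample(particles: Sequence[float], weights: Sequence[float],
                        offset: float = 0.5) -> List[float]:
    """Systematic resampling; `offset` is fixed here only for reproducibility."""
    n = len(particles)
    out: List[float] = []
    j = 0
    cumulative = weights[0]
    for i in range(n):
        pos = (i + offset) / n  # evenly spaced positions in [0, 1)
        while pos > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        out.append(particles[j])
    return out


# Four position hypotheses, uniform prior, one measurement at 2.0.
particles = [0.0, 1.0, 2.0, 3.0]
weights = update_weights(particles, [0.25] * 4, measurement=2.0, sigma=0.5)
estimate = sum(w * x for w, x in zip(weights, particles))
resampled = systematic_resample(particles, weights)
```

After the update, most of the weight sits on the particle nearest the measurement, and resampling replaces unlikely hypotheses with copies of likely ones. A full tracker would add a motion model between steps and, for multi-modal fusion, combine the likelihoods of the audio and video measurements.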
Grounding semantics in robots for Visual Question Answering
In this thesis I describe an operational implementation of an object detection and description system that is incorporated into an end-to-end Visual Question Answering system, and I evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.
Efficient Acoustic Simulation for Immersive Media and Digital Fabrication
Sound is a crucial part of our life. Well-designed acoustic behaviors can lead to significant improvement in both physical and virtual interactions. In computer graphics, most existing methods have focused primarily on improving accuracy. How to develop efficient acoustic simulation algorithms for interactive practical applications remains underexplored.
The challenges arise from the dilemma between expensive accurate simulations and fast feedback demanded by intuitive user interaction: traditional physics-based acoustic simulations are computationally expensive; yet, for end users to benefit from the simulations, it is crucial to give prompt feedback during interactions.
In this thesis, I investigate how to develop efficient acoustic simulations for real-world applications such as immersive media and digital fabrication. To address the above-mentioned challenges, I leverage precomputation and optimization to significantly improve the speed while preserving the accuracy of complex acoustic phenomena. This work discusses three efforts along this research direction: First, to ease the sound designer's workflow, we developed a fast keypoint-based precomputation algorithm to enable interactive acoustic transfer values in virtual sound simulations. Second, for realistic audio editing in 360° videos, we proposed an inverse material optimization based on fast sound simulation and a hybrid ambisonic audio synthesis that exploits the directional isotropy in spatial audio. Third, we devised a modular approach to efficiently simulate and optimize fabrication-ready acoustic filters, achieving orders-of-magnitude speedup while maintaining the simulation accuracy. Through this series of projects, I demonstrate a wide range of applications made possible by efficient acoustic simulations.