124 research outputs found
Robust binaural localization of a target sound source by combining spectral source models and deep neural networks
Despite there being a clear evidence for top–down (e.g., attentional) effects in biological spatial hearing, relatively few machine hearing systems exploit the top–down model-based knowledge in sound localization. This paper addresses this issue by proposing a novel framework for the binaural sound localization that combines the model-based information about the spectral characteristics of sound sources and deep neural networks (DNNs). A target source model and a background source model are first estimated during a training phase using spectral features extracted from sound signals in isolation. When the identity of the background source is not available, a universal background model can be used. During testing, the source models are used jointly to explain the mixed observations and improve the localization process by selectively weighting source azimuth posteriors output by a DNN-based localization system. To address the possible mismatch between the training and testing, a model adaptation process is further employed the on-the-fly during testing, which adapts the background model parameters directly from the noisy observations in an iterative manner. The proposed system, therefore, combines the model-based and data-driven information flow within a single computational framework. The evaluation task involved localization of a target speech source in the presence of an interfering source and room reverberation. Our experiments show that by exploiting the model-based information in this way, the sound localization performance can be improved substantially under various noisy and reverberant conditions
Speech localisation in a multitalker mixture by humans and machines
Speech localisation in multitalker mixtures is affected by the listener’s expectations about the spatial arrangement of the sound sources. This effect was investigated via experiments with human listeners and a machine system, in which the task was to localise a female-voice target among four spatially distributed male-voice maskers. Two configurations were used: either the masker locations were fixed or the locations varied from trial-to-trial. The machine system uses deep neural networks (DNNs) to learn the relationship between binaural cues and source azimuth, and exploits top-down knowledge about the spectral characteristics of the target source. Performance was examined in both anechoic and reverberant conditions. Our experiments show that the machine system outperformed listeners in some conditions. Both the machine and listeners were able to make use of a priori knowledge about the spatial configuration of the sources, but the effect for headphone listening was smaller than that previously reported for listening in a real room
A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Reverberation continues to present a major problem for sound source separation algorithms, due to its corruption of many of the acoustical cues on which these algorithms rely. However, humans demonstrate a remarkable robustness to reverberation and many psychophysical and perceptual mechanisms are well documented. This thesis therefore considers the research question: can the reverberation–performance of existing psychoacoustic engineering approaches to machine source separation be improved? The precedence effect is a perceptual mechanism that aids our ability to localise sounds in reverberant environments. Despite this, relatively little work has been done on incorporating the precedence effect into automated sound source separation. Consequently, a study was conducted that compared several computational precedence models and their impact on the performance of a baseline separation algorithm. The algorithm included a precedence model, which was replaced with the other precedence models during the investigation. The models were tested using a novel metric in a range of reverberant rooms and with a range of other mixture parameters. The metric, termed Ideal Binary Mask Ratio, is shown to be robust to the effects of reverberation and facilitates meaningful and direct comparison between algorithms across different acoustic conditions. Large differences between the performances of the models were observed. The results showed that a separation algorithm incorporating a model based on interaural coherence produces the greatest performance gain over the baseline algorithm. The results from the study also indicated that it may be necessary to adapt the precedence model to the acoustic conditions in which the model is utilised. This effect is analogous to the perceptual Clifton effect, which is a dynamic component of the precedence effect that appears to adapt precedence to a given acoustic environment in order to maximise its effectiveness. However, no work has been carried out on adapting a precedence model to the acoustic conditions under test. Specifically, although the necessity for such a component has been suggested in the literature, neither its necessity nor benefit has been formally validated. Consequently, a further study was conducted in which parameters of each of the previously compared precedence models were varied in each room in order to identify if, and to what extent, the separation performance varied with these parameters. The results showed that the reverberation–performance of existing psychoacoustic engineering approaches to machine source separation can be improved and can yield significant gains in separation performance.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
The Use of Optimal Cue Mapping to Improve the Intelligibility and Quality of Speech in Complex Binaural Sound Mixtures.
A person with normal hearing has the ability to follow a particular conversation of interest in a noisy and reverberant environment, whilst simultaneously ignoring the interfering sounds. This task often becomes more challenging for individuals with a hearing impairment. Attending selectively to a sound source is difficult to replicate in machines, including devices such as hearing aids. A correctly set up hearing aid will work well in quiet conditions, but its performance may deteriorate seriously in the presence of competing sounds. To be of help in these more challenging situations the hearing aid should be able to segregate the desired sound source from any other, unwanted sounds.
This thesis explores a novel approach to speech segregation based on optimal cue mapping (OCM). OCM is a signal processing method for segregating a sound source based on spatial and other cues extracted from the binaural mixture of sounds arriving at a listener's ears. The spectral energy fraction of the target speech source in the mixture is estimated frame-by-frame using artificial neural networks (ANNs). The resulting target speech magnitude estimates for the left and right channels are combined with the corresponding original phase spectra to produce the final binaural output signal. The performance improvements delivered by the OCM algorithm are evaluated using the STOI and PESQ metrics for speech intelligibility and quality, respectively. A variety of increasingly challenging binaural mixtures are synthesised involving up to five spatially separate sound sources in both anechoic and reverberant environments. The segregated speech consistently exhibits gains in intelligibility and quality and compares favourably with a leading, somewhat more complex approach. The OCM method allows the selection and integration of multiple cues to be optimised and provides scalable performance benefits to suit the available computational resources. The ability to determine the varying relative importance of each cue in different acoustic conditions is expected to facilitate computationally efficient solutions suitable for use in a hearing aid, allowing the aid to operate effectively in a range of typical acoustic environments. Further developments are proposed to achieve this overall goal
Spatial auditory display for acoustics and music collections
PhDThis thesis explores how audio can be better incorporated into how people access
information and does so by developing approaches for creating three-dimensional audio
environments with low processing demands. This is done by investigating three research
questions.
Mobile applications have processor and memory requirements that restrict the
number of concurrent static or moving sound sources that can be rendered with binaural
audio. Is there a more e cient approach that is as perceptually accurate as the traditional
method? This thesis concludes that virtual Ambisonics is an ef cient and accurate means
to render a binaural auditory display consisting of noise signals placed on the horizontal
plane without head tracking. Virtual Ambisonics is then more e cient than convolution
of HRTFs if more than two sound sources are concurrently rendered or if movement of
the sources or head tracking is implemented.
Complex acoustics models require signi cant amounts of memory and processing. If
the memory and processor loads for a model are too large for a particular device, that
model cannot be interactive in real-time. What steps can be taken to allow a complex
room model to be interactive by using less memory and decreasing the computational
load? This thesis presents a new reverberation model based on hybrid reverberation
which uses a collection of B-format IRs. A new metric for determining the mixing
time of a room is developed and interpolation between early re
ections is investigated.
Though hybrid reverberation typically uses a recursive lter such as a FDN for the late
reverberation, an average late reverberation tail is instead synthesised for convolution
reverberation.
Commercial interfaces for music search and discovery use little aural information
even though the information being sought is audio. How can audio be used in
interfaces for music search and discovery? This thesis looks at 20 interfaces and
determines that several themes emerge from past interfaces. These include using a two
or three-dimensional space to explore a music collection, allowing concurrent playback of
multiple sources, and tools such as auras to control how much information is presented. A
new interface, the amblr, is developed because virtual two-dimensional spaces populated
by music have been a common approach, but not yet a perfected one. The amblr is also
interpreted as an art installation which was visited by approximately 1000 people over 5
days. The installation maps the virtual space created by the amblr to a physical space
Informed algorithms for sound source separation in enclosed reverberant environments
While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis thus aims at improving performance of audio separation algorithms when they are informed i.e. have access to source location information. These locations are assumed to be known a priori in this work, for example by video processing.
Initially, a multi-microphone array based method combined with binary
time-frequency masking is proposed. A robust least squares frequency invariant data independent beamformer designed with the location information is
utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking based post-processing is used but cepstral domain smoothing is required to mitigate musical noise.
To tackle the under-determined case and further improve separation performance
at higher reverberation times, a two-microphone based method
which is inspired by human auditory processing and generates soft time-frequency masks is described. In this approach interaural level difference,
interaural phase difference and mixing vectors are probabilistically modeled in the time-frequency domain and the model parameters are learned
through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, which is used as
the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic model framework that encodes the spatial
characteristics of the enclosure and further improves the separation performance
in challenging scenarios i.e. when sources are in close proximity and
when the level of reverberation is high.
Finally, new dereverberation based pre-processing is proposed based on the cascade of three dereverberation stages where each enhances the twomicrophone
reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation based pre-processing and use of soft mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed for example from speech signals from the TIMIT database and measured room impulse responses
Acoustic localization of people in reverberant environments using deep learning techniques
La localizaciĂłn de las personas a partir de informaciĂłn acĂşstica es cada vez más importante en aplicaciones del mundo real como la seguridad, la vigilancia y la interacciĂłn entre personas y robots. En muchos casos, es necesario localizar con precisiĂłn personas u objetos en funciĂłn del sonido que generan, especialmente en entornos ruidosos y reverberantes en los que los mĂ©todos de localizaciĂłn tradicionales pueden fallar, o en escenarios en los que los mĂ©todos basados en análisis de vĂdeo no son factibles por no disponer de ese tipo de sensores o por la existencia de oclusiones relevantes. Por ejemplo, en seguridad y vigilancia, la capacidad de localizar con precisiĂłn una fuente de sonido puede ayudar a identificar posibles amenazas o intrusos. En entornos sanitarios, la localizaciĂłn acĂşstica puede utilizarse para controlar los movimientos y actividades de los pacientes, especialmente los que tienen problemas de movilidad. En la interacciĂłn entre personas y robots, los robots equipados con capacidades de localizaciĂłn acĂşstica pueden percibir y responder mejor a su entorno, lo que permite interacciones más naturales e intuitivas con los humanos. Por lo tanto, el desarrollo de sistemas de localizaciĂłn acĂşstica precisos y robustos utilizando tĂ©cnicas avanzadas como el aprendizaje profundo es de gran importancia práctica. Es por esto que en esta tesis doctoral se aborda dicho problema en tres lĂneas de investigaciĂłn fundamentales: (i) El diseño de un sistema extremo a extremo (end-to-end) basado en redes neuronales capaz de mejorar las tasas de localizaciĂłn de sistemas ya existentes en el estado del arte. (ii) El diseño de un sistema capaz de localizar a uno o varios hablantes simultáneos en entornos con caracterĂsticas y con geometrĂas de arrays de sensores diferentes sin necesidad de re-entrenar. (iii) El diseño de sistemas capaces de refinar los mapas de potencia acĂşstica necesarios para localizar a las fuentes acĂşsticas para conseguir una mejor localizaciĂłn posterior. A la hora de evaluar la consecuciĂłn de dichos objetivos se han utilizado diversas bases de datos realistas con caracterĂsticas diferentes, donde las personas involucradas en las escenas pueden actuar sin ningĂşn tipo de restricciĂłn. Todos los sistemas propuestos han sido evaluados bajo las mismas condiciones consiguiendo superar en tĂ©rminos de error de localizaciĂłn a los sistemas actuales del estado del arte
Movements in Binaural Space: Issues in HRTF Interpolation and Reverberation, with applications to Computer Music
This thesis deals broadly with the topic of Binaural Audio. After reviewing the
literature, a reappraisal of the minimum-phase plus linear delay model for HRTF
representation and interpolation is offered. A rigorous analysis of threshold based
phase unwrapping is also performed. The results and conclusions drawn from these
analyses motivate the development of two novel methods for HRTF representation
and interpolation. Empirical data is used directly in a Phase Truncation method. A
Functional Model for phase is used in the second method based on the
psychoacoustical nature of Interaural Time Differences. Both methods are validated;
most significantly, both perform better than a minimum-phase method in subjective
testing.
The accurate, artefact-free dynamic source processing afforded by the above
methods is harnessed in a binaural reverberation model, based on an early reflection
image model and Feedback Delay Network diffuse field, with accurate interaural
coherence. In turn, these flexible environmental processing algorithms are used in
the development of a multi-channel binaural application, which allows the audition
of multi-channel setups in headphones. Both source and listener are dynamic in this
paradigm. A GUI is offered for intuitive use of the application.
HRTF processing is thus re-evaluated and updated after a review of accepted
practice. Novel solutions are presented and validated. Binaural reverberation is
recognised as a crucial tool for convincing artificial spatialisation, and is developed
on similar principles. Emphasis is placed on transparency of development practices,
with the aim of wider dissemination and uptake of binaural technology
- …