54 research outputs found

    Robust binaural localization of a target sound source by combining spectral source models and deep neural networks

    Despite clear evidence for top-down (e.g., attentional) effects in biological spatial hearing, relatively few machine hearing systems exploit top-down, model-based knowledge in sound localization. This paper addresses this issue by proposing a novel framework for binaural sound localization that combines model-based information about the spectral characteristics of sound sources with deep neural networks (DNNs). A target source model and a background source model are first estimated during a training phase using spectral features extracted from sound signals in isolation. When the identity of the background source is not available, a universal background model can be used. During testing, the source models are used jointly to explain the mixed observations and to improve localization by selectively weighting the source azimuth posteriors output by a DNN-based localization system. To address possible mismatch between training and testing conditions, a model adaptation process is further employed on the fly during testing, adapting the background model parameters directly from the noisy observations in an iterative manner. The proposed system therefore combines model-based and data-driven information within a single computational framework. The evaluation task involved localization of a target speech source in the presence of an interfering source and room reverberation. Our experiments show that by exploiting model-based information in this way, sound localization performance can be improved substantially under various noisy and reverberant conditions.
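The selective weighting of DNN azimuth posteriors by source-model evidence can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `localize` and the frame-level likelihood inputs are assumptions.

```python
import numpy as np

def localize(azimuth_posteriors, target_likelihood, background_likelihood):
    """Weight per-frame DNN azimuth posteriors by how strongly each
    frame is explained by the target source model, then pool over time.

    azimuth_posteriors: (frames, n_azimuths) DNN outputs, rows sum to 1
    target_likelihood, background_likelihood: (frames,) model likelihoods
    """
    # Soft frame weights: probability that each frame is target-dominated.
    w = target_likelihood / (target_likelihood + background_likelihood)
    # Pool the weighted posteriors across frames and renormalize.
    pooled = (w[:, None] * azimuth_posteriors).sum(axis=0)
    pooled /= pooled.sum()
    return int(np.argmax(pooled)), pooled
```

Frames dominated by the background source then contribute little to the final azimuth estimate, which is the intuition behind the selective weighting described above.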

    Perceptual compensation for reverberation in human listeners and machines

    This thesis explores compensation for reverberation in human listeners and machines. Late reverberation is typically understood as a distortion which degrades intelligibility. Recent research, however, shows that late reverberation is not always detrimental to human speech perception. At times, prolonged exposure to reverberation can provide a helpful acoustic context which improves identification of reverberant speech sounds. The physiology underpinning our robustness to reverberation has not yet been elucidated, but is speculated in this thesis to include efferent processes which have previously been shown to improve discrimination of noisy speech. These efferent pathways descend from higher auditory centres, effectively recalibrating the encoding of sound in the cochlea. Moreover, this thesis proposes that efferent-inspired computational models based on psychoacoustic principles may also improve performance for machine listening systems in reverberant environments. A candidate model for perceptual compensation for reverberation is proposed in which efferent suppression derives from the level of reverberation detected in the simulated auditory nerve response. The model simulates human performance in a phoneme-continuum identification task under a range of reverberant conditions, where a synthetically controlled test-word and its surrounding context phrase are independently reverberated. Addressing questions which arose from the model, a series of perceptual experiments used naturally spoken speech materials to investigate aspects of the psychoacoustic mechanism underpinning compensation. These experiments demonstrate a monaural compensation mechanism that is influenced by both the preceding context (which need not be intelligible speech) and by the test-word itself, and which depends on the time-direction of reverberation. 
Compensation was shown to act rapidly (within a second or so), indicating a monaural mechanism that is likely to be effective in everyday listening. Finally, the implications of these findings for the future development of computational models of auditory perception are considered.

    Aspects of room acoustics, vision and motion in the human auditory perception of space

    The human sense of hearing contributes to the awareness of where sound-generating objects are located in space and of the environment in which the hearing individual is located. This auditory perception of space interacts in complex ways with our other senses, can be both disrupted and enhanced by sound reflections, and includes safety mechanisms which have evolved to protect our lives, but can also mislead us. This dissertation explores some selected topics from this wide subject area, mostly by testing the abilities and subjective judgments of human listeners in virtual environments. Reverberation is the gradually decaying persistence of sounds in an enclosed space which results from repeated sound reflections at surfaces. The first experiment (Chapter 2) compared how strongly people perceived reverberation in different visual situations: when they could see the room and the source which generated the sound; when they could see some room and some sound source, but the image did not match what they heard; and when they could not see anything at all. There were no indications that the visual image had any influence on this aspect of room-acoustical perception. The potential benefits of motion for judging the distance of sound sources were the focus of the second study (Chapter 3), which consists of two parts. In the first part, loudspeakers were placed at different depths in front of sitting listeners who, on command, had to either remain still or move their upper bodies sideways. This experiment demonstrated that humans can exploit motion parallax (the effect that closer objects appear to move faster to a moving observer than farther objects) with their ears and not just with their eyes. 
The second part combined a virtualisation of such sound sources with a motion platform to show that the listeners’ interpretation of this auditory motion parallax was better when they performed this lateral movement by themselves, rather than when they were moved by the apparatus or were not actually in motion at all. Two more experiments were concerned with the perception of sounds which are perceived as becoming louder over time. These have been called “looming”, as the source of such a sound might be on a collision course. One of the studies (Chapter 4) showed that western diamondback rattlesnakes (Crotalus atrox) increase the vibration speed of their rattle in response to the approach of a threatening object. It also demonstrated that human listeners perceive (virtual) snakes which engage in this behaviour as especially close, causing them to keep a greater margin of safety than they would otherwise. The other study (section 5.6) was concerned with the well-known looming bias of the sound localisation system, a phenomenon which leads to a sometimes exaggerated, sometimes more accurate perception of approaching compared to receding sounds. It attempted to find out whether this bias is affected by whether listeners hear such sounds in a virtual enclosed space or in an environment with no sound reflections. While the results were inconclusive, this experiment is noteworthy as a proof of concept: It was the first study to make use of a new real-time room-acoustical simulation system, liveRAZR, which was developed as part of this dissertation (Chapter 5). 
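The geometry behind auditory motion parallax can be made concrete with a small illustrative calculation (not part of the study itself; the function name is hypothetical): for a source initially straight ahead, the same lateral head movement produces a larger azimuth shift for a nearer source than for a farther one.

```python
import numpy as np

def azimuth_shift(distance, lateral_move):
    """Angular shift in degrees of a source initially straight ahead at
    `distance` metres, after the listener moves `lateral_move` metres
    sideways.  Nearer sources yield larger shifts: motion parallax."""
    return float(np.degrees(np.arctan2(lateral_move, distance)))
```

A 0.2 m sideways movement shifts a source at 1 m by roughly 11 degrees, but a source at 4 m by under 3 degrees, so the relative angular motion of the two loudspeakers carries distance information.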
    Finally, while humans have been more often studied for their unique abilities to communicate with each other, and bats for their extraordinary capacity to locate objects by sound, this dissertation turns this setting of priorities on its head with the last paper (Chapter 6): based on recordings of six pale spear-nosed bats (Phyllostomus discolor), it is a survey of the identifiably distinct vocalisations observed in their social interactions, along with a description of the different situations in which they typically occur.

    A comparison of sound localisation techniques using cross-correlation and spiking neural networks for mobile robotics

    This paper outlines the development of a cross-correlation algorithm and a spiking neural network (SNN) for sound localisation based on real sound recorded in a noisy and dynamic environment by a mobile robot. The SNN architecture aims to simulate the sound localisation ability of the mammalian auditory pathways by exploiting the binaural cue of interaural time difference (ITD). The medial superior olive was the inspiration for the SNN architecture, which required the integration of an encoding layer producing biologically realistic spike trains, a model of the bushy cells found in the cochlear nucleus, and a supervised learning algorithm. The experimental results demonstrate that biologically inspired sound localisation achieved using an SNN can compare favourably to the more classical technique of cross-correlation.
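The classical cross-correlation baseline can be sketched as a generic lag-domain ITD estimator; this is an illustrative version under assumed names (`estimate_itd`), not the authors' exact implementation.

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate the interaural time difference in seconds from two ear
    signals by locating the peak of their cross-correlation.  A positive
    value means the right channel is delayed relative to the left, i.e.
    the source lies towards the listener's left."""
    n = len(left)
    # Full cross-correlation covers lags from -(n-1) to +(n-1) samples.
    corr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(corr)) - (n - 1)
    return lag / fs
```

With a known microphone spacing, the estimated ITD maps to a source azimuth, which is the quantity the SNN approach estimates directly from spike timing.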

    Studies on auditory processing of spatial sound and speech by neuromagnetic measurements and computational modeling

    This thesis addresses the auditory processing of spatial sound and speech. The thesis consists of two research branches: first, magnetoencephalographic (MEG) brain measurements on spatial localization and speech perception, and second, construction of computational auditory scene analysis models, which exploit spatial cues and other cues that are robust in reverberant environments. In the MEG research branch, we have addressed the processing of spatial stimuli in the auditory cortex through studies concentrating on the following issues: processing of sound source location with realistic spatial stimuli, spatial processing of speech vs. non-speech stimuli, and finally processing of a range of spatial location cues in the auditory cortex. Our main findings are as follows: Both auditory cortices respond more vigorously to contralaterally presented sound, whereby responses exhibit systematic tuning to the sound source direction. Responses and response dynamics are generally larger in the right hemisphere, which indicates right-hemispheric specialization in spatial processing. These observations hold over the range of speech and non-speech stimuli. The responses to speech sounds are decreased markedly if the natural periodic speech excitation is changed to a random noise sequence. Moreover, the activation strength of the right auditory cortex seems to reflect processing of spatial cues, so that the dynamical differences are larger and the angular organization is more orderly for realistic spatial stimuli compared to impoverished spatial stimuli (e.g. isolated interaural time and level difference cues). In the auditory modeling part, we constructed models for the recognition of speech in the presence of interference. Firstly, we constructed a system using binaural cues in order to segregate target speech from spatially separated interference, and showed that the system outperforms a conventional approach at low signal-to-noise ratios. 
Secondly, we constructed a single-channel system that is robust in room reverberation using strong speech modulations as robust cues, and showed that it outperforms a baseline approach in the most reverberant test conditions. In this case, the baseline approach was specifically optimized for recognition of speech in reverberation. In summary, this thesis addresses the auditory processing of spatial sound and speech in both brain measurement and auditory modeling. The studies aim to clarify cortical processes of sound localization, and to construct computational auditory models for sound segregation that exploit spatial cues, and strong speech modulations as robust cues in reverberation.

    Calibration of sound source localisation for robots using multiple adaptive filter models of the cerebellum

    The aim of this research was to investigate the calibration of Sound Source Localisation (SSL) for robots using the adaptive filter model of the cerebellum, and how this could be automatically adapted for multiple acoustic environments. The role of the cerebellum has mainly been identified in the context of motor control, and only in recent years has it been recognised that it has a wider role to play in the senses and cognition. The adaptive filter model of the cerebellum has been successfully applied to a number of robotics applications, but so far none involving the auditory sense. Multiple-model frameworks such as MOdular Selection And Identification for Control (MOSAIC) have also been developed in the context of motor control, and this has been the inspiration for adapting audio calibration to multiple acoustic environments; again, application of this approach to the auditory sense is completely new. The thesis showed that it was possible to calibrate the output of an SSL algorithm using the adaptive filter model of the cerebellum, improving performance compared to the uncalibrated SSL. Using an adaptation of the MOSAIC framework, and specifically using responsibility estimation, a system was developed that was able to select an appropriate set of cerebellar calibration models and to combine their outputs in proportion to how well each was able to calibrate, improving the SSL estimate in multiple acoustic contexts, including novel contexts. The thesis also developed a responsibility predictor, also part of the MOSAIC framework, and this improved the robustness of the system to abrupt changes in context which could otherwise have resulted in a large performance error. Responsibility prediction also improved robustness to missing ground truth, which could occur in challenging environments where sensory feedback of ground truth may become impaired; this issue has not been addressed in the MOSAIC literature, adding to the novelty of the thesis. 
The utility of the so-called cerebellar chip has been further demonstrated through the development of a responsibility predictor that is based on the adaptive filter model of the cerebellum, rather than the more conventional function-fitting neural network used in the literature. Lastly, it was demonstrated that the multiple cerebellar calibration architecture is capable of limited self-organisation from a de novo state, with a predetermined number of models. It was also demonstrated that the responsibility predictor could learn against its model after self-organisation, and to a limited extent, during self-organisation. The thesis addresses an important question of how a robot could improve its ability to listen in multiple, challenging acoustic environments, and recommends future work to develop this ability.
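The responsibility-estimation step borrowed from MOSAIC can be sketched as a softmax over recent prediction errors; this is a simplified single-step illustration under assumed function names (`responsibilities`, `combined_estimate`), not the thesis's cerebellar implementation.

```python
import numpy as np

def responsibilities(pred_errors, sigma=1.0):
    """MOSAIC-style soft responsibilities: models with smaller recent
    prediction errors receive exponentially larger weights."""
    e = np.asarray(pred_errors, dtype=float)
    scores = -0.5 * (e / sigma) ** 2
    scores -= scores.max()          # subtract max for numerical stability
    r = np.exp(scores)
    return r / r.sum()

def combined_estimate(model_outputs, pred_errors, sigma=1.0):
    """Blend the calibration models' outputs in proportion to their
    responsibilities, as in the multiple-model selection scheme."""
    r = responsibilities(pred_errors, sigma)
    return float(np.dot(r, model_outputs))
```

A model that currently predicts the acoustic context well dominates the blend, while poorly matching models are softly suppressed rather than switched off, which is what allows graceful behaviour in novel contexts.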

    CORTICAL REPRESENTATION OF SPEECH IN COMPLEX AUDITORY ENVIRONMENTS AND APPLICATIONS

    Being able to attend to and recognize speech or a particular sound in complex listening environments is a feat performed by humans effortlessly. The underlying neural mechanisms, however, remain unclear and cannot yet be emulated by artificial systems. Understanding the internal (cortical) representation of the external acoustic world is a key step in deciphering the mechanisms of human auditory processing. Further, understanding the neural representation of sound finds numerous applications in clinical research on psychiatric disorders with auditory processing deficits, such as schizophrenia. In the first part of this dissertation, cortical activity from normal-hearing human subjects is recorded, non-invasively, using magnetoencephalography in two different real-life listening scenarios: first, when natural speech is distorted by reverberation as well as stationary additive noise; second, when the attended speech is degraded by the presence of multiple additional talkers in the background, simulating a cocktail party. Using natural speech affected by reverberation and noise, it was demonstrated that the auditory cortex maintains both distorted as well as distortion-free representations of speech. Additionally, we show that, while the neural representation of speech remained robust to additive noise in the absence of reverberation, noise had a detrimental effect in the presence of reverberation, suggesting differential mechanisms of speech processing for additive and reverberation distortions. In the cocktail party paradigm, we demonstrated that primary-like areas represent the external auditory world in terms of acoustics, whereas higher-order areas maintain an object-based representation. Further, it was demonstrated that background speech streams were represented as an unsegregated auditory object. The results suggest that an object-based representation of the auditory scene emerges in higher-order auditory cortices. 
In the second part of this dissertation, using electroencephalographic recordings from normal human subjects and patients suffering from schizophrenia, it was demonstrated, for the first time, that delta band steady-state responses are more affected in schizophrenia patients compared with healthy individuals, contrary to the prevailing dominance of gamma band studies in the literature. Further, the results from this study suggest that the inadequate ability to sustain neural responses in this low frequency range may play a vital role in auditory perceptual and cognitive deficit mechanisms in schizophrenia. Overall this dissertation furthers current understanding of the cortical representation of speech in complex listening environments and of how the auditory representation of sounds is affected in psychiatric disorders involving aberrant auditory processing.

    Sonic Interactions in Virtual Environments

    This open access book tackles the design of 3D spatial interactions from an audio-centered and audio-first perspective, providing the fundamental notions related to the creation and evaluation of immersive sonic experiences. The key elements that enhance the sensation of place in a virtual environment (VE) are:
    • Immersive audio: the computational aspects of the acoustical-space properties of Virtual Reality (VR) technologies
    • Sonic interaction: the human-computer interplay through auditory feedback in VEs
    • VR systems: natural support for multimodal integration, impacting different application domains
    Sonic Interactions in Virtual Environments will feature state-of-the-art research on real-time auralization, sonic interaction design in VR, quality of the experience in multimodal scenarios, and applications. Contributors and editors include interdisciplinary experts from the fields of computer science, engineering, acoustics, psychology, design, humanities, and beyond. Their mission is to shape an emerging new field of study at the intersection of sonic interaction design and immersive media, embracing an archipelago of existing research spread across different audio communities, and to increase awareness among VR communities, researchers, and practitioners of the importance of sonic elements when designing immersive environments.