
    Binaural scene analysis: localization, detection and recognition of speakers in complex acoustic scenes

    The human auditory system has the striking ability to robustly localize and recognize a specific target source in complex acoustic environments while ignoring interfering sources. Surprisingly, this remarkable capability, which is referred to as auditory scene analysis, is achieved by analyzing only the waveforms reaching the two ears. Computers, however, are presently not able to compete with the performance achieved by the human auditory system, even in the restricted paradigm of confronting a computer algorithm based on binaural signals with a highly constrained version of auditory scene analysis, such as localizing a sound source in a reverberant environment or recognizing a speaker in the presence of interfering noise. In particular, the problem of focusing on an individual speech source in the presence of competing speakers, termed the cocktail party problem, has proven to be extremely challenging for computer algorithms. The primary objective of this thesis is the development of a binaural scene analyzer that is able to jointly localize, detect and recognize multiple speech sources in the presence of reverberation and interfering noise. The processing of the proposed system is divided into three main stages: the localization stage, the detection of speech sources, and the recognition of speaker identities. The only information that is assumed to be known a priori is the number of target speech sources present in the acoustic mixture. Furthermore, the aim of this work is to reduce the performance gap between humans and machines by improving the performance of the individual building blocks of the binaural scene analyzer. First, a binaural front-end inspired by auditory processing is designed to robustly determine the azimuth of multiple, simultaneously active sound sources in the presence of reverberation. The localization model builds on the supervised learning of azimuth-dependent binaural cues, namely interaural time and level differences. Multi-conditional training is performed to incorporate the uncertainty of these binaural cues resulting from reverberation and the presence of competing sound sources. Second, a speech detection module that exploits the distinct spectral characteristics of speech and noise signals is developed to automatically select azimuthal positions that are likely to correspond to speech sources. Due to the established link between the localization stage and the recognition stage, which is realized by the speech detection module, the proposed binaural scene analyzer is able to selectively focus on a predefined number of speech sources that are positioned at unknown spatial locations, while ignoring interfering noise sources emerging from other spatial directions. Third, the speaker identities of all detected speech sources are recognized in the final stage of the model. To reduce the impact of environmental noise on speaker recognition performance, a missing data classifier is combined with the adaptation of speaker models using a universal background model. This combination is particularly beneficial in non-stationary background noise.
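    The localization front-end described above builds on interaural time and level differences (ITDs and ILDs) extracted from the two ear signals. As a minimal, hedged illustration of how such binaural cues can be computed (the thesis' actual front-end uses an auditory filterbank and probabilistic multi-conditional training, which is omitted here), the sketch below estimates a per-frame ITD via cross-correlation and a broadband ILD; all function and parameter names are illustrative.

```python
import numpy as np
from scipy.signal import correlate

def binaural_cues(left, right, fs, frame_len=1024, hop=512, max_itd=1e-3):
    """Estimate per-frame ITD (cross-correlation lag) and ILD (level ratio in dB)."""
    max_lag = int(max_itd * fs)
    itds, ilds = [], []
    for start in range(0, min(len(left), len(right)) - frame_len, hop):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        # ITD: lag of the cross-correlation maximum, restricted to +/- 1 ms
        xc = correlate(l, r, mode="full")
        lags = np.arange(-frame_len + 1, frame_len)
        sel = np.abs(lags) <= max_lag
        itds.append(lags[sel][np.argmax(xc[sel])] / fs)
        # ILD: broadband energy ratio between the two ears
        ilds.append(10 * np.log10((np.sum(l**2) + 1e-12) / (np.sum(r**2) + 1e-12)))
    return np.array(itds), np.array(ilds)
```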

    Informed algorithms for sound source separation in enclosed reverberant environments

    While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis thus aims at improving the performance of audio separation algorithms when they are informed, i.e. when they have access to source location information. These locations are assumed to be known a priori in this work, for example from video processing. Initially, a multi-microphone array based method combined with binary time-frequency masking is proposed. A robust least-squares frequency-invariant data-independent beamformer designed with the location information is utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking based post-processing is used, but cepstral domain smoothing is required to mitigate musical noise. To tackle the under-determined case and further improve separation performance at higher reverberation times, a two-microphone based method is described which is inspired by human auditory processing and generates soft time-frequency masks. In this approach, the interaural level difference, the interaural phase difference and mixing vectors are probabilistically modeled in the time-frequency domain and the model parameters are learned through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, and is used as the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic model framework; it encodes the spatial characteristics of the enclosure and further improves separation performance in challenging scenarios, i.e. when sources are in close proximity and when the level of reverberation is high. Finally, a new dereverberation-based pre-processing scheme is proposed, based on a cascade of three dereverberation stages, each of which enhances the two-microphone reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation-based pre-processing and soft mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed, for example, from speech signals from the TIMIT database and measured room impulse responses.
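    The pre-processing described above estimates and subtracts late reverberation from the amplitude spectrum. The sketch below shows one common way such a stage can be realized for a single channel, using a delayed and exponentially attenuated copy of the short-time power spectrum as the late-reverberation estimate (a Lebart-style statistical model); the thesis' exact three-stage cascade is not reproduced, and all parameter values are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs, t60=0.5, delay_s=0.05, floor=0.1):
    """Amplitude spectral subtraction of an estimated late-reverberation spectrum (sketch)."""
    f, t, X = stft(x, fs, nperseg=512, noverlap=256)
    hop = 256 / fs
    d = max(1, int(round(delay_s / hop)))        # frame delay marking the start of "late" reverb
    decay = np.exp(-2 * 6.9 * delay_s / t60)     # exponential energy decay over that delay
    mag2 = np.abs(X) ** 2
    late = np.zeros_like(mag2)
    late[:, d:] = decay * mag2[:, :-d]           # delayed, attenuated power as late-reverb estimate
    gain = np.sqrt(np.maximum(1 - late / (mag2 + 1e-12), floor))
    _, y = istft(gain * X, fs, nperseg=512, noverlap=256)
    return y
```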

    Auditory Displays and Assistive Technologies: the use of head movements by visually impaired individuals and their implementation in binaural interfaces

    Visually impaired people rely upon audition for a variety of purposes, among them the use of sound to identify the position of objects in their surrounding environment. This is not limited to localising sound-emitting objects, but extends to obstacles and environmental boundaries, thanks to their ability to extract information from reverberation and sound reflections, all of which can contribute to effective and safe navigation, as well as serving a function in certain assistive technologies thanks to the advent of binaural auditory virtual reality. It is known that head movements in the presence of sound elicit changes in the acoustical signals which arrive at each ear, and these changes can mitigate common auditory localisation problems in headphone-based auditory virtual reality, such as front-to-back reversals. The goal of the work presented here is to investigate whether the visually impaired naturally engage head movement to facilitate auditory perception, and to what extent this may be applicable to the design of virtual auditory assistive technology. Three novel experiments are presented: a field study of head movement behaviour during navigation, a questionnaire assessing the self-reported use of head movement in auditory perception by visually impaired individuals (each comparing visually impaired and sighted participants), and an acoustical analysis of inter-aural differences and cross-correlations as a function of head angle and sound source distance. It is found that visually impaired people self-report using head movement for auditory distance perception. This is supported by the head movements observed during the field study, whilst the acoustical analysis showed that interaural correlations for sound sources within 5 m of the listener were reduced as head angle or distance to the sound source increased, and that interaural differences and correlations in reflected sound were generally lower than those of direct sound. Subsequently, relevant guidelines for designers of assistive auditory virtual reality are proposed.
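    The acoustical analysis above measures how interaural correlation varies with head angle and source distance. A minimal sketch of the underlying quantity, the interaural cross-correlation coefficient taken over a ±1 ms lag range, is given below; the exact windowing and measurement conditions of the study are not reproduced, and the function name is illustrative.

```python
import numpy as np

def iacc(left, right, fs, max_lag_ms=1.0):
    """Interaural cross-correlation coefficient: peak of the normalized
    cross-correlation between the ear signals within +/- 1 ms lag."""
    n = min(len(left), len(right))
    l, r = left[:n], right[:n]
    max_lag = int(max_lag_ms * 1e-3 * fs)
    denom = np.sqrt(np.sum(l**2) * np.sum(r**2)) + 1e-12
    xc = np.correlate(l, r, mode="full") / denom
    centre = n - 1                         # zero-lag index of the full correlation
    return np.max(xc[centre - max_lag:centre + max_lag + 1])
```

    Evaluating this coefficient for ear signals recorded (or simulated) at different head angles and source distances reproduces the kind of analysis described above.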

    The effects of monaural and binaural cues on perceived reverberation by normal hearing and hearing-impaired listeners.

    This dissertation is a quantitative and qualitative examination of how young normal hearing and young hearing-impaired listeners perceive reverberation. A primary complaint among hearing-impaired listeners is difficulty understanding speech in noisy or reverberant environments. This work was motivated by a desire to better understand reverberation perception and processing so that this knowledge might be used to improve outcomes for hearing-impaired listeners in these environments. This dissertation is written in six chapters. Chapter One is an introduction to the field and a review of the relevant literature. Chapter Two describes a motivating experiment from laboratory work completed before the dissertation. This experiment asked human subjects to rate the amount of reverberation they perceived in a sound relative to another sound. This experiment showed a significant effect of listening condition on how listeners made their judgments. Chapter Three follows up on this experiment, seeking a better understanding of how listeners perform the task in Chapter Two. Chapter Three shows that listeners can use limited information to make their judgments. Chapter Four compares reverberation perception in normal hearing and hearing-impaired listeners and examines the effect of speech intelligibility on reverberation perception. This experiment finds no significant differences between the cues used by normal hearing and hearing-impaired listeners when judging perceptual aspects of reverberation. Chapter Five describes and uses a quantitative model to examine the results of Chapters Two and Four. Chapter Six summarizes the data presented in the dissertation and discusses potential implications and future directions. This work finds that the perceived amount of reverberation relies primarily on two factors: 1) the listening condition (i.e., binaural, monaural, or a listening condition in which reverberation is present in only one ear) and 2) the sum of reverberant energy present at the two ears. Listeners do not need the reverberant tail to estimate the perceived amount of reverberation, meaning that listeners are able to extract information about reverberation from the ongoing signal. The precise mechanism underlying this process is not explicitly found in this work; however, a potential framework is presented in Chapter Six.
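    One of the two factors identified above is the sum of reverberant energy present at the two ears. The snippet below is an illustrative sketch (not the dissertation's actual analysis) of how that quantity can be computed from a binaural room impulse response by splitting each ear's response into direct and reverberant parts around the main peak; the window length and names are assumptions.

```python
import numpy as np

def summed_reverberant_energy(brir_left, brir_right, fs, direct_ms=2.5):
    """Sum the reverberant (non-direct) energy of a binaural RIR across both ears."""
    win = int(direct_ms * 1e-3 * fs)           # samples around the direct-sound peak
    total = 0.0
    for h in (brir_left, brir_right):
        peak = int(np.argmax(np.abs(h)))
        reverberant = np.concatenate([h[:max(peak - win, 0)], h[peak + win:]])
        total += float(np.sum(reverberant ** 2))
    return total
```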

    Sensitivity to Angular and Radial Source Movements as a Function of Acoustic Complexity in Normal and Impaired Hearing

    In contrast to static sounds, spatially dynamic sounds have received little attention in psychoacoustic research so far. This holds true especially for acoustically complex (reverberant, multi-source) conditions and impaired hearing. The current study therefore investigated the influence of reverberation and the number of concurrent sound sources on source movement detection in young normal-hearing (YNH) and elderly hearing-impaired (EHI) listeners. A listening environment based on natural environmental sounds was simulated using virtual acoustics and rendered over headphones. Both near-far (‘radial’) and left-right (‘angular’) movements of a frontal target source were considered. The acoustic complexity was varied by adding static lateral distractor sound sources as well as reverberation. Acoustic analyses confirmed the expected changes in the stimulus features that are thought to underlie radial and angular source movements under anechoic conditions and suggested a special role of monaural spectral changes under reverberant conditions. Analyses of the detection thresholds showed that, with the exception of the single-source scenarios, the EHI group was less sensitive to source movements than the YNH group, despite adequate stimulus audibility. Adding static sound sources clearly impaired the detectability of angular source movements for the EHI (but not the YNH) group. Reverberation, on the other hand, clearly impaired radial source movement detection for the EHI (but not the YNH) listeners. These results illustrate the feasibility of studying factors related to auditory movement perception with the help of the developed test setup.

    Objective and Subjective Evaluation of Dereverberation Algorithms

    Reverberation significantly impacts the quality and intelligibility of speech. Several dereverberation algorithms have been proposed in the literature to combat this problem. A majority of these algorithms utilize a single channel and are developed for monaural applications, and as such do not preserve the cues necessary for sound localization. This thesis describes a blind two-channel dereverberation technique that improves the quality of speech corrupted by reverberation while preserving the cues that affect localization. The method is based on combining short-term (2 ms) and long-term (20 ms) weighting functions of the linear prediction (LP) residual of the input signal. The developed algorithm and other dereverberation algorithms are evaluated objectively and subjectively in terms of sound quality and localization accuracy. The binaural adaptation provides a significant increase in sound quality while removing the loss in localization ability found in the bilateral implementation.
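    The method above weights the linear prediction residual with combined short-term (2 ms) and long-term (20 ms) functions. The following is a minimal single-channel sketch of that general idea, computing the LP residual and a smoothed-envelope weighting; the thesis' precise weighting functions and its binaural adaptation are not reproduced, and the parameters and names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter
from scipy.ndimage import uniform_filter1d

def weighted_lp_residual(x, fs, order=12, short_ms=2.0, long_ms=20.0):
    """LP residual weighted by a combined short/long-term smoothed envelope (sketch)."""
    # LP coefficients from the autocorrelation (Toeplitz normal equations)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], x)
    # Short-term (~2 ms) and long-term (~20 ms) envelopes of the residual
    env = np.abs(residual)
    short = uniform_filter1d(env, max(1, int(short_ms * 1e-3 * fs)))
    long_ = uniform_filter1d(env, max(1, int(long_ms * 1e-3 * fs)))
    # Emphasise segments dominated by direct speech relative to reverberant tails
    weight = short / (long_ + 1e-12)
    weight = np.clip(weight / (np.max(weight) + 1e-12), 0.0, 1.0)
    return residual * weight
```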

    Computational models for listener-specific predictions of spatial audio quality

    Millions of people use headphones every day for listening to music, watching movies, or communicating with others. Nevertheless, sounds presented via headphones are usually perceived inside the head instead of being localized at a naturally external position. Besides externalization and localization, spatial hearing also involves perceptual attributes like apparent source width, listener envelopment, and the ability to segregate sounds. The acoustic basis for spatial hearing is described by the listener-specific head-related transfer functions (HRTFs; Møller et al., 1995). Binaural virtual acoustics based on listener-specific HRTFs can create sounds presented via headphones that are indistinguishable from natural sounds (Langendijk and Bronkhorst, 2000). In this talk, we will focus on the dimensions of sound localization that are particularly sensitive to listener-specific HRTFs, that is, localization along sagittal planes (i.e., vertical planes orthogonal to the interaural axis) and at near distances (sound externalization/internalization). We will discuss recent findings from binaural virtual acoustics and models aiming at predicting sound externalization (Hassager et al., 2016) and localization in sagittal planes (Baumgartner et al., 2014) considering the listener's HRTFs. Sagittal-plane localization seems to be well understood, and its model can already reliably predict localization performance in many listening situations (e.g., Marelli et al., 2015; Baumgartner and Majdak, 2015). In contrast, more investigation is required in order to better understand and create a valid model of sound externalization (Baumgartner et al., 2017). We aim to shed light on the diversity of cues causing degraded sound externalization under spectral distortions by conducting a model-based meta-analysis of psychoacoustic studies. As potential cues we consider monaural and interaural spectral shapes, spectral and temporal fluctuations of interaural level differences, interaural coherences, and broadband inconsistencies between interaural time and level differences, all within a highly comparable template-based modeling framework. Mere differences in sound pressure level between target and reference stimuli were used as a control cue. Our investigations revealed that the monaural spectral shapes and the strengths of time-intensity trading are potent cues to explain previous results under anechoic conditions. However, future experiments will be required to unveil the actual essence of these cues.

    References:
    Baumgartner, R., and Majdak, P. (2015). "Modeling Localization of Amplitude-Panned Virtual Sources in Sagittal Planes," Journal of the Audio Engineering Society 63, 562–569.
    Baumgartner, R., Majdak, P., and Laback, B. (2014). "Modeling sound-source localization in sagittal planes for human listeners," The Journal of the Acoustical Society of America 136, 791–802.
    Baumgartner, R., Reed, D. K., Tóth, B., Best, V., Majdak, P., Colburn, H. S., and Shinn-Cunningham, B. (2017). "Asymmetries in behavioral and neural responses to spectral cues demonstrate the generality of auditory looming bias," Proceedings of the National Academy of Sciences 114, 9743–9748.
    Hassager, H. G., Gran, F., and Dau, T. (2016). "The role of spectral detail in the binaural transfer function on perceived externalization in a reverberant environment," The Journal of the Acoustical Society of America 139, 2992–3000.
    Langendijk, E. H., and Bronkhorst, A. W. (2000). "Fidelity of three-dimensional-sound reproduction using a virtual auditory display," The Journal of the Acoustical Society of America 107, 528–537.
    Marelli, D., Baumgartner, R., and Majdak, P. (2015). "Efficient Approximation of Head-Related Transfer Functions in Subbands for Accurate Sound Localization," IEEE Transactions on Audio, Speech, and Language Processing 23, 1130–1143.
    Møller, H., Sørensen, M. F., Hammershøi, D., and Jensen, C. B. (1995). "Head-related transfer functions of human subjects," Journal of the Audio Engineering Society 43, 300–321.
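    The template-based modeling framework mentioned above compares a target ear signal against the listener's HRTF template using cues such as the monaural spectral shape. Below is a minimal sketch of one such cue, a level-compensated RMS spectral-shape deviation in dB; this is an illustrative distance metric, not the published models' exact formulation, and all names are assumptions.

```python
import numpy as np

def spectral_shape_error(target_mag, template_mag, eps=1e-12):
    """RMS deviation (dB) between a target spectrum and an HRTF template,
    after removing the overall level difference (shape-only comparison)."""
    t = 20 * np.log10(np.abs(target_mag) + eps)
    r = 20 * np.log10(np.abs(template_mag) + eps)
    diff = (t - r) - np.mean(t - r)   # discard broadband gain, keep spectral-shape deviation
    return np.sqrt(np.mean(diff ** 2))
```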

    Sensory Communication

    Contains table of contents for Section 2, an introduction, reports on nine research projects, and a list of publications.

    National Institutes of Health Grant 5 R01 DC00117
    National Institutes of Health Grant 2 R01 DC00270
    National Institutes of Health Grant 1 P01 DC00361
    National Institutes of Health Grant 2 R01 DC00100
    National Institutes of Health Grant FV00428
    National Institutes of Health Grant 5 R01 DC00126
    U.S. Air Force - Office of Scientific Research Grant AFOSR 90-200
    U.S. Navy - Office of Naval Research Grant N00014-90-J-1935
    National Institutes of Health Grant 5 R29 DC0062

    Decoding neural responses to temporal cues for sound localization

    The activity of sensory neural populations carries information about the environment. This may be extracted from neural activity using different strategies. In the auditory brainstem, a recent theory proposes that sound location in the horizontal plane is decoded from the relative summed activity of two populations in each hemisphere, whereas earlier theories hypothesized that the location was decoded from the identity of the most active cells. We tested the performance of various decoders of neural responses in increasingly complex acoustical situations, including spectrum variations, noise, and sound diffraction. We demonstrate that there is insufficient information in the pooled activity of each hemisphere to estimate sound direction in a reliable way consistent with behavior, whereas robust estimates can be obtained from neural activity by taking into account the heterogeneous tuning of cells. These estimates can still be obtained when only contralateral neural responses are used, consistent with unilateral lesion studies. DOI: http://dx.doi.org/10.7554/eLife.01312.001
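    The two decoding strategies contrasted above, a summed-activity (hemispheric) code versus a decoder that exploits the heterogeneous tuning of individual cells, can be illustrated on simulated tuning curves. The sketch below is purely illustrative: the tuning curves, noise level, and decoder details are assumptions for demonstration, not the paper's fitted models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated population with heterogeneous preferred directions (illustrative, not fitted data)
n_cells, n_dirs = 200, 181
directions = np.linspace(-90, 90, n_dirs)            # azimuth in degrees
best = rng.uniform(-90, 90, n_cells)                  # heterogeneous best directions
width = rng.uniform(30, 60, n_cells)
tuning = np.exp(-0.5 * ((directions[None, :] - best[:, None]) / width[:, None]) ** 2)

def hemispheric_decoder(rates):
    """Summed-activity code: compare pooled firing of left- vs right-preferring cells."""
    left = rates[best < 0].sum()
    right = rates[best >= 0].sum()
    return 90 * (right - left) / (right + left + 1e-12)   # map normalized difference to azimuth

def pattern_decoder(rates):
    """Heterogeneous-tuning code: pick the direction whose expected pattern best matches."""
    return directions[np.argmin(np.sum((tuning - rates[:, None]) ** 2, axis=0))]

true_dir = 40.0
rates = tuning[:, np.argmin(np.abs(directions - true_dir))] + 0.1 * rng.standard_normal(n_cells)
print(hemispheric_decoder(rates), pattern_decoder(rates))
```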