Acoustic Space Learning for Sound Source Separation and Localization on Binaural Manifolds
In this paper we address the problems of modeling the acoustic space
generated by a full-spectrum sound source and of using the learned model for
the localization and separation of multiple sources that simultaneously emit
sparse-spectrum sounds. We lay theoretical and methodological grounds in order
to introduce the binaural manifold paradigm. We perform an in-depth study of
the latent low-dimensional structure of the high-dimensional interaural
spectral data, based on a corpus recorded with a human-like audiomotor robot
head. A non-linear dimensionality reduction technique is used to show that
these data lie on a two-dimensional (2D) smooth manifold parameterized by the
motor states of the listener, or equivalently, the sound source directions. We
propose a probabilistic piecewise affine mapping model (PPAM) specifically
designed to deal with high-dimensional data exhibiting an intrinsic piecewise
linear structure. We derive a closed-form expectation-maximization (EM)
procedure for estimating the model parameters, followed by Bayes inversion for
obtaining the full posterior density function of a sound source direction. We
extend this solution to deal with missing data and redundancy in real world
spectrograms, and hence for 2D localization of natural sound sources such as
speech. We further generalize the model to the challenging case of multiple
sound sources and we propose a variational EM framework. The associated
algorithm, referred to as variational EM for source separation and localization
(VESSL) yields a Bayesian estimation of the 2D locations and time-frequency
masks of all the sources. Comparisons of the proposed approach with several
existing methods reveal that the combination of acoustic-space learning with
Bayesian inference enables our method to outperform state-of-the-art methods.
Comment: 19 pages, 9 figures, 3 tables
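The abstract above describes recovering a two-dimensional smooth manifold from high-dimensional interaural spectral data via non-linear dimensionality reduction. As a hedged illustration only (not the authors' pipeline: the data generator, feature dimension, and Isomap choice are all assumptions for the demo), a generic manifold-learning method can recover a 2D embedding from synthetic high-dimensional features parameterized by two latent angles:

```python
# Illustrative sketch: recovering a 2-D latent structure from
# high-dimensional "interaural-like" features with a non-linear
# dimensionality reduction method. All names and values are invented.
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
n_points, n_freq = 500, 64            # hypothetical number of frequency bins

# Two latent parameters play the role of source azimuth/elevation.
azimuth = rng.uniform(-np.pi / 2, np.pi / 2, n_points)
elevation = rng.uniform(-np.pi / 4, np.pi / 4, n_points)

# Smooth non-linear lift of the 2-D parameters into 64-D feature space,
# standing in for interaural spectral cues.
freqs = np.linspace(1, 8, n_freq)
X = (np.sin(np.outer(azimuth, freqs)) +
     0.5 * np.cos(np.outer(elevation, freqs)))
X += 0.01 * rng.standard_normal(X.shape)   # small measurement noise

# A neighborhood-graph method should recover a 2-D embedding from the
# 64-D observations, mirroring the paper's manifold finding.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (500, 2)
```

The point of the sketch is only that a smooth 2-D parameterization survives a non-linear lift into high dimension and can be found without supervision.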
Efficient coding of spectrotemporal binaural sounds leads to emergence of the auditory space representation
To date a number of studies have shown that receptive field shapes of early
sensory neurons can be reproduced by optimizing coding efficiency of natural
stimulus ensembles. A still unresolved question is whether the efficient coding
hypothesis explains formation of neurons which explicitly represent
environmental features of different functional importance. This paper proposes
that the spatial selectivity of higher auditory neurons emerges as a direct
consequence of learning efficient codes for natural binaural sounds. Firstly,
it is demonstrated that a linear efficient coding transform - Independent
Component Analysis (ICA) trained on spectrograms of naturalistic simulated
binaural sounds extracts spatial information present in the signal. A simple
hierarchical ICA extension allowing for decoding of sound position is proposed.
Furthermore, it is shown that units revealing spatial selectivity can be
learned from a binaural recording of a natural auditory scene. In both cases a
relatively small subpopulation of learned spectrogram features suffices to
perform accurate sound localization. Representation of the auditory space is
therefore learned in a purely unsupervised way by maximizing the coding
efficiency and without any task-specific constraints. This results imply that
efficient coding is a useful strategy for learning structures which allow for
making behaviorally vital inferences about the environment.
Comment: 22 pages, 9 figures
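The abstract above claims that a linear efficient-coding transform (ICA) trained on binaural signals extracts spatial information. As a hedged, minimal sketch (not the paper's simulation: the mixing matrix, source statistics, and sample counts are invented), ICA applied to a two-channel mixture recovers components whose per-channel mixing weights differ between the ears, i.e. carry a crude spatial cue:

```python
# Minimal sketch: ICA on a two-channel "binaural" mixture. Each source
# reaches the left/right channel with a different gain, a crude stand-in
# for interaural level cues; ICA recovers the per-channel weights.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
n_samples = 2000

# Two super-Gaussian (Laplacian) sources, standing in for sound sources.
s = rng.laplace(size=(n_samples, 2))

# Channel-dependent mixing matrix (rows = left/right channel gains).
A = np.array([[1.0, 0.3],
              [0.4, 1.0]])
x = s @ A.T                           # observed two-channel mixture

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)          # recovered source activations
print(ica.mixing_.shape)              # (2, 2): per-channel weight per unit
```

In the paper's setting the learned spectrogram features play the role of these units, and their channel asymmetry is what permits decoding of sound position.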
Enhanced IVA for audio separation in highly reverberant environments
Blind Audio Source Separation (BASS), inspired by the "cocktail-party problem", has been a leading research application for blind source separation (BSS). This thesis concerns the enhancement of frequency domain convolutive blind source separation (FDCBSS) techniques for audio separation in highly reverberant room environments.
Independent component analysis (ICA) is a higher order statistics (HOS) approach commonly used in the BSS framework. When applied to audio FDCBSS, ICA-based methods suffer from the permutation problem across the frequency bins of each source. Independent vector analysis (IVA) is a frequency-domain BSS algorithm that theoretically solves the permutation problem by using a multivariate source prior, where the sources are considered to be random vectors. The algorithm enforces independence between multivariate source signals while retaining dependency between the components within each source vector. The source prior adopted to model the nonlinear dependency structure within the source vectors is crucial to the separation performance of the IVA algorithm. The focus of this thesis is on improving the separation performance of the IVA algorithm in the application of BASS.
An alternative multivariate Student's t distribution is proposed as the source prior for the batch IVA algorithm. A Student's t probability density function can better model certain frequency domain speech signals due to its tail dependency property. Then, the nonlinear score function, for the IVA, is derived from the proposed source prior.
A novel energy driven mixed super Gaussian and Student's t source prior is proposed for the IVA and FastIVA algorithms. The Student's t distribution, in the mixed source prior, can model the high amplitude data points whereas the super Gaussian distribution can model the lower amplitude information in the speech signals. The ratio of both distributions can be adjusted according to the energy of the observed mixtures to adapt for different types of speech signals.
A particular multivariate generalized Gaussian distribution is adopted as the source prior for the online IVA algorithm. The nonlinear score function derived from this proposed source prior contains fourth order relationships between different frequency bins, which provides a more informative and stronger dependency structure and thereby improves the separation performance.
An adaptive learning scheme is developed to improve the performance of the online IVA algorithm. The scheme adjusts the learning rate as a function of proximity to the target solutions. The scheme is also accompanied with a novel switched source prior technique taking the best performance properties of the super Gaussian source prior and the generalized Gaussian source prior as the algorithm converges.
The methods and techniques proposed in this thesis are evaluated with real speech source signals in different simulated and real reverberant acoustic environments. A variety of measures are used within the evaluation criteria of the various algorithms. The experimental results demonstrate improved performance of the proposed methods and their robustness in a wide range of situations.
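The priors discussed above enter the IVA update only through their nonlinear score functions. As a hedged numerical illustration (identity scale matrix assumed, normalization constants dropped since only the gradient of the log-density matters, and all variable names invented), the score functions for a spherical super-Gaussian prior and a multivariate Student's t prior can be sketched as:

```python
# Hedged sketch of multivariate score functions used in IVA-style
# updates. s holds one source vector per time frame: K frequency bins
# by T frames. Not the thesis' exact derivations; identity scale assumed.
import numpy as np

def score_super_gaussian(s):
    """phi_k(s) = s_k / ||s||, from the prior p(s) ∝ exp(-||s||)."""
    return s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-12)

def score_student_t(s, nu=4.0):
    """phi_k(s) = (nu + K) * s_k / (nu + ||s||^2) for a K-variate
    Student's t prior p(s) ∝ (1 + ||s||^2 / nu)^(-(nu + K) / 2)."""
    K = s.shape[0]
    return (nu + K) * s / (nu + np.sum(np.abs(s) ** 2, axis=0, keepdims=True))

# Toy data: K = 8 frequency bins, T = 5 frames of one source vector.
s = np.random.default_rng(2).standard_normal((8, 5))
phi_sg = score_super_gaussian(s)
phi_t = score_student_t(s)
print(phi_sg.shape, phi_t.shape)  # (8, 5) (8, 5)
```

Both scores couple all frequency bins of a source through the norm in the denominator, which is exactly the cross-bin dependency the thesis argues resolves the permutation problem; the Student's t version decays more slowly for large-amplitude vectors, reflecting its heavier tails.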
Adjustment of interaural-time-difference analysis to sound level
To localize low-frequency sound sources in azimuth, the binaural system compares the timing of sound waves at the two ears with microsecond precision. A similarly high precision is also seen in the binaural processing of the envelopes of high-frequency complex sounds. Both for low- and high-frequency sounds, interaural time difference (ITD) acuity is to a large extent independent of sound level. The mechanisms underlying this level-invariant extraction of ITDs by the binaural system are, however, only poorly understood. We use high-frequency pip trains with asymmetric and dichotic pip envelopes in a combined psychophysical, electrophysiological, and modeling approach. Although the dichotic envelopes cannot be physically matched in terms of ITD, the match produced perceptually by humans is very reliable, and it depends systematically on the overall sound level. These data are reflected in neural responses from the gerbil lateral superior olive and lateral lemniscus. The results are predicted in an existing temporal-integration model extended with a level-dependent threshold criterion. These data provide a very sensitive quantification of how the peripheral temporal code is conditioned for binaural analysis
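The abstract above concerns how the binaural system extracts interaural time differences with microsecond precision. As a hedged, purely illustrative sketch (not the study's psychophysical procedure: sample rate, delay, and signal are invented), the basic ITD computation itself is a lag search over the cross-correlation of the two ear signals:

```python
# Illustrative ITD estimate via cross-correlation of two ear signals.
# A circularly shifted noise burst stands in for the right-ear signal.
import numpy as np

fs = 48_000                        # assumed sample rate (Hz)
true_itd_samples = 12              # 250 microseconds at 48 kHz
rng = np.random.default_rng(3)

left = rng.standard_normal(4096)
right = np.roll(left, true_itd_samples)    # right ear lags the left

# Cross-correlate over a window of candidate lags and pick the maximum.
max_lag = 48
lags = np.arange(-max_lag, max_lag + 1)
corr = [np.dot(left, np.roll(right, -lag)) for lag in lags]
itd_est = int(lags[int(np.argmax(corr))])
print(itd_est)  # 12
```

The study's contribution lies in what this simple picture omits: how the match stays level-invariant, which the authors capture by adding a level-dependent threshold criterion to an existing temporal-integration model.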
Time-Frequency Masking Performance for Improved Intelligibility with Microphone Arrays
Time-Frequency (TF) masking is an audio processing technique useful for isolating an audio source from interfering sources. TF masking has been applied and studied in monaural and binaural applications, but has only recently been applied to distributed microphone arrays. This work focuses on evaluating the TF masking technique\u27s ability to isolate human speech and improve speech intelligibility in an immersive cocktail party environment. In particular, an upper-bound on TF masking performance is established and compared to the traditional delay-sum and general sidelobe canceler (GSC) beamformers. Additionally, the novel technique of combining the GSC with TF masking is investigated and its performance evaluated. This work presents a resource-efficient method for studying the performance of these isolation techniques and evaluates their performance using both virtually simulated data and data recorded in a real-life acoustical environment. Further, methods are presented to analyze speech intelligibility post-processing, and automated objective intelligibility measurements are applied alongside informal subjective assessments to evaluate the performance of these processing techniques. Finally, the causes for subjective/objective intelligibility measurement disagreements are discussed, and it was shown that TF masking did enhance intelligibility beyond delay-sum beamforming and that the utilization of adaptive beamforming can be beneficial
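The upper bound on TF-masking performance mentioned above is conventionally formed with oracle knowledge of the sources. As a hedged sketch (STFT grid, signals, and the 0 dB threshold are assumptions for the demo, not the thesis' exact configuration), an ideal binary mask keeps only the time-frequency cells where the target dominates the interference:

```python
# Sketch of an ideal binary mask: with oracle target and interferer
# magnitude spectrograms, retain only target-dominated cells.
import numpy as np

rng = np.random.default_rng(4)
n_frames, n_bins = 100, 257            # hypothetical STFT grid

target = np.abs(rng.standard_normal((n_frames, n_bins)))
interf = np.abs(rng.standard_normal((n_frames, n_bins)))
mixture = target + interf              # magnitude-domain approximation

# Ideal binary mask: 1 where local SNR exceeds 0 dB, 0 elsewhere.
ibm = (target > interf).astype(float)
separated = ibm * mixture

# Fraction of cells assigned to the target (about half, by symmetry here).
print(ibm.mean())
```

In practice the mask must be estimated rather than given, which is where the beamformer outputs (delay-sum, GSC) evaluated in this work come in as mask-estimation front ends.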
Speech Perception in Reverberated Condition By Cochlear Implants
Previous studies of bilateral cochlear-implant users examined cocktail-party settings under anechoic listening conditions. In the real world, however, listeners routinely encounter reverberation, which can significantly degrade speech intelligibility for all listeners, independent of their hearing status.
The objective of this study is to investigate the effects of reverberation on the binaural benefits for speech recognition by bilateral cochlear-implant (CI) listeners.
A bilateral CI subject was tested under different reverberation conditions. Recorded IEEE sentences from one male speaker, mixed either with speech-shaped noise (SSN; energetic masking) or with two female competing talkers (2FSN; informational masking) at different signal-to-noise ratios (SNRs), were used as stimuli. The male target speech was always presented at 90° azimuth (the front), while the masker was placed at 0°, 90°, or 180° azimuth (0° on the left, 180° on the right). The generated stimuli were presented to the subject via an auxiliary input connected to the sound processor in a double-walled sound-attenuated booth. In each condition, the subject was tested with each ear alone as well as with both ears.
Prior studies predict a decrease in speech intelligibility in reverberant conditions compared with anechoic environments. As predicted, we observed such a decrease, since a more reverberant environment produces more masking than a less reverberant one. We also observed the benefit of spatial hearing in reverberant environments: when the masker was placed at the better ear, the subject performed better than when it was placed at the other ear. Finally, we examined the effect of reverberation on energetic and informational masking: when the target and interferer were spatially separated, reverberation had a greater detrimental effect on informational masking than on energetic masking, and when the target and interferer were co-located, performance was better with energetic masking than with informational masking.
Due to time limitations and subject availability, testing was done with one CI subject. Further testing and research on this topic would help to clarify the effects of informational versus energetic masking in reverberant conditions.
Communications Biophysics
Contains research objectives and reports on eight research projects split into three sections.
National Institutes of Health (Grant 2 PO1 NS13126)
National Institutes of Health (Grant 5 RO1 NS18682)
National Institutes of Health (Grant 5 RO1 NS20322)
National Institutes of Health (Grant 1 RO1 NS 20269)
National Institutes of Health (Grant 5 T32 NS 07047)
Symbion, Inc.
National Institutes of Health (Grant 5 R01 NS10916)
National Institutes of Health (Grant 1 RO NS 16917)
National Science Foundation (Grant BNS83-19874)
National Science Foundation (Grant BNS83-19887)
National Institutes of Health (Grant 5 RO1 NS12846)
National Institutes of Health (Grant 1 RO1 NS21322-01)
National Institutes of Health (Grant 5 T32-NS07099-07)
National Institutes of Health (Grant 1 RO1 NS14092-06)
National Science Foundation (Grant BNS77-21751)
National Institutes of Health (Grant 5 RO1 NS11080)