211 research outputs found

    The influence of vision on the perceptual compensation for reverberation in simulated environments

    In typical listening environments, auditory signals arrive at the ear as a fusion of the direct energy from sound sources and the indirect reflections via reverberation. The listener thus cannot directly access the source and reverberation components individually, so the perceptual separation of these components can be subject to ambiguity. Accurate expectations of reverberation have been shown to reduce such ambiguity. The visible features of the physical environment (e.g., spatial and surface properties) can reveal aspects of reverberation that inform such expectations, suggesting an inferential role of vision in disambiguating the source and reverberation components. The aim of this thesis was to evaluate the degree to which visual information from simulated environments can affect the expectations of reverberation and consequently improve judgements of sound sources. To this end, we conducted three behavioural studies that assessed perception in audiovisual environments via online simulations created from a database of real-world locations. Chapter 3 assessed whether visual cues to the environment could convey the reverberant properties of physical locations in an audiovisual congruence task. The results indicated a greater impression of congruence when reverberant cues were identical or similar to those represented by the depicted environment, demonstrating a capacity for vision to inform meaningful expectations of reverberation. Chapter 4 evaluated the degree to which prior exposure to visual environments contributed to the identification of speech sources within reverberation. We found that exposure to the visual environment did little to improve the identification of reverberant speech sources in this context. Chapter 5 investigated whether a concurrent visual depiction of the environment would affect the tendency for estimates of sound source duration to be consistent despite varying reverberation. 
The results showed that source duration estimates were influenced by the degree of reverberation present, and were seemingly unaffected by any visual exposure. Taken together, the findings of this thesis suggest that scene understanding from vision contributes to the overall spatial understanding of environments and their reverberant properties, but appears to contribute little to enhancing the perceptual separation of source and reverberation components used to improve judgements of auditory sources.

    Single- and multi-microphone speech dereverberation using spectral enhancement

    In speech communication systems, such as voice-controlled systems, hands-free mobile telephones, and hearing aids, the received microphone signals are degraded by room reverberation, background noise, and other interferences. This signal degradation may lead to total unintelligibility of the speech and decrease the performance of automatic speech recognition systems. In the context of this work, reverberation is the process of multi-path propagation of an acoustic sound from its source to one or more microphones. The received microphone signal generally consists of a direct sound, reflections that arrive shortly after the direct sound (commonly called early reverberation), and reflections that arrive after the early reverberation (commonly called late reverberation). Reverberant speech can be described as sounding distant with noticeable echo and colouration. These detrimental perceptual effects are primarily caused by late reverberation, and generally increase with increasing distance between the source and microphone. Conversely, early reverberation tends to improve the intelligibility of speech; in combination with the direct sound, it is sometimes referred to as the early speech component. Reduction of the detrimental effects of reflections is evidently of considerable practical importance, and is the focus of this dissertation. More specifically, the dissertation deals with dereverberation techniques, i.e., signal processing techniques to reduce the detrimental effects of reflections. In the dissertation, novel single- and multi-microphone speech dereverberation algorithms are developed that aim at the suppression of late reverberation, i.e., at estimation of the early speech component. This is done via so-called spectral enhancement techniques that require a specific measure of the late reverberant signal. 
This measure, called spectral variance, can be estimated directly from the received (possibly noisy) reverberant signal(s) using a statistical reverberation model and a limited amount of a priori knowledge about the acoustic channel(s) between the source and the microphone(s). In our work an existing single-channel statistical reverberation model serves as a starting point. The model is characterized by one parameter that depends on the acoustic characteristics of the environment. We show that the spectral variance estimator that is based on this model can only be used when the source-microphone distance is larger than the so-called critical distance. This is, crudely speaking, the distance where the direct sound power is equal to the total reflective power. A generalization of the statistical reverberation model in which the direct sound is incorporated is developed. This model requires one additional parameter that is related to the ratio between the direct sound energy and the sound energy of all reflections. The generalized model is used to derive a novel spectral variance estimator. When the novel estimator is used for dereverberation rather than the existing estimator, and the source-microphone distance is smaller than the critical distance, the dereverberation performance is significantly increased. Single-microphone systems only exploit the temporal and spectral diversity of the received signal. Reverberation, of course, also induces spatial diversity. To additionally exploit this diversity, multiple microphones must be used, and their outputs must be combined by a suitable spatial processor such as the so-called delay and sum beamformer. It is not a priori evident whether spectral enhancement is best done before or after the spatial processor. For this reason we investigate both possibilities, as well as a combination of the spatial processor and the spectral enhancement technique. 
An advantage of the latter option is that the spectral variance estimator can be further improved. Our experiments show that the use of multiple microphones affords a significant improvement of the perceptual speech quality. The applicability of the theory developed in this dissertation is demonstrated using a hands-free communication system. Since hands-free systems are often used in a noisy and reverberant environment, the received microphone signal contains not only the desired signal but also interferences such as room reverberation that is caused by the desired source, background noise, and a far-end echo signal that results from a sound that is produced by the loudspeaker. Usually an acoustic echo canceller is used to cancel the far-end echo. Additionally, a post-processor is used to suppress background noise and residual echo, i.e., echo which could not be cancelled by the echo canceller. In this work a novel structure and post-processor for an acoustic echo canceller are developed. The post-processor suppresses late reverberation caused by the desired source, residual echo, and background noise. The late reverberation and late residual echo are estimated using the generalized statistical reverberation model. Experimental results convincingly demonstrate the benefits of the proposed system for suppressing late reverberation, residual echo and background noise. The proposed structure and post-processor have a low computational complexity and a highly modular structure, can be seamlessly integrated into existing hands-free communication systems, and afford a significant increase of the listening comfort and speech intelligibility.
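The single-parameter statistical model mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common exponential-decay form, in which the late reverberant spectral variance is a delayed and attenuated copy of the reverberant spectral variance, with decay rate derived from the reverberation time T60. The function names, the 50 ms late-reverberation boundary, and the spectral-subtraction gain are illustrative assumptions, not the dissertation's own estimator.

```python
import numpy as np

def late_reverb_variance(power_spec, t60, hop_s, late_start_s=0.05):
    """Estimate the late reverberant spectral variance (hypothetical helper).

    power_spec   : (frames, bins) array of smoothed |X|^2 values
    t60          : reverberation time in seconds (the single model parameter)
    hop_s        : STFT hop size in seconds
    late_start_s : assumed boundary between early and late reverberation
    """
    delta = 3.0 * np.log(10.0) / t60                 # energy decays 60 dB over t60
    n_late = max(1, int(round(late_start_s / hop_s)))
    decay = np.exp(-2.0 * delta * n_late * hop_s)    # attenuation over the delay
    late = np.zeros_like(power_spec, dtype=float)
    late[n_late:] = decay * power_spec[:-n_late]     # delayed, attenuated copy
    return late

def spectral_gain(power_spec, late_var, floor=0.1):
    """Spectral-subtraction style gain with a lower bound (an assumed gain rule)."""
    gain = 1.0 - late_var / np.maximum(power_spec, 1e-12)
    return np.maximum(gain, floor)
```

Multiplying the reverberant STFT coefficients by the gain attenuates time-frequency points dominated by late reverberation. As the abstract notes, a model of this form ignores the direct sound, which is why it is only suitable beyond the critical distance.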

    Investigation of coupled acoustic fields in multi-domed monumental structures by numerical and experimental methods

    The key concern of this study is to investigate sound fields of single-space superstructures sheltered with multiple domes, in terms of their potential for featuring non-exponential sound energy decay characteristics. In this framework, Süleymaniye Mosque and Hagia Sophia Museum are selected as cases for investigating the effects of different material use and volumetric contribution on multi-slope decay formation. The methodology involves joint use of in-situ acoustical measurements and acoustical simulations. Relevant acoustical parameters including decay rates and decay times are computed by applying Bayesian decay parameter estimation. Analysis results of experimentally acquired and simulated data disclose double or triple decay formation in the superstructures of Süleymaniye Mosque and Hagia Sophia Museum. To justify the phenomena and to understand the mechanism of energy exchanges, spatial sound energy distributions and energy flow vectors are studied by Diffusion Equation Model (DEM) simulations and intensity probe measurements for the case of Süleymaniye Mosque. Both computed and in-situ flow vectors highlight the contribution of the sound-reflective central dome versus the absorptive carpeted floor in providing later energy feedback, creating a non-diffuse sound field. On the other hand, a DEM simulation trial of Süleymaniye Mosque with a marble floor instead of carpet resulted in a much more diffuse sound field, implying that the use of a sound-reflective floor material prevented multi-slope decay formation. Results from various acoustical data collection and data analysis techniques showed that energy fragmentation in support of non-exponential energy decay formation is due to both the materials' sound absorption characteristics and their distributions, as well as volumetric inter-space relations. Ph.D. - Doctoral Progra
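To make the notion of multi-slope decay concrete, the following sketch computes an energy decay curve by Schroeder backward integration and evaluates a simplified sum-of-exponentials decay model. This is only an illustration under stated assumptions: the function names are hypothetical, the noise term used in full Bayesian decay models is omitted, and the Bayesian parameter estimation itself is not shown.

```python
import numpy as np

def schroeder_decay_db(impulse_response):
    """Energy decay curve in dB via Schroeder backward integration."""
    energy = np.asarray(impulse_response, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]      # energy remaining after each sample
    return 10.0 * np.log10(edc / edc[0])     # normalised so the curve starts at 0 dB

def multi_slope_energy(t, amplitudes, decay_times):
    """Sum of exponential decays; each term drops 60 dB over its decay time."""
    t = np.asarray(t, dtype=float)
    total = np.zeros_like(t)
    for amplitude, decay_time in zip(amplitudes, decay_times):
        total += amplitude * np.exp(-6.0 * np.log(10.0) * t / decay_time)
    return total
```

With a single term the model plots as a straight line in dB (exponential decay). Adding a second, weaker term with a longer decay time bends the curve, which is the double-slope signature the study looks for in the measured and simulated data.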

    Speech Dereverberation Based on Multi-Channel Linear Prediction

    Room reverberation can severely degrade the auditory quality and intelligibility of the speech signals received by distant microphones in an enclosed environment. In recent years, various dereverberation algorithms have been developed to tackle this problem, such as beamforming and inverse filtering of the room transfer function. However, these methods rely heavily on the precise estimation of either the direction of arrival (DOA) or room acoustic characteristics. Thus, their performance is very much limited. A more promising category of dereverberation algorithms has been developed based on multi-channel linear prediction (MCLP). This idea was first proposed in the time domain, where the speech signal is highly correlated over a short period of time. To ensure a good suppression of the reverberation, the prediction filter length is required to be longer than the reverberation time. As a result, the complexity of this algorithm is often unacceptable because of the large covariance matrix calculation. To overcome this disadvantage, this thesis focuses on MCLP dereverberation methods performed in the short-time Fourier transform (STFT) domain. Recently, the weighted prediction error (WPE) algorithm has been developed and widely applied to speech dereverberation. In the WPE algorithm, MCLP is used in the STFT domain to estimate the late reverberation components from previous frames of the reverberant speech. The enhanced speech is obtained by subtracting the late reverberation from the reverberant speech. The STFT coefficients are assumed to be independent and to obey a Gaussian distribution. A maximum likelihood (ML) problem is formulated in each frequency bin to calculate the predictor coefficients. In this thesis, the original WPE algorithm is improved in two aspects. First, two advanced statistical models, the generalized Gaussian distribution (GGD) and the Laplacian distribution, are employed instead of the classic Gaussian distribution. 
Both of them are shown to give better modeling of the histogram of the clean speech. Second, we focus on improving the estimation of the variances of the STFT coefficients of the desired signal. In the original WPE algorithm, the variances are estimated in each frequency bin independently without considering the cross-frequency correlation. Thus, we integrate nonnegative matrix factorization (NMF) into the WPE algorithm to refine the estimation of the variances and hence obtain a better dereverberation performance. Another category of MCLP-based dereverberation algorithms has been proposed in the literature, exploiting the sparsity of the STFT coefficients of the desired signal for calculating the predictor coefficients. In this thesis, we also investigate an efficient algorithm based on the maximization of the group sparsity of the desired signal using mixed norms. Inspired by the idea of the sparse linear predictor (SLP), we propose to include a sparse constraint for the predictor coefficients in order to further improve the dereverberation performance. A weighting parameter is also introduced to achieve a trade-off between the sparsity of the desired signal and the predictor coefficients. Computer simulation of the proposed dereverberation algorithms is conducted. Our experimental results show that the proposed algorithms can significantly improve the quality of reverberant speech signals under different reverberation times. Subjective evaluation also gives a more intuitive demonstration of the enhanced speech intelligibility. Performance comparison also shows that our algorithms outperform some of the state-of-the-art dereverberation techniques.
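The baseline WPE procedure described in the abstract above (delayed multi-channel linear prediction per frequency bin, Gaussian model, alternating variance and predictor updates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name, default delay, tap count, iteration count, and regularisation constants are all hypothetical, and none of the proposed GGD, Laplacian, NMF, or sparsity extensions is included.

```python
import numpy as np

def wpe_one_bin(Y, delay=2, taps=4, iterations=3, eps=1e-10):
    """Baseline WPE sketch for a single frequency bin.

    Y : (frames, mics) complex STFT coefficients of one bin.
    Returns the dereverberated coefficients for microphone 0.
    """
    Y = np.asarray(Y, dtype=complex)
    n_frames, n_mics = Y.shape

    # Stack delayed past frames of all microphones: row t holds
    # Y[t - delay], ..., Y[t - delay - taps + 1] (zeros before the signal starts).
    Y_past = np.zeros((n_frames, n_mics * taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < n_frames:
            Y_past[shift:, k * n_mics:(k + 1) * n_mics] = Y[:n_frames - shift]

    y = Y[:, 0]
    d = y.copy()                                  # initial desired-signal estimate
    n = Y_past.shape[1]
    for _ in range(iterations):
        lam = np.maximum(np.abs(d) ** 2, eps)     # per-frame variance estimate
        weighted = Y_past.T / lam                 # apply weights 1 / lam[t]
        R = weighted @ Y_past.conj()              # weighted covariance matrix
        p = weighted @ y.conj()                   # weighted cross-correlation
        reg = 1e-8 * np.trace(R).real / n + 1e-12 # small ridge term (assumption)
        g = np.linalg.solve(R + reg * np.eye(n), p)
        d = y - Y_past @ g.conj()                 # subtract predicted late reverberation
    return d
```

In a complete system this routine would run independently for every frequency bin of the multi-channel STFT, which is exactly the per-bin independence that the NMF-based variance refinement in the abstract aims to relax.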

    Glottal-synchronous speech processing

    Glottal-synchronous speech processing is a field of speech science where the pseudoperiodicity of voiced speech is exploited. Traditionally, speech processing involves segmenting and processing short speech frames of predefined length; this may fail to exploit the inherent periodic structure of voiced speech which glottal-synchronous speech frames have the potential to harness. Glottal-synchronous frames are often derived from the glottal closure instants (GCIs) and glottal opening instants (GOIs). The SIGMA algorithm was developed for the detection of GCIs and GOIs from the Electroglottograph signal with a measured accuracy of up to 99.59%. For GCI and GOI detection from speech signals, the YAGA algorithm provides a measured accuracy of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to reverberation than single-channel algorithms. The GCIs are applied to real-world applications including speech dereverberation, where SNR is improved by up to 5 dB, and to prosodic manipulation where the importance of voicing detection in glottal-synchronous algorithms is demonstrated by subjective testing. The GCIs are further exploited in a new area of data-driven speech modelling, providing new insights into speech production and a set of tools to aid deployment into real-world applications. The technique is shown to be applicable in areas of speech coding, identification and artificial bandwidth extension of telephone speec