
    Single- and multi-microphone speech dereverberation using spectral enhancement

    In speech communication systems, such as voice-controlled systems, hands-free mobile telephones, and hearing aids, the received microphone signals are degraded by room reverberation, background noise, and other interferences. This signal degradation may render the speech completely unintelligible and decreases the performance of automatic speech recognition systems. In the context of this work, reverberation is the process of multi-path propagation of an acoustic sound from its source to one or more microphones. The received microphone signal generally consists of a direct sound, reflections that arrive shortly after the direct sound (commonly called early reverberation), and reflections that arrive after the early reverberation (commonly called late reverberation). Reverberant speech can be described as sounding distant, with noticeable echo and colouration. These detrimental perceptual effects are primarily caused by late reverberation and generally increase with increasing distance between the source and the microphone. Conversely, early reverberation tends to improve the intelligibility of speech; in combination with the direct sound it is sometimes referred to as the early speech component.

    Reducing the detrimental effects of reflections is evidently of considerable practical importance and is the focus of this dissertation. More specifically, the dissertation deals with dereverberation techniques, i.e., signal processing techniques that reduce the detrimental effects of reflections. Novel single- and multi-microphone speech dereverberation algorithms are developed that aim at the suppression of late reverberation, i.e., at the estimation of the early speech component. This is done via so-called spectral enhancement techniques that require a specific measure of the late reverberant signal. This measure, called the spectral variance, can be estimated directly from the received (possibly noisy) reverberant signal(s) using a statistical reverberation model and a limited amount of a priori knowledge about the acoustic channel(s) between the source and the microphone(s).

    In our work an existing single-channel statistical reverberation model serves as a starting point. The model is characterized by one parameter that depends on the acoustic characteristics of the environment. We show that the spectral variance estimator based on this model can only be used when the source-microphone distance is larger than the so-called critical distance, which is, roughly speaking, the distance at which the direct sound power equals the total reflective power. A generalization of the statistical reverberation model that incorporates the direct sound is developed. This model requires one additional parameter, related to the ratio between the energy of the direct sound and the energy of all reflections. The generalized model is used to derive a novel spectral variance estimator. When this estimator is used for dereverberation instead of the existing one, and the source-microphone distance is smaller than the critical distance, the dereverberation performance increases significantly.

    Single-microphone systems only exploit the temporal and spectral diversity of the received signal. Reverberation, of course, also induces spatial diversity. To additionally exploit this diversity, multiple microphones must be used, and their outputs must be combined by a suitable spatial processor such as the so-called delay-and-sum beamformer.
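    The core single-channel ingredients described above can be pictured with a short sketch: an exponential-decay (Polack-style) statistical model predicts the late-reverberant spectral variance from the reverberant power spectrum, and a spectral-subtraction-style gain suppresses it. This is only a minimal illustration under assumed parameters (a known reverberation time, a 50 ms early/late boundary, hypothetical function names); it is not the dissertation's exact estimator and, in particular, omits the generalized model that accounts for the direct sound.

```python
import numpy as np

def late_reverb_variance(power_spec, t60, fs, hop, t_late=0.05):
    """Predict the late-reverberant spectral variance from the reverberant
    power spectrogram using an exponential-decay (Polack-style) model.

    power_spec : (frames, bins) short-time power spectrum of the reverberant signal
    t60        : reverberation time in seconds (assumed known a priori)
    hop        : STFT hop size in samples
    t_late     : early/late boundary in seconds (e.g. 50 ms)
    """
    delta = 3.0 * np.log(10.0) / t60                 # decay rate of the model
    n_late = max(1, int(round(t_late * fs / hop)))   # early/late boundary in frames
    decay = np.exp(-2.0 * delta * n_late * hop / fs)
    lam_late = np.zeros_like(power_spec)
    # late reverberation in frame l is a decayed copy of the reverberant power n_late frames earlier
    lam_late[n_late:] = decay * power_spec[:-n_late]
    return lam_late

def late_reverb_suppression_gain(power_spec, lam_late, g_min=0.1):
    """Spectral-subtraction-style gain that attenuates the estimated late reverberation."""
    gain = 1.0 - lam_late / np.maximum(power_spec, 1e-12)
    return np.maximum(gain, g_min)                   # floor the gain to limit musical noise
```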
    It is not a priori evident whether spectral enhancement is best performed before or after the spatial processor. For this reason we investigate both possibilities, as well as a combination of the spatial processor and the spectral enhancement technique. An advantage of the latter option is that the spectral variance estimator can be further improved. Our experiments show that the use of multiple microphones affords a significant improvement in perceptual speech quality.

    The applicability of the theory developed in this dissertation is demonstrated using a hands-free communication system. Since hands-free systems are often used in noisy and reverberant environments, the received microphone signal contains not only the desired signal but also interferences such as room reverberation caused by the desired source, background noise, and a far-end echo signal that results from sound produced by the loudspeaker. Usually an acoustic echo canceller is used to cancel the far-end echo. Additionally, a post-processor is used to suppress background noise and residual echo, i.e., echo that could not be cancelled by the echo canceller. In this work a novel structure and post-processor for an acoustic echo canceller are developed. The post-processor suppresses late reverberation caused by the desired source, residual echo, and background noise. The late reverberation and late residual echo are estimated using the generalized statistical reverberation model. Experimental results convincingly demonstrate the benefits of the proposed system for suppressing late reverberation, residual echo, and background noise. The proposed structure and post-processor have low computational complexity and a highly modular design, can be seamlessly integrated into existing hands-free communication systems, and afford a significant increase in listening comfort and speech intelligibility.
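    A minimal sketch of the delay-and-sum beamformer used as the spatial processor above, assuming the per-channel steering delays are known from the source-microphone geometry; the integer-sample alignment with wrap-around is used purely for brevity, whereas a practical implementation would use fractional-delay filters.

```python
import numpy as np

def delay_and_sum(mic_signals, delays, fs):
    """Delay-and-sum beamformer: time-align each channel towards the source and average.

    mic_signals : (mics, samples) array of microphone signals
    delays      : per-channel steering delays in seconds (assumed known)
    fs          : sampling rate in Hz
    """
    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        shift = int(round(delays[m] * fs))       # integer-sample approximation of the delay
        out += np.roll(mic_signals[m], -shift)   # advance channel m (wrap-around ignored here)
    return out / n_mics
```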

    Spatial, Spectral, and Perceptual Nonlinear Noise Reduction for Hands-free Microphones in a Car

    Speech enhancement in an automobile is a challenging problem because interference can come from engine noise, fans, music, wind, road noise, reverberation, echo, and passengers engaging in other conversations. Hands-free microphones make the situation worse because the strength of the desired speech signal decreases as the distance between the microphone and the talker increases. Automobile safety is improved when the driver can use a hands-free interface to phones and other devices instead of taking their eyes off the road. The demand for high-quality hands-free communication in the automobile requires the introduction of more powerful algorithms.

    This thesis shows that a unique combination of five algorithms can achieve superior speech enhancement for a hands-free system when compared to beamforming or spectral subtraction alone. Several different designs were analyzed and tested before converging on the configuration that achieved the best results. Beamforming, voice activity detection, spectral subtraction, perceptual nonlinear weighting, and talker isolation via pitch tracking all work together in a complementary, iterative manner to create a speech enhancement system capable of significantly enhancing real-world speech signals.

    The following conclusions are supported by the simulation results using data recorded in a car and are in strong agreement with theory. Adaptive beamforming, such as the Generalized Sidelobe Canceller (GSC), can be used effectively if the filters adapt only during silent data frames, because too much of the desired speech is cancelled otherwise. Spectral subtraction removes stationary noise, while perceptual weighting prevents the introduction of offensive audible noise artifacts. Talker isolation via pitch tracking performs better when used after beamforming and spectral subtraction because of the higher accuracy obtained after initial noise removal. Iterating the algorithm once increases the accuracy of the Voice Activity Detection (VAD), which improves the overall performance of the algorithm. Placing the microphone(s) on the ceiling above the head of, and slightly forward of, the desired talker appears to be the best location in an automobile, based on the experiments performed in this thesis. Objective speech quality measures show that the algorithm removes a majority of the stationary noise in the hands-free environment of an automobile with relatively minimal speech distortion.
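    Two of the five stages, the VAD-gated noise estimate and spectral subtraction with a spectral floor (a simple stand-in for the perceptual weighting that limits musical noise), can be sketched as follows. The over-subtraction factor, floor, and function name are illustrative assumptions rather than the thesis's actual parameters.

```python
import numpy as np

def subtract_stationary_noise(noisy_power, vad_silence, alpha=2.0, beta=0.02):
    """Spectral subtraction driven by a noise estimate updated only in silent frames.

    noisy_power : (frames, bins) short-time power spectrum of the noisy speech
    vad_silence : boolean array, True where the VAD declares the frame speech-free
    alpha       : over-subtraction factor
    beta        : spectral floor that retains a little noise to mask musical tones
    """
    noise_est = np.zeros(noisy_power.shape[1])
    n_silent = 0
    enhanced = np.empty_like(noisy_power)
    for l, frame in enumerate(noisy_power):
        if vad_silence[l]:                        # adapt the noise statistics only in silence
            n_silent += 1
            noise_est += (frame - noise_est) / n_silent
        residual = frame - alpha * noise_est      # over-subtract the stationary noise
        enhanced[l] = np.maximum(residual, beta * frame)
    return enhanced
```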

    Postfiltering Using Multichannel Spectral Estimation in Multispeaker Environments

    This paper investigates the problem of enhancing a single desired speech source from a mixture of signals in multispeaker environments. A beamformer structure is proposed which combines a fixed beamformer with postfiltering. In the first stage, the fixed multiobjective optimal beamformer is designed to spatially extract the desired source by suppressing all other, undesired sources. In the second stage, a multichannel power spectral estimator is proposed and incorporated into the postfilter, thus enabling further suppression. The combined scheme exploits both the spatial and the spectral characteristics of the signals. Two new multichannel spectral estimation methods are proposed for the postfiltering, using, respectively, the inner product and joint diagonalization. Evaluations using recordings from a real-room environment show that the proposed beamformer offers a good level of interference suppression whilst maintaining a low level of distortion of the desired source.
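    The two-stage structure can be pictured with a short sketch: a fixed beamformer handles the spatial stage, and a Wiener-style postfilter handles the spectral stage. The postfilter gain here is a generic Wiener gain driven by externally supplied PSD estimates; the paper's multichannel estimators based on the inner product and joint diagonalization are not reproduced.

```python
import numpy as np

def beamform_and_postfilter(X, w, phi_s, phi_v):
    """Fixed beamformer followed by a Wiener-style postfilter (one STFT frame).

    X     : (bins, mics) multichannel STFT coefficients
    w     : (bins, mics) fixed beamformer weights (designed offline for the desired source)
    phi_s : (bins,) estimated PSD of the desired source at the beamformer output
    phi_v : (bins,) estimated PSD of the residual interference at the beamformer output
    """
    y = np.einsum('km,km->k', np.conj(w), X)   # spatial stage: w^H x in every bin
    gain = phi_s / (phi_s + phi_v + 1e-12)     # spectral stage: Wiener gain from the PSD estimates
    return gain * y
```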

    Informed algorithms for sound source separation in enclosed reverberant environments

    While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis thus aims at improving the performance of audio separation algorithms when they are informed, i.e., have access to source location information. These locations are assumed to be known a priori in this work, for example from video processing.

    Initially, a multi-microphone-array-based method combined with binary time-frequency masking is proposed. A robust least-squares frequency-invariant data-independent beamformer designed with the location information is utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency-masking-based post-processing is used, but cepstral-domain smoothing is required to mitigate musical noise. To tackle the under-determined case and further improve separation performance at higher reverberation times, a two-microphone-based method is described which is inspired by human auditory processing and generates soft time-frequency masks. In this approach the interaural level difference, the interaural phase difference, and mixing vectors are modeled probabilistically in the time-frequency domain, and the model parameters are learned through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source using the location information, and is used as the mean parameter of the mixing-vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic model framework; it encodes the spatial characteristics of the enclosure and further improves the separation performance in challenging scenarios, i.e., when sources are in close proximity and when the level of reverberation is high.

    Finally, new dereverberation-based pre-processing is proposed, based on a cascade of three dereverberation stages, each of which enhances the two-microphone reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, in which the late reverberation is estimated and suppressed. The combination of such dereverberation-based pre-processing and soft-mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed, for example, from speech signals from the TIMIT database and measured room impulse responses.
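    As a rough illustration of how known source locations can inform soft time-frequency masks, the sketch below scores each time-frequency bin against per-source direction (steering) vectors and normalizes the scores into masks. The ad hoc similarity-to-probability mapping and frequency-independent direction vectors are simplifying assumptions; the actual approach models ILD, IPD, and mixing vectors probabilistically and learns the parameters with EM, which is not reproduced here.

```python
import numpy as np

def soft_masks_from_directions(X, direction_vectors, kappa=10.0):
    """Soft time-frequency masks from known per-source direction vectors.

    X                 : (bins, frames, mics) STFT of the two-microphone mixture
    direction_vectors : (sources, mics) unit-norm complex steering vectors, one per source
                        (assumed available from the informed source locations)
    kappa             : concentration of the ad hoc similarity-to-probability mapping
    """
    # normalise the observed mixing vector in every time-frequency bin
    obs = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-12)
    # similarity of each bin's observation to each source's direction vector
    sims = np.abs(np.einsum('kfm,sm->kfs', obs, np.conj(direction_vectors)))
    logits = kappa * sims
    logits -= logits.max(axis=-1, keepdims=True)
    masks = np.exp(logits)
    masks /= masks.sum(axis=-1, keepdims=True)   # per-bin posteriors act as soft masks
    return masks
```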

    Multi-channel dereverberation for speech intelligibility improvement in hearing aid applications


    System approach to robust acoustic echo cancellation through semi-blind source separation based on independent component analysis

    We live in a dynamic world full of noises and interferences. The conventional acoustic echo cancellation (AEC) framework based on the least mean square (LMS) algorithm by itself lacks the ability to handle many secondary signals that interfere with the adaptive filtering process, e.g., local speech and background noise. In this dissertation, we build a foundation for what we refer to as the system approach to signal enhancement as we focus on the AEC problem. We first propose the residual echo enhancement (REE) technique, which utilizes the error recovery nonlinearity (ERN) to "enhance" the filter estimation error prior to the filter adaptation. The single-channel AEC problem can be viewed as a special case of semi-blind source separation (SBSS) where one of the source signals is partially known, i.e., the far-end microphone signal that generates the near-end acoustic echo. SBSS optimized via independent component analysis (ICA) leads to the system combination of the LMS algorithm with the ERN, which allows for continuous and stable adaptation even during double talk.

    Second, we extend the system perspective to the decorrelation problem for AEC, where we show that the REE procedure can be applied effectively in a multi-channel AEC (MCAEC) setting to indirectly assist the recovery of the AEC performance lost due to inter-channel correlation, known generally as the "non-uniqueness" problem. We develop a novel, computationally efficient technique of frequency-domain resampling (FDR) that alleviates the non-uniqueness problem directly while introducing minimal distortion to signal quality and statistics. We also apply the system approach to the multi-delay filter (MDF), which suffers from the inter-block correlation problem. Finally, we generalize the MCAEC problem in the SBSS framework and discuss many issues related to the implementation of an SBSS system. We propose a constrained batch-online implementation of SBSS that stabilizes the convergence behavior even in the worst-case scenario of a single far-end talker along with the non-uniqueness condition on the far-end mixing system.

    The proposed techniques are developed from a pragmatic standpoint, motivated by real-world problems in acoustic and audio signal processing. Generalization of the orthogonality principle to the system level of an AEC problem allows us to relate AEC to source separation that seeks to maximize the independence, and hence implicitly the orthogonality, not only between the error signal and the far-end signal but among all signals involved. The system approach, of which the REE paradigm is just one realization, enables many traditional signal enhancement techniques to be encompassed in an analytically consistent yet practically effective manner for solving the enhancement problem in a very noisy and disruptive acoustic mixing environment.

    Ph.D. Committee Chair: Biing-Hwang Juang; Committee Member: Brani Vidakovic; Committee Member: David V. Anderson; Committee Member: Jeff S. Shamma; Committee Member: Xiaoli M
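    The idea of "enhancing" the error before adaptation can be illustrated with a plain NLMS echo canceller whose update passes the error through a saturating nonlinearity, so that large errors caused by near-end speech perturb the filter less. The tanh nonlinearity, scale, and step size below are placeholders chosen for illustration, not the ICA-derived ERN of the dissertation.

```python
import numpy as np

def aec_nlms_with_error_nonlinearity(far_end, mic, filt_len=256, mu=0.5, scale=0.01):
    """NLMS echo canceller whose weight update is driven by a saturated error signal.

    far_end  : loudspeaker (far-end) signal driving the echo path
    mic      : microphone signal containing echo, near-end speech, and noise
    filt_len : length of the adaptive echo-path model
    """
    h = np.zeros(filt_len)
    error = np.zeros_like(mic)
    for n in range(filt_len, len(mic)):
        x = far_end[n - filt_len:n][::-1]        # most recent far-end samples
        e = mic[n] - h @ x                       # cancellation error (near-end estimate)
        error[n] = e
        e_nl = scale * np.tanh(e / scale)        # saturate large errors, e.g. during double talk
        h += mu * e_nl * x / (x @ x + 1e-8)      # normalised update driven by the shaped error
    return error, h
```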

    Distributed Microphone Array System for Two-way Audio Communication

    In this work a distributed microphone array system for two-way audio communication is presented. The goal of the system is to locate the dominant speaker and capture the speech signal with the highest possible quality. In the presented system each microphone array works as a Polynomial Beamformer (PBF), thus enabling continuous beam steering. The output power of each PBF beam is used to determine the direction of the dominant speech source. Finally, a Spatial Likelihood Function (SLF) is formed by combining the output beam powers of each microphone array, and the speaker is determined to be at the point with the highest SLF value. The audio signal is captured by steering the beam of the microphone array closest to the speaker towards the speaker's direction. The presented audio capture front-end was evaluated with simulated and measured data. The evaluation shows that the implemented system achieves a localization accuracy of approximately 40 cm and attenuates interference sources from other directions by approximately 15 dB. Finally, the system was implemented to run in real time in the Pure Data signal processing environment.
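    A rough sketch of the localization step: each array contributes the output powers of its beams, every candidate grid point is scored with the power of the beam pointing closest to it, and the scores are accumulated into a spatial likelihood function whose maximum gives the speaker position. The nearest-beam mapping and 2-D geometry are simplifying assumptions made for illustration.

```python
import numpy as np

def localize_speaker(beam_powers, beam_azimuths, array_positions, grid):
    """Combine per-array beam output powers into a spatial likelihood function (SLF).

    beam_powers     : list over arrays of (n_beams,) beam output powers
    beam_azimuths   : list over arrays of (n_beams,) beam steering azimuths in radians
    array_positions : (n_arrays, 2) array centres in the room plane
    grid            : (n_points, 2) candidate speaker positions
    """
    slf = np.zeros(len(grid))
    for powers, azimuths, pos in zip(beam_powers, beam_azimuths, array_positions):
        powers = np.asarray(powers)
        rel = grid - pos
        grid_az = np.arctan2(rel[:, 1], rel[:, 0])                  # azimuth of each grid point
        diff = np.angle(np.exp(1j * (grid_az[:, None] - np.asarray(azimuths)[None, :])))
        nearest = np.argmin(np.abs(diff), axis=1)                   # beam pointing closest to the point
        slf += powers[nearest]                                      # accumulate that beam's power
    return grid[np.argmax(slf)], slf
```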

    Single-Microphone Speech Dereverberation based on Multiple-Step Linear Predictive Inverse Filtering and Spectral Subtraction

    Single-channel speech dereverberation is the challenging problem of deconvolving the reverberation, produced by the room impulse response, from the speech signal when only one observation of the reverberant signal (one microphone) is available. Although mild levels of reverberation are helpful in perceiving a speech (or any audio) signal, the adverse effect of reverberation, particularly at high levels, can both deteriorate the performance of automatic recognition systems and make the speech less intelligible to humans. Single-microphone speech dereverberation is more challenging than multi-microphone dereverberation, since it does not allow for spatial processing of different observations of the signal.

    A review of recent single-channel dereverberation techniques reveals that those based on LP-residual enhancement are the most promising. On the other hand, spectral subtraction has also been used effectively for dereverberation, particularly when long reflections are involved. Using LP residuals and spectral subtraction as two promising tools for dereverberation, a new dereverberation technique is proposed. The first stage of the proposed technique consists of pre-whitening followed by delayed long-term LP filtering, in which the kurtosis or skewness of the LP residuals is maximized to control the weight updates of the inverse filter. The second stage consists of nonlinear spectral subtraction. The proposed two-stage dereverberation scheme leads to two separate algorithms, depending on whether kurtosis or skewness maximization is used to establish the feedback function for the weight updates of the adaptive inverse filter.

    It is shown that the proposed algorithms have several advantages over the existing major single-microphone methods, including a reduction of both early and late reverberation, speech enhancement even in the case of very high reverberation times, robustness to additive background noise, and the introduction of only a few minor artifacts. Room impulse responses equalized by the proposed algorithms have shorter reverberation times, which means that the inverse filtering performed by the proposed algorithms is more successful in dereverberating the speech signal. For short, medium, and high reverberation times, the signal-to-reverberation ratio of the proposed technique is significantly higher than that of the existing major algorithms. The waveforms and spectrograms of the inverse-filtered and fully processed signals indicate the superiority of the proposed algorithms. Assessment of the overall quality of the processed speech signals by automatic speech recognition and perceptual evaluation of speech quality tests also confirms that in most cases the proposed technique yields higher scores, and in the cases where it does not, the difference is not as significant as in the other aspects of the performance evaluation. Finally, the robustness of the proposed algorithms against background noise is investigated and compared to that of the benchmark algorithms, which shows that the proposed algorithms are capable of maintaining a rather stable performance for contaminated speech signals with SNR levels as low as 0 dB.
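    A minimal sketch of the first stage's adaptive inverse filter, in the kurtosis-maximizing variant: the filter is driven by a feedback function proportional to the gradient of the output kurtosis, computed from running second- and fourth-moment estimates. The moment smoothing, step size, and initialization are illustrative assumptions, and the skewness variant and the spectral-subtraction second stage are omitted.

```python
import numpy as np

def kurtosis_max_inverse_filter(lp_residual, filt_len=512, mu=1e-4, beta=0.99):
    """Adapt an inverse filter so that the kurtosis of its output is maximized.

    lp_residual : pre-whitened LP residual of the reverberant speech
    beta        : smoothing factor for the running moment estimates
    """
    h = np.zeros(filt_len)
    h[0] = 1.0                                   # start from a pass-through filter
    m2, m4 = 1.0, 3.0                            # running estimates of E[y^2] and E[y^4]
    y_out = np.zeros_like(lp_residual)
    for n in range(filt_len, len(lp_residual)):
        x = lp_residual[n - filt_len:n][::-1]
        y = h @ x
        y_out[n] = y
        m2 = beta * m2 + (1.0 - beta) * y * y
        m4 = beta * m4 + (1.0 - beta) * y ** 4
        # feedback function: gradient of the kurtosis E[y^4] / E[y^2]^2 w.r.t. the output
        f = 4.0 * (m2 * y ** 3 - m4 * y) / (m2 ** 3 + 1e-12)
        h += mu * f * x                          # stochastic gradient ascent on the kurtosis
    return y_out, h
```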

    Blind Subband Beamforming With Time-Delay Constraints for Moving Source Speech Enhancement
