13 research outputs found

    Acoustic Echo Suppression Techniques Based on Spectral and Temporal Correlations

    Get PDF
    Doctoral dissertation, Department of Electrical and Computer Engineering, Seoul National University, August 2016 (advisor: Nam Soo Kim). In the past decades, a number of approaches have been dedicated to acoustic echo cancellation and suppression, which reduce the negative effects of acoustic echo, namely the acoustic coupling between the loudspeaker and microphone in a room. In particular, the increasing use of full-duplex telecommunication systems has led to a demand for faster and more reliable acoustic echo cancellation algorithms. Most solutions have been based on adaptive filters, but these filters must be long enough to cover most of the echo signal, and the linear filtering in these algorithms may be insufficient to remove the echo in various environments. In this thesis, a novel stereophonic acoustic echo suppression (SAES) technique based on spectral and temporal correlations is proposed in the short-time Fourier transform (STFT) domain. Unlike traditional stereophonic acoustic echo cancellation, the proposed algorithm estimates the echo spectra in the STFT domain and uses a Wiener filter to suppress echo without performing any explicit double-talk detection. The proposed approach takes into account the interdependencies among components in adjacent time frames and frequency bins, which enables more accurate estimation of the echo signals. Due to the limitations of power amplifiers or loudspeakers, the echo signals captured at the microphones are not in a linear relationship with the far-end signals, even when the echo path is perfectly linear. The nonlinear components of the echo cannot be successfully removed by a linear acoustic echo canceller. The remaining echo components at the output of acoustic echo suppression (AES) can be further suppressed by applying residual echo suppression (RES) algorithms. In this thesis, we propose an optimal RES gain estimation based on a deep neural network (DNN) that exploits both the far-end and AES output signals in all frequency bins. A DNN structure is introduced as a regression function representing the complex nonlinear mapping from these signals to the optimal RES gain. Thanks to the capacity of the DNN, the full-band spectro-temporal correlations can be considered while learning this nonlinear function. The proposed method does not require any explicit double-talk detector to deal with single-talk and double-talk situations. One well-known approach to nonlinear acoustic echo cancellation is adaptive Volterra filtering; various algorithms based on the Volterra filter have been proposed to describe the characteristics of nonlinear echo and have shown better performance than conventional linear filtering. However, their performance may be unsatisfactory, since these algorithms cannot capture the full correlation underlying the nonlinear relationship between the input and far-end signals in the time-frequency domain. In this thesis, we propose a novel DNN-based approach to nonlinear acoustic echo suppression (NAES) that extends the proposed RES algorithm. Instead of estimating a residual gain for suppressing the nonlinear echo components, the proposed algorithm directly recovers the near-end speech signal through gain estimation by a DNN operating on the input and far-end signals. For echo-aware training, the a priori and a posteriori signal-to-echo ratios (SERs) are introduced as additional DNN inputs to track changes in the echo signal.
    In addition, multi-task learning (MTL) is combined with the DNN-based NAES incorporating echo-aware training for robustness. In the proposed system, an auxiliary task of double-talk detection is jointly trained with the primary task of gain estimation for NAES. Through the MTL framework, the DNN can learn representations that suppress more echo in single-talk periods and improve the gain estimates in double-talk periods. Moreover, combining echo-aware training and MTL with double-talk detection makes the DNN more robust in various conditions. The proposed techniques show significantly better performance than conventional AES methods in both single- and double-talk periods. As a pre-processing step for applications such as speech recognition and speech enhancement, these approaches can help transmit clean speech and provide acceptable communication in real full-duplex environments.

    Contents: Chapter 1 Introduction (1.1 Background; 1.2 Scope of thesis). Chapter 2 Conventional Approaches for Acoustic Echo Suppression (2.1 Single Channel Acoustic Echo Cancellation and Suppression; 2.1.1 Single Channel Acoustic Echo Cancellation; 2.1.2 Adaptive Filters for Acoustic Echo Cancellation; 2.1.3 Acoustic Echo Suppression Based on Spectral Modification; 2.2 Residual Echo Suppression; 2.2.1 Spectral Feature-based Nonlinear Residual Echo Suppression; 2.3 Stereophonic Acoustic Echo Cancellation; 2.4 Wiener Filtering for Stereophonic Acoustic Echo Suppression). Chapter 3 Stereophonic Acoustic Echo Suppression Incorporating Spectro-Temporal Correlations (3.1 Introduction; 3.2 Linear Time-Invariant Systems in the STFT Domain with Crossband Filtering; 3.3 Enhanced SAES (ESAES) Utilizing Spectro-Temporal Correlations; 3.3.1 Problem Formulation; 3.3.2 Estimation of Extended PSD Matrices, Echo Spectra, and Gain Function; 3.3.3 Complexity of the Proposed ESAES Algorithm; 3.4 Experimental Results; 3.5 Summary). Chapter 4 Nonlinear Residual Echo Suppression Based on Deep Neural Network (4.1 Introduction; 4.2 A Brief Review on RES; 4.3 Deep Neural Networks; 4.4 Nonlinear RES using Deep Neural Network; 4.5 Experimental Results; 4.5.1 Combination with Stereophonic Acoustic Echo Suppression; 4.6 Summary). Chapter 5 Enhanced Deep Learning Frameworks for Nonlinear Acoustic Echo Suppression (5.1 Introduction; 5.2 DNN-based Nonlinear Acoustic Echo Suppression using Echo Aware Training; 5.3 Multi-Task Learning for NAES; 5.4 Experimental Results; 5.5 Summary). Chapter 6 Conclusions. Bibliography. Abstract (in Korean).
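
    To make the gain-estimation idea concrete, the following is a minimal sketch of a DNN that regresses per-bin suppression gains from echo-aware features. It is not the thesis' architecture: the layer sizes, the feature layout (log-magnitude spectra of the microphone and far-end signals plus a priori / a posteriori SER estimates), and names such as NAESGainNet are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of DNN-based gain regression for echo suppression.
# Layer sizes and feature layout are assumptions, not the thesis' configuration.
import torch
import torch.nn as nn

N_FREQ = 257  # frequency bins per frame (assumes a 512-point STFT)

class NAESGainNet(nn.Module):
    """Regress a per-bin suppression gain in [0, 1] from echo-aware features."""
    def __init__(self, n_freq: int = N_FREQ, hidden: int = 1024):
        super().__init__()
        # Four feature maps per frame -> 4 * n_freq input features:
        # mic log-spectrum, far-end log-spectrum, a priori SER, a posteriori SER.
        self.net = nn.Sequential(
            nn.Linear(4 * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),  # gain in [0, 1]
        )

    def forward(self, mic_logspec, far_logspec, ser_prior, ser_post):
        feats = torch.cat([mic_logspec, far_logspec, ser_prior, ser_post], dim=-1)
        return self.net(feats)

def oracle_gain(near_mag, mic_mag, eps=1e-8):
    """Training target: per-bin ratio of near-end to microphone magnitude."""
    return torch.clamp(near_mag / (mic_mag + eps), 0.0, 1.0)

# An MTL variant would add a per-frame double-talk head sharing the hidden layers.
```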

    Blind MultiChannel Identification and Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function

    Get PDF
    This paper addresses the problems of blind channel identification and multichannel equalization for speech dereverberation and noise reduction. The time-domain cross-relation method is not suitable for blind room impulse response identification, due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse responses are approximately represented by convolutive transfer functions (CTFs) with far fewer coefficients. The CTFs suffer from common zeros caused by the oversampled STFT. We propose to identify the CTFs using the STFT with oversampled signals and critically sampled CTFs, which is a good compromise between the frequency aliasing of the signals and the common-zeros problem of the CTFs. In addition, a normalization of the CTFs is proposed to remove the gain ambiguity across sub-bands. In the STFT domain, the identified CTFs are used for multichannel equalization, in which the sparsity of speech signals is exploited. We propose to perform inverse filtering by minimizing the ℓ1-norm of the source signal, with a relaxed ℓ2-norm fitting error between the microphone signals and the convolution of the estimated source signal with the CTFs used as a constraint. This method is advantageous in that the noise can be reduced by relaxing the ℓ2-norm to a tolerance corresponding to the noise power, and the tolerance can be set automatically. The experiments confirm the efficiency of the proposed method even under conditions with high reverberation levels and intense noise. Comment: 13 pages, 5 figures, 5 tables
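
    The ℓ1-minimization step lends itself to a proximal-gradient sketch. The snippet below solves the Lagrangian form of the problem with ISTA in one CTF sub-band, rather than the paper's constrained formulation with an automatically set noise tolerance; the regularization weight lam, the step-size bound, and the iteration count are assumptions.

```python
# Sketch: l1-regularized multichannel deconvolution in one CTF sub-band,
# i.e. min_s  sum_i ||h_i * s - x_i||_2^2 + lam * ||s||_1,  solved by ISTA.
import numpy as np

def ista_deconv(x, h, lam=0.1, n_iter=200):
    """x: list of per-channel observations (complex arrays of length T).
    h: list of per-channel CTFs (complex arrays of length L)."""
    T = len(x[0])
    s = np.zeros(T, dtype=complex)
    # Conservative step size: the l1 norm of each filter bounds its operator norm.
    lip = max(sum(np.sum(np.abs(hi)) ** 2 for hi in h), 1e-12)
    step = 1.0 / lip
    for _ in range(n_iter):
        grad = np.zeros(T, dtype=complex)
        for xi, hi in zip(x, h):
            r = np.convolve(s, hi)[:T] - xi                    # channel residual
            L = len(hi)
            # Adjoint of "convolve then truncate" is a (conjugated) correlation.
            grad += np.correlate(r, hi, mode="full")[L - 1:L - 1 + T]
        s = s - step * grad
        # Complex soft-thresholding: proximal step of the l1 penalty.
        mag = np.abs(s)
        shrink = np.maximum(mag - lam * step, 0.0)
        s = np.where(mag > 0, shrink * s / np.maximum(mag, 1e-12), 0)
    return s
```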

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Get PDF
    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

    Audio source separation into the wild

    Get PDF
    This review chapter is dedicated to multichannel audio source separation in real-life environments. We explore some of the major achievements in the field and discuss some of the remaining challenges. We will explore several important practical scenarios, e.g., moving sources and/or microphones, varying numbers of sources and sensors, high reverberation levels, spatially diffuse sources, and synchronization problems. Several applications, such as smart assistants, cellular phones, hearing aids, and robots, will be discussed. Our perspectives on the future of the field are given as concluding remarks of this chapter.

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    Get PDF
    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user's signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the incoming far-end user's speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes, with similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk. Using a standard evaluation technique, the proposed algorithm is shown to have detection performance comparable to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room-change insensitivity of speaker extraction, generating minimal false double-talk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
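
    A minimal sketch may help make the decomposition concrete: the microphone magnitude spectrogram is factored onto a fixed union of an echo basis (built from far-end frames) and a pre-trained near-end speech basis, updating only the activations. The basis sizes, the Euclidean cost, and the final masking step are illustrative assumptions, not the thesis' exact formulation.

```python
# Sketch of NMF-based speaker extraction: activations-only multiplicative
# updates against a fixed union of bases, then a Wiener-like ratio mask.
import numpy as np

def extract_near_end(Y, B_echo, B_near, n_iter=100, eps=1e-9):
    """Y: mic magnitude STFT (F x T). B_echo, B_near: nonnegative bases (F x K)."""
    B = np.hstack([B_echo, B_near])                 # fixed union of bases
    H = np.abs(np.random.rand(B.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        # Multiplicative update for activations (Euclidean cost), bases fixed.
        H *= (B.T @ Y) / (B.T @ B @ H + eps)
    K_echo = B_echo.shape[1]
    echo_mag = B_echo @ H[:K_echo]                  # echo spectral estimate
    near_mag = B_near @ H[K_echo:]                  # near-end spectral estimate
    mask = near_mag / (near_mag + echo_mag + eps)   # Wiener-like ratio mask
    return mask * Y                                 # masked magnitude spectrogram
```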

    Reverberation: models, estimation and application

    No full text
    The use of reverberation models is required in many applications such as acoustic measurements, speech dereverberation, and robust automatic speech recognition. The aim of this thesis is to investigate different models and propose a perceptually relevant reverberation model with suitable parameter estimation techniques for different applications. Reverberation can be modelled in both the time and frequency domain. The model parameters give direct information about both physical and perceptual characteristics. These characteristics create a multidimensional parameter space of reverberation, which can to a large extent be captured by a time-frequency domain model. In this thesis, the relationship between physical and perceptual model parameters is discussed. In the first application, an intrusive technique is proposed to measure reverberance, the perception of reverberation, and colouration. The room decay rate parameter is of particular interest. In practical applications, a blind estimate of the decay rate of acoustic energy in a room is required. A statistical model for the distribution of the decay rate of the reverberant signal, named the eagleMax distribution, is proposed. The eagleMax distribution describes the reverberant speech decay rates as a random variable that is the maximum of the room decay rates and the anechoic speech decay rates. Three methods were developed to estimate the mean room decay rate from the eagleMax distributions alone. The estimated room decay rates form a reverberation model that is discussed in the context of room acoustic measurements, speech dereverberation, and robust automatic speech recognition individually.
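
    As a hedged illustration of the kind of decay-rate statistics such a model is fitted to, the sketch below extracts segment-wise decay rates as slopes of a linear fit to the short-time log-energy envelope. The window lengths and energy definition are assumptions, and the eagleMax fitting step itself (modelling observed decay rates as the maximum of room and speech decay rates) is not reproduced here.

```python
# Sketch: segment-wise decay rates (dB/s) from a short-time log-energy envelope.
import numpy as np

def frame_decay_rates(x, fs, win=0.025, hop=0.010, seg_frames=20):
    """Return decay-rate estimates per segment (negative slope = decay)."""
    wlen, hlen = int(win * fs), int(hop * fs)
    n_frames = 1 + (len(x) - wlen) // hlen
    log_e = np.array([
        10 * np.log10(np.sum(x[i * hlen:i * hlen + wlen] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    t = np.arange(seg_frames) * hop                 # segment time axis in seconds
    rates = []
    for i in range(0, n_frames - seg_frames, seg_frames):
        slope = np.polyfit(t, log_e[i:i + seg_frames], 1)[0]
        rates.append(slope)                         # slope of the log-energy fit
    return np.array(rates)
```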

    Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization

    Get PDF
    This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse responses of the sensors in the STFT domain. Second, the DP-RTF is estimated using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an inter-frame spectral subtraction algorithm is proposed, which enables the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most acoustic conditions.
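
    A simplified sketch of the PSD-based estimation step: per frequency bin, the relative transfer function is estimated from auto- and cross-power spectral densities averaged over frames, after subtracting noise PSDs estimated from frames assumed speech-free. The CTF model and the paper's inter-frame subtraction scheme are simplified away here, so this is an assumption-laden approximation rather than the published method.

```python
# Sketch: RTF features from noise-compensated auto- and cross-PSDs per bin.
import numpy as np

def rtf_features(X1, X2, noise_frames=20, eps=1e-12):
    """X1, X2: binaural STFTs (F x T, complex). Returns one RTF estimate per bin."""
    # Noise PSDs from the first frames, assumed speech-free and stationary.
    phi_n11 = np.mean(np.abs(X1[:, :noise_frames]) ** 2, axis=1)
    phi_n21 = np.mean(X2[:, :noise_frames] * np.conj(X1[:, :noise_frames]), axis=1)
    # Signal PSDs over the remaining frames, noise contribution subtracted.
    phi_11 = np.mean(np.abs(X1[:, noise_frames:]) ** 2, axis=1) - phi_n11
    phi_21 = np.mean(X2[:, noise_frames:] * np.conj(X1[:, noise_frames:]),
                     axis=1) - phi_n21
    rtf = phi_21 / np.maximum(phi_11.real, eps)     # channel-2 / channel-1 RTF
    return rtf   # concatenating over frequency yields the localization feature
```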