7 research outputs found

    Speech reverberation suppression for time-varying environments using weighted prediction error method with time-varying autoregressive model

    In this paper, a novel approach to speech reverberation suppression in non-stationary (changing) acoustic environments is proposed. The approach is based on the popular weighted prediction error (WPE) method, but instead of using fixed reverberation prediction weights, it adopts the more general time-varying autoregressive (TV-AR) model, which allows the prediction weights to be estimated and updated dynamically over time. We use an initial estimate of the prediction weights to select the TV-AR model order optimally and to calculate the TV-AR coefficients. Next, by properly interpolating the calculated coefficients, we obtain the final estimate of the reverberation prediction weights. The proposed approach is evaluated not only for fixed acoustic rooms but also for environments where the source and/or sensors are moving. Our experiments show stronger reverberation suppression and higher quality in the enhanced speech samples compared with recent speech dereverberation literature.

    Speech Dereverberation Based on Multi-Channel Linear Prediction

    Room reverberation can severely degrade the auditory quality and intelligibility of speech signals received by distant microphones in an enclosed environment. In recent years, various dereverberation algorithms have been developed to tackle this problem, such as beamforming and inverse filtering of the room transfer function. However, methods of this kind rely heavily on precise estimation of either the direction of arrival (DOA) or the room acoustic characteristics, so their performance is limited. A more promising category of dereverberation algorithms is based on the multi-channel linear predictor (MCLP). This idea was first proposed in the time domain, where the speech signal is highly correlated over short periods of time. To ensure good suppression of the reverberation, the prediction filter length is required to be longer than the reverberation time. As a result, the complexity of this algorithm is often unacceptable because of the large covariance matrix calculation. To overcome this disadvantage, this thesis focuses on MCLP dereverberation methods performed in the short-time Fourier transform (STFT) domain. Recently, the weighted prediction error (WPE) algorithm has been developed and widely applied to speech dereverberation. In the WPE algorithm, MCLP is used in the STFT domain to estimate the late reverberation components from previous frames of the reverberant speech. The enhanced speech is obtained by subtracting the late reverberation from the reverberant speech. Each STFT coefficient is assumed to be independent and to follow a Gaussian distribution. A maximum likelihood (ML) problem is formulated in each frequency bin to calculate the predictor coefficients. In this thesis, the original WPE algorithm is improved in two respects. First, two more advanced statistical models, the generalized Gaussian distribution (GGD) and the Laplacian distribution, are employed instead of the classic Gaussian distribution. Both are shown to model the histogram of clean speech better. Second, we focus on improving the estimation of the variances of the STFT coefficients of the desired signal. In the original WPE algorithm, the variances are estimated in each frequency bin independently, without considering the cross-frequency correlation. We therefore integrate nonnegative matrix factorization (NMF) into the WPE algorithm to refine the variance estimates and hence obtain better dereverberation performance. Another category of MCLP-based dereverberation algorithms has been proposed in the literature; it exploits the sparsity of the STFT coefficients of the desired signal when calculating the predictor coefficients. In this thesis, we also investigate an efficient algorithm based on maximizing the group sparsity of the desired signal using mixed norms. Inspired by the idea of the sparse linear predictor (SLP), we propose to include a sparsity constraint on the predictor coefficients in order to further improve the dereverberation performance. A weighting parameter is also introduced to achieve a trade-off between the sparsity of the desired signal and that of the predictor coefficients. Computer simulations of the proposed dereverberation algorithms are conducted. Our experimental results show that the proposed algorithms can significantly improve the quality of reverberant speech signals under different reverberation times. Subjective evaluation also gives a more intuitive demonstration of the enhanced speech intelligibility. A performance comparison further shows that our algorithms outperform some state-of-the-art dereverberation techniques.
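As a rough illustration of the original Gaussian WPE scheme summarized in this abstract (predict late reverberation from delayed past frames, weight by the estimated signal variance, subtract), the following sketch processes a single STFT frequency bin with NumPy. It is not the thesis's implementation: the function name `wpe_one_bin`, the parameter defaults, and the small diagonal regularization are assumptions made for the example, and the GGD, Laplacian, NMF, and sparsity extensions are not covered.

```python
import numpy as np

def wpe_one_bin(Y, taps=5, delay=2, iterations=3, eps=1e-8):
    """Gaussian WPE for one frequency bin (illustrative sketch).

    Y: complex array of shape (channels, frames) holding the STFT
       coefficients of the reverberant signal in this bin.
    Returns an array of the same shape with the predicted late
    reverberation subtracted.
    """
    D, T = Y.shape
    # Stack delayed copies of Y: block k holds Y shifted by (delay + k) frames.
    Y_tilde = np.zeros((D * taps, T), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]

    X = Y.astype(complex)  # initial estimate of the desired signal
    for _ in range(iterations):
        # Time-varying variance of the desired signal, averaged over channels.
        lam = np.maximum(np.mean(np.abs(X) ** 2, axis=0), eps)
        Yw = Y_tilde / lam                    # weight frame t by 1 / lam[t]
        R = Yw @ Y_tilde.conj().T             # weighted correlation matrix
        P = Yw @ Y.conj().T                   # weighted cross-correlation
        # Assumed regularization to keep the linear system well conditioned.
        R = R + (eps * np.trace(R).real / (D * taps) + eps) * np.eye(D * taps)
        G = np.linalg.solve(R, P)             # prediction filter, (D*taps, D)
        X = Y - G.conj().T @ Y_tilde          # subtract predicted late reverberation
    return X
```

In a full system this routine would be applied independently to every frequency bin of a multi-channel STFT; the first `delay` frames pass through unchanged because no delayed observations exist for them.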

    Model-based and data-driven techniques for environment-robust speech recognition [주변 환경에 강인한 음성인식을 위한 모델 및 데이터기반 기법]

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2015. 김남수. In this thesis, we propose model-based and data-driven techniques for environment-robust automatic speech recognition. The model-based technique is a feature enhancement method for reverberant noisy environments that improves the performance of a Gaussian mixture model (GMM)-hidden Markov model (HMM) system. It is based on the interacting multiple model (IMM), which was originally developed for the single-channel scenario. We extend the single-channel IMM algorithm so that it can handle multi-channel inputs under the Bayesian framework. The multi-channel IMM algorithm is capable of tracking time-varying room impulse responses and background noises by updating the relevant parameters in an on-line manner. To limit the computation as the number of microphones increases, a computationally efficient algorithm is also devised. The performance gain of the proposed method has been confirmed in various simulated and real environmental conditions. The data-driven techniques are based on the deep neural network (DNN)-HMM hybrid system. To enhance the performance of the DNN-HMM system in adverse environments, we propose three techniques. Firstly, we propose a novel supervised pre-training technique for the DNN-HMM system to achieve robust speech recognition in adverse environments. In the proposed approach, our aim is to initialize the DNN parameters such that they yield abstract features robust to acoustic environment variations. To achieve this, we first derive the abstract features from an early fine-tuned DNN model trained on a clean speech database. Using the derived abstract features as the target values, the standard error back-propagation algorithm with stochastic gradient descent is performed to estimate the initial parameters of the DNN. The performance of the proposed algorithm was evaluated on Aurora-4 DB, and better results were observed compared with a number of conventional pre-training methods. Secondly, new DNN-based robust speech recognition approaches that take advantage of noise estimates are proposed. A novel part of the proposed approaches is that the time-varying noise estimates are applied to the DNN as additional inputs. For this, we extract the noise estimates frame by frame from the IMM algorithm, which is known to perform well in tracking slowly varying background noise. The performance of the proposed approaches was evaluated on Aurora-4 DB, and better performance was observed compared with conventional DNN-based robust speech recognition algorithms. Finally, a new approach to DNN-based robust speech recognition using soft target labels is proposed. Soft target labeling means that each target value of the DNN output is not restricted to 0 or 1 but takes a value in (0,1), with the values summing to 1. In this study, the soft target labels are obtained from the forward-backward algorithm well known in HMM training. The proposed method makes DNN training more robust in noisy and unseen conditions. The performance of the proposed approach was evaluated on Aurora-4 DB and various mismatched noise test conditions, and it was found to outperform the conventional hard target labeling method. Furthermore, for the data-driven approaches, an integrated technique using the above three algorithms and the model-based technique is described, and the performance results in matched and mismatched noise conditions are discussed. In matched noise conditions, the initialization method for the DNN was effective in enhancing the recognition performance.
In mismatched noise conditions, the combination of using the noise estimates as a DNN input and soft target labels showed the best recognition results among all the tested combinations of the proposed techniques.

    Contents:
    1 Introduction
    2 Experimental Environments and Database: 2.1 ASR in Hands-Free Scenario and Feature Extraction; 2.2 Relationship between Clean and Distorted Speech in Feature Domain; 2.3 Database (2.3.1 TI Digits Corpus; 2.3.2 Aurora-4 DB)
    3 Previous Robust ASR Approaches: 3.1 IMM-Based Feature Compensation in Noise Environment; 3.2 Single-Channel Reverberation and Noise-Robust Feature Enhancement Based on IMM; 3.3 Multi-Channel Feature Enhancement for Robust Speech Recognition; 3.4 DNN-Based Robust Speech Recognition
    4 Multi-Channel IMM-Based Feature Enhancement for Robust Speech Recognition: 4.1 Introduction; 4.2 Observation Model in Multi-Channel Reverberant Noisy Environment; 4.3 Multi-Channel Feature Enhancement in a Bayesian Framework (4.3.1 A Priori Clean Speech Model; 4.3.2 A Priori Model for RIR; 4.3.3 A Priori Model for Background Noise; 4.3.4 State Transition Formulation; 4.3.5 Function Linearization); 4.4 Feature Enhancement Algorithm; 4.5 Incremental State Estimation; 4.6 Experiments (4.6.1 Simulation Data; 4.6.2 Live Recording Data; 4.6.3 Computational Complexity); 4.7 Summary
    5 Supervised Denoising Pre-Training for Robust ASR with DNN-HMM: 5.1 Introduction; 5.2 Deep Neural Networks; 5.3 Supervised Denoising Pre-Training; 5.4 Experiments (5.4.1 Feature Extraction and GMM-HMM System; 5.4.2 DNN Structures; 5.4.3 Performance Evaluation); 5.5 Summary
    6 DNN-Based Frameworks for Robust Speech Recognition Using Noise Estimates: 6.1 Introduction; 6.2 DNN-Based Frameworks for Robust ASR (6.2.1 Robust Feature Enhancement; 6.2.2 Robust Model Training); 6.3 IMM-Based Noise Estimation; 6.4 Experiments (6.4.1 DNN Structures; 6.4.2 Performance Evaluations); 6.5 Summary
    7 DNN-Based Robust Speech Recognition Using Soft Target Labels: 7.1 Introduction; 7.2 DNN-HMM Hybrid System; 7.3 Soft Target Label Estimation; 7.4 Experiments (7.4.1 DNN Structures; 7.4.2 Performance Evaluation; 7.4.3 Effects of Control Parameter ξ; 7.4.4 An Integration with SDPT and ESTN Methods; 7.4.5 Performance Evaluation on Various Noise Types; 7.4.6 DNN Training and Decoding Time); 7.5 Summary
    8 Conclusions
    Bibliography
    Abstract in Korean (요약)
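The abstract of this thesis states that soft target labels are obtained from the forward-backward algorithm and that they lie in (0,1) and sum to 1. A minimal sketch of that idea, assuming a toy HMM, is given below: per-frame state posteriors are computed with a scaled forward-backward pass and could then serve as DNN training targets. The function name `soft_target_labels` and all model parameters are hypothetical, and the thesis's control parameter ξ is not modelled because the listing does not define it.

```python
import numpy as np

def soft_target_labels(log_lik, trans, init):
    """Per-frame HMM state posteriors via scaled forward-backward.

    log_lik: (frames, states) log-likelihood of each frame under each state
    trans:   (states, states) transition matrix, rows sum to 1
    init:    (states,) initial state distribution
    Returns a (frames, states) array whose rows sum to 1.
    """
    T, S = log_lik.shape
    # Rescale each frame's likelihoods for numerical stability; a per-frame
    # constant factor does not change the posteriors.
    lik = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))

    alpha = np.zeros((T, S))
    alpha[0] = init * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        alpha[t] /= alpha[t].sum()

    beta = np.ones((T, S))
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = trans @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta                       # unnormalized state posteriors
    return gamma / gamma.sum(axis=1, keepdims=True)
```

A hard label would instead put all mass on a single state per frame (for example the Viterbi state); the soft posteriors keep the uncertainty between competing states.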

    Setting up an experimentation environment for speech recognition in aircraft cockpits [Puesta en marcha de un entorno de experimentación para reconocimiento de habla en cabinas de avión]

    This project deals with automatic speech recognition (ASR) in aeronautic environments such as airplane cockpits. Cockpits are scenarios in which we face problems such as different kinds of noise (engine noise, conversational noise, or reverberant noise) and high variability due to non-native speakers. We have put special emphasis on finding solutions to the problem of reverberant noise. To carry out this investigation, we set up an experimentation environment based on the HIWIRE database, on which we tested basic techniques aimed at improving ASR performance. More specifically, we tested simple techniques such as mean and variance normalization and more complex ones such as spectral subtraction, the latter also combined with voice activity detection (VAD). Ingeniería de Telecomunicació
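Mean and variance normalization, one of the simple techniques named in this abstract, can be sketched in a few lines: each feature dimension of an utterance is shifted to zero mean and scaled to unit variance. The function name `mean_variance_normalize`, the per-utterance scope, and the `eps` floor are assumptions for illustration; the project's actual feature pipeline is not described in the listing.

```python
import numpy as np

def mean_variance_normalize(features, eps=1e-10):
    """Normalize each feature dimension over one utterance.

    features: (frames, dims) array, e.g. cepstral coefficients per frame.
    Returns an array of the same shape with zero mean and unit variance
    in every dimension.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Floor the standard deviation so constant dimensions do not divide by zero.
    return (features - mean) / np.maximum(std, eps)
```

Because a fixed linear channel adds a roughly constant offset in the cepstral domain, removing the per-utterance mean cancels much of that offset, which is why this normalization is a common first baseline.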

    Robust speech recognition under room-acoustic environmental conditions [Robuste Spracherkennung unter raumakustischen Umgebungsbedingungen]

    Automatic speech recognition (ASR) systems used in real-world indoor scenarios suffer from performance degradation if noise and reverberation conditions differ from the training conditions of the recognizer. This thesis deals with the problem of room reverberation as a cause of distortion in ASR systems. The background of this research is the design of practical command and control applications, such as a voice-controlled light switch in a room. The design therefore aims to incorporate several restrictive working conditions for the recognizer and still achieve a high level of robustness. One of those design restrictions is the minimisation of computational complexity to allow practical implementation on an embedded processor. One chapter comprehensively describes the room acoustic environment, including the behavior of the sound field in rooms. It addresses the speaker-room-microphone (SRM) system, which is expressed in the time domain as the room impulse response (RIR). The convolution of the RIR with the clean speech signal yields the reverberant signal at the microphone. A thorough analysis shows that the degree of distortion caused by reverberation depends on two parameters, the reverberation time T60 and the speaker-to-microphone distance (SMD). To evaluate the dependency of the recognition rate on the degree of distortion, a number of experiments have been conducted, confirming the dependency on the two parameters T60 and SMD. Further experiments have shown that ASR is barely affected by high-frequency reverberation, whereas low-frequency reverberation has a detrimental effect on the recognition rate. A literature survey concludes that, although several approaches exist which claim significant improvements, none of them fulfils the above-mentioned practical implementation criteria. Within this thesis, a new approach entitled 'harmonicity-based feature analysis' (HFA) is proposed. It is based on three ideas that are derived in the preceding chapters. Experimental results show that HFA is able to enhance the recognition rate in reverberant environments. Practically applicable results are even achieved when HFA is combined with reverberant training. The method is further evaluated against three other approaches from the literature, and combinations of methods are also tested. In the last chapter, the two base technologies, fundamental frequency (F0) estimation and voiced/unvoiced decision (VUD), are evaluated in reverberant environments, since they are necessary to run HFA. This evaluation aims to find one optimal method for each of these technologies. The results show that the performance of all F0 estimation methods and VUD methods degrades strongly in reverberant environments. Nevertheless, it is shown that HFA is able to deal with the uncertainties of these base technologies such that the recognition performance still improves.
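The statement in this abstract that convolving the RIR with the clean speech signal yields the reverberant microphone signal can be illustrated with a short NumPy sketch. The RIR here is a synthetic, exponentially decaying noise burst whose envelope falls by 60 dB at T60; that model, the function names, and the parameter values are assumptions for the example and are not taken from the thesis. The speaker-to-microphone distance (SMD) is not modelled.

```python
import numpy as np

def synthetic_rir(t60, fs, seed=0):
    """Toy room impulse response: decaying white noise.

    t60: reverberation time in seconds
    fs:  sampling rate in Hz
    The amplitude envelope 10 ** (-3 t / t60) drops by 60 dB at t = t60.
    """
    n = int(round(t60 * fs))
    t = np.arange(n) / fs
    envelope = 10.0 ** (-3.0 * t / t60)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n) * envelope

def reverberate(clean, rir):
    """Reverberant signal at the microphone: clean speech convolved with the RIR."""
    return np.convolve(clean, rir)
```

A longer T60 stretches the decaying tail of the RIR, so energy from each speech sound smears further into the following frames, which is the distortion the recognition experiments in the abstract quantify.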
