A Study on Deep Learning-Based Techniques for Noise-Robust Voice Activity Detection and Speech Enhancement
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2017. Nam Soo Kim (김남수).
Over the past decades, a number of approaches have been proposed to improve the performance of voice activity detection (VAD) and speech enhancement algorithms, which are crucial for speech communication and speech signal processing systems. In particular, the increasing use of machine learning-based techniques has led to more robust algorithms in low-SNR conditions. Among them, the deep neural network (DNN) has been one of the most popular techniques.
While DNN-based techniques have been applied successfully to these tasks, the characteristics of the VAD and speech enhancement tasks are not fully incorporated into the DNN structures and objective functions. In this thesis, we propose novel training schemes and a post-filter for DNN-based VAD and speech enhancement. Unlike algorithms built on a basic DNN framework, the proposed algorithms combine knowledge from the signal processing and machine learning communities to develop improved DNN-based VAD and speech enhancement algorithms. In the following chapters, the environmental mismatch problem in VAD is compensated for by applying multi-task learning to the DNN-based VAD. In addition, a DNN-based framework is proposed for the speech enhancement scenario, and a novel objective function and post-filter, both derived from the characteristics of human auditory perception, improve the DNN-based speech enhancement algorithm.
In the VAD task, a DNN-based algorithm was recently proposed and outperformed traditional and other machine learning-based VAD algorithms. However, the performance of the DNN-based algorithm sometimes deteriorates when the training and test environments do not match. To increase the performance of the DNN-based VAD in unseen environments, we adopt a multi-task learning (MTL) framework that consists of a primary VAD task and a subsidiary feature enhancement task. With the MTL framework, the DNN learns a denoising function in the shared hidden layers, which helps maintain VAD performance in mismatched noise conditions.
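As a rough illustration of how a primary VAD loss and a subsidiary feature enhancement loss can be combined in multi-task training, the NumPy sketch below adds a binary cross-entropy term and a weighted mean square error term. The function name `mtl_loss`, the `aux_weight` parameter, and the choice of these two loss terms are assumptions made for illustration; the abstract does not state the exact formulation used in the thesis.

```python
import numpy as np

def mtl_loss(vad_prob, vad_label, enhanced_feat, clean_feat, aux_weight=0.1):
    """Combine a primary VAD loss with a subsidiary feature-enhancement loss.

    All names and the weighting scheme are illustrative assumptions.
    """
    eps = 1e-12  # guards against log(0)
    p = np.clip(vad_prob, eps, 1.0 - eps)
    # Primary task: binary cross-entropy between speech probability and label.
    vad_loss = -np.mean(vad_label * np.log(p) + (1 - vad_label) * np.log(1 - p))
    # Subsidiary task: mean square error between enhanced and clean features.
    enh_loss = np.mean((enhanced_feat - clean_feat) ** 2)
    return vad_loss + aux_weight * enh_loss
```

In a network with shared hidden layers, the gradients of both terms flow into the shared weights, which is how the subsidiary task can shape the representation used by the VAD output.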
Second, the DNN-based framework is applied to speech enhancement by treating it as a regression task. The encoding vector of the conventional nonnegative matrix factorization (NMF)-based algorithm is estimated by the proposed DNN, and the performance of the DNN-based algorithm is compared with that of the conventional NMF-based algorithm.
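To make the role of the encoding vector concrete, the sketch below shows one common way an NMF-based enhancer uses it: speech and noise basis matrices are multiplied by their activations, and the resulting estimates form a Wiener-type gain applied to the noisy magnitude spectrum. The function `nmf_enhance`, its argument layout, and the gain rule are assumptions for illustration. In the thesis the encoding vector comes from a DNN, whereas here it is simply passed in.

```python
import numpy as np

def nmf_enhance(noisy_mag, speech_basis, noise_basis, encoding):
    """Apply a Wiener-type gain built from NMF bases and one encoding vector.

    noisy_mag:    (n_bins,) magnitude spectrum of one noisy frame
    speech_basis: (n_bins, n_speech) nonnegative speech basis matrix
    noise_basis:  (n_bins, n_noise) nonnegative noise basis matrix
    encoding:     (n_speech + n_noise,) nonnegative activations
                  (hypothetically the output of a trained DNN)
    """
    n_speech = speech_basis.shape[1]
    # Split the encoding vector into speech and noise activations.
    h_speech, h_noise = encoding[:n_speech], encoding[n_speech:]
    speech_est = speech_basis @ h_speech
    noise_est = noise_basis @ h_noise
    # Ratio of estimated speech to estimated speech plus noise, per bin.
    gain = speech_est / (speech_est + noise_est + 1e-12)
    return gain * noisy_mag
```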
Third, a perceptually motivated objective function is proposed for DNN-based speech enhancement. In the proposed technique, a new objective function, which consists of the Mel-scale weighted mean square error and the temporal and spectral variation similarities between the enhanced and clean speech, is employed in the DNN training stage. The proposed objective function computes the gradients on a perceptually motivated non-linear frequency scale and alleviates the over-smoothing of the estimated speech.
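The Mel-scale weighted mean square error term can be pictured with the sketch below, which weights each frequency bin by the slope of the common Mel mapping mel(f) = 2595 log10(1 + f/700), so that errors at low frequencies count more. This particular weighting, the helper names `mel_weights` and `mel_weighted_mse`, and the 16 kHz default are assumptions for illustration; the abstract does not give the thesis's actual weights, and the temporal and spectral variation similarity terms are not covered here.

```python
import numpy as np

def mel_weights(n_bins, sample_rate=16000):
    """Per-bin weights proportional to the slope of the Mel scale (assumed form)."""
    # Bin center frequencies from 0 Hz to the Nyquist frequency.
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # Slope of mel(f) = 2595 * log10(1 + f / 700), up to a constant factor.
    slope = 1.0 / (700.0 + freqs)
    # Normalize so the weights have mean 1.
    return slope * n_bins / slope.sum()

def mel_weighted_mse(enhanced, clean, sample_rate=16000):
    """Mean square error over (frames, bins) with Mel-derived bin weights."""
    w = mel_weights(enhanced.shape[-1], sample_rate)
    return np.mean(w * (enhanced - clean) ** 2)
```

Because the weights average to 1, an error that is uniform across bins yields the same value as the unweighted mean square error, which keeps the two losses on a comparable scale.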
Furthermore, a post-filter that adjusts the variance over frequency bins further compensates for the lack of contrast between spectral peaks and valleys in the enhanced speech. The conventional global variance (GV) equalization post-filters do not consider the spectral dynamics over frequency bins. To account for the contrast between spectral peaks and valleys in each enhanced speech frame, the proposed algorithm matches the variance over coefficients in the log-power spectral domain.
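One way to read "matching the variance over coefficients" is the per-frame rescaling sketched below: each frame of log-power spectra keeps its mean, while its deviations from that mean are scaled until the variance over frequency bins equals a reference value. The function `sv_equalize` and the `ref_var` argument are hypothetical; how the reference variance is obtained (for example, from clean-speech statistics) is an assumption not specified in the abstract.

```python
import numpy as np

def sv_equalize(log_power, ref_var):
    """Rescale each frame so its variance over frequency bins equals ref_var.

    log_power: (n_frames, n_bins) enhanced log-power spectra
    ref_var:   scalar or (n_frames, 1) target variance (assumed to be given)
    """
    mean = log_power.mean(axis=1, keepdims=True)
    var = log_power.var(axis=1, keepdims=True)
    # Scale factor that maps the current variance onto the reference variance.
    scale = np.sqrt(ref_var / np.maximum(var, 1e-12))
    # Keep the frame mean; stretch or shrink the peaks and valleys around it.
    return mean + scale * (log_power - mean)
```

A scale factor above 1 deepens the contrast between spectral peaks and valleys, which is the effect an over-smoothed enhanced spectrum needs.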
Finally, for the speech enhancement task, an integrated technique that combines the proposed perceptually motivated objective function and the post-filter is described. The performance of the conventional and proposed algorithms is discussed for matched and mismatched noise conditions, and the results of a subjective preference test are also provided.
1 Introduction 1
2 Conventional Approaches for Speech Enhancement 7
2.1 NMF-Based Speech Enhancement 7
3 Deep Neural Networks 13
3.1 Introduction 13
3.2 Objective Function 14
3.3 Stochastic Gradient Descent 16
4 DNN-Based Voice Activity Detection with Multi-Task Learning Framework 19
4.1 Introduction 19
4.2 DNN-Based VAD Algorithm 21
4.3 DNN-Based VAD with MTL framework 23
4.4 Experimental Results 26
4.4.1 Experiments in Matched Noise Conditions 26
4.4.2 Experiments in Mismatched Noise Conditions 28
4.5 Summary 30
5 NMF-based Speech Enhancement Using Deep Neural Network 35
5.1 Introduction 35
5.2 Encoding Vector Estimation Using DNN 37
5.3 Experiments 42
5.4 Summary 47
6 DNN-Based Monaural Speech Enhancement with Temporal and Spectral Variations Equalization 49
6.1 Introduction 49
6.2 Conventional DNN-Based Speech Enhancement 53
6.2.1 Training Stage 53
6.2.2 Test Stage 55
6.3 Perceptually-Motivated Criteria 56
6.3.1 Perceptually Motivated Objective Function 56
6.3.2 Mel-Scale Weighted Mean Square Error 58
6.3.3 Temporal Variation Similarity 58
6.3.4 Spectral Variation Similarity 61
6.3.5 DNN Training with the Proposed Objective Function 62
6.4 Experiments 62
6.4.1 Performance Evaluation with Varying Weight Parameters 64
6.4.2 Performance Evaluation in Matched Noise Conditions 64
6.4.3 Performance Evaluation in Mismatched Noise Conditions 66
6.4.4 Comparison Between Variation Analysis Method 66
6.4.5 Subjective Test Results 67
6.5 Summary 68
7 Spectral Variance Equalization Post-filter for DNN-Based Speech Enhancement 75
7.1 Introduction 75
7.2 GV Equalization Post-Filter 76
7.3 Spectral Variance (SV) Equalization Post-Filter 77
7.4 Experiments 78
7.4.1 Objective Test Results 78
7.4.2 Subjective Test Results 79
7.5 Summary 81
8 Conclusions 83
Bibliography 85
Appendix 95
Abstract (in Korean) 97
Feature Vector Compensation Technique Based on a Switching Linear Dynamic System for Robust Speech Recognition
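Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2012. Nam Soo Kim (김남수).
This thesis addresses how to improve speech recognition accuracy in noisy environments through feature parameter compensation using a Switching Linear Dynamic System (SLDS). Even the same speech signal exhibits different characteristics depending on background noise, the linear or nonlinear characteristics of the recording device, and reverberation. As a result, the characteristics of the models stored in the speech recognizer, which are built from clean speech, no longer match the characteristics of the distorted speech observed in real environments, and recognition performance degrades. One way to compensate for this problem is a preprocessing step that maps the feature parameters of distorted speech to the corresponding clean feature parameters. Such a preprocessing step is trained on a stereo data set in which the same speech is recorded in different environments. Existing feature compensation research has mainly relied on a one-to-one correspondence between feature parameter vectors. This thesis proposes an SLDS that performs feature parameter compensation while also taking past vectors into account. Because the SLDS can also use past inputs as information, it can produce a model that approximates the actual mapping function in a variety of environments. The thesis also covers a method for generating a stereo data set from HMM model parameters when no stereo data set is available.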
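The idea of using past vectors in the mapping can be sketched as a simplified switching linear recursion: each discrete state has its own linear map from the previous clean estimate and the current noisy feature vector, and the per-state predictions are blended by state posterior weights. Everything here is an assumption for illustration, including the function `slds_compensate`, the parameter names `A`, `B`, `c`, and the use of precomputed posteriors `state_post`. A full SLDS would also model uncertainty and infer the states, which this sketch omits.

```python
import numpy as np

def slds_compensate(noisy, A, B, c, state_post):
    """Map noisy features to clean estimates with a switching linear recursion.

    noisy:      (T, D) noisy feature vectors
    A, B:       (K, D, D) per-state matrices for the previous estimate
                and the current noisy vector (hypothetical parameters)
    c:          (K, D) per-state bias vectors
    state_post: (T, K) per-frame state weights, assumed to be given
    """
    T, D = noisy.shape
    clean = np.zeros((T, D))
    prev = noisy[0]  # assumption: start the recursion from the first noisy frame
    for t in range(T):
        # Per-state prediction from the previous estimate and current input.
        per_state = A @ prev + B @ noisy[t] + c  # shape (K, D)
        # Blend the K predictions with the state weights for this frame.
        clean[t] = state_post[t] @ per_state
        prev = clean[t]
    return clean
```

Setting every `A` matrix to zero removes the dependence on past vectors and reduces the recursion to the frame-by-frame one-to-one mapping that the abstract contrasts with the SLDS.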
