A perceptually-weighted deep neural network for monaural speech enhancement in various background noise conditions
Deep neural networks (DNNs) have recently been shown to give state-of-the-art performance in monaural speech enhancement. However, the DNN training process does not fully exploit the perceptual differences between components of the DNN output; equal importance is often assumed. To address this limitation, we propose a new perceptually-weighted objective function within a feedforward DNN framework, aiming to minimize the perceptual difference between the enhanced speech and the target speech. A perceptual weight is integrated into the proposed objective function and tested on two types of output features: spectra and ideal ratio masks. Objective evaluations of both speech quality and speech intelligibility have been performed. Integrating our perceptual weight yields consistent improvement across several noise levels and a variety of noise types.
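To make the idea of a weighted objective concrete, here is a minimal sketch in plain Python. It is not the authors' objective function: the abstract does not say how the perceptual weight is computed, so `weights` below is a hypothetical per-bin placeholder, and the function names are illustrative. The ideal ratio mask uses the commonly cited form (S / (S + N)) ** beta on power values, with beta = 0.5 assumed.

```python
def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Ideal ratio mask per bin: (S / (S + N)) ** beta, with S and N as powers.

    beta=0.5 is a common choice and is an assumption here.
    Bins with zero total power are mapped to 0.0.
    """
    return [
        (s / (s + n)) ** beta if (s + n) > 0 else 0.0
        for s, n in zip(speech_power, noise_power)
    ]


def weighted_mse(estimate, target, weights):
    """Mean over bins of w_k * (estimate_k - target_k) ** 2.

    `weights` stands in for a perceptual weight (hypothetical values);
    all-ones weights reduce this to the ordinary mean squared error.
    """
    if not (len(estimate) == len(target) == len(weights)):
        raise ValueError("estimate, target and weights must have equal length")
    total = sum(w * (e - t) ** 2 for e, t, w in zip(estimate, target, weights))
    return total / len(estimate)


# Example: a mask target for three bins and a network estimate of it.
target_mask = ideal_ratio_mask([4.0, 1.0, 0.0], [0.0, 1.0, 2.0])
estimated_mask = [0.9, 0.5, 0.1]
uniform_loss = weighted_mse(estimated_mask, target_mask, [1.0, 1.0, 1.0])
emphasised_loss = weighted_mse(estimated_mask, target_mask, [2.0, 1.0, 0.5])
```

With uniform weights every bin contributes equally, which is the equal-importance assumption the abstract criticizes. Non-uniform weights shift the penalty toward the bins treated as more important.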
Speech enhancement deep neural network using a speech metric measurement model
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. Kim Nam Soo.
This thesis addresses deep neural network (DNN) speech enhancement using a subject quality measurement (SQM) model. In conventional DNN-based speech enhancement, the model optimization criterion is inconsistent with the criteria used to evaluate the enhanced speech, namely intelligibility and quality metrics. To compensate for this, we connect two models: a speech enhancement model and a neural network model whose target is a speech intelligibility or speech quality metric. The metric is measured for three cases (clean speech, noisy mixed speech, and enhanced speech), a measurement model is trained to estimate these values, and that model is then connected to the speech enhancement model to train it. In addition, while training to maximize the metric value output by the measurement model connected to the enhancement model, we varied the structure of the measurement model's neural network, which improved the speed and accuracy with which the maximum is reached. Two metrics are used: short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). Models that estimate these two metrics are connected to the speech enhancement model, and training is carried out in a multi-task format that minimizes both the mean square error of the metric and the mean square error of the speech features. Experiments evaluated with PESQ and STOI confirm that the proposed model achieves higher metric values than a conventional speech enhancement DNN.
Abstract (In Korean)
Contents
List of Tables
List of Figures
1 Introduction
2 Conventional Approaches for Speech Enhancement
2.1 Deep Neural Network-based Speech Enhancement
2.1.1 Deep Neural Network
2.1.2 Deep Neural Network-based Speech Enhancement Network
3 Subject Quality Measurement
3.1 STOI
3.2 PESQ
3.3 DNN-based Speech Enhancement using Subject Quality Measurement Model
3.3.1 Deep Neural Network-based Model
3.3.2 Convolutional Neural Network-based Model
4 Proposed Enhancement Model
5 Experiment Design
5.1 Noisy Speech Mixtures
5.2 SQM Feature Extraction
5.3 Neural Network Design
6 Experimental Results
6.1 Subject Quality Measurement Models Performance
6.2 Speech Enhancement Models Performance using SQM Model as a Postfilter
7 Conclusion and Future Work
Abstract
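The multi-task training mentioned in the thesis abstract above can be sketched as a single loss that adds a speech-feature error and a metric error. This is only an illustration under stated assumptions, not the thesis implementation: the function names, the weighting factor `alpha`, and the use of a plain sum are hypothetical, and the metric values would in practice come from a trained measurement model rather than being passed in directly.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multitask_loss(enhanced_feat, clean_feat,
                   predicted_metric, target_metric, alpha=1.0):
    """Feature MSE plus alpha times metric MSE.

    enhanced_feat / clean_feat: speech features from the enhancement
        model and from the clean reference.
    predicted_metric / target_metric: metric values (for example STOI or
        PESQ estimates) for the enhanced speech and the values to reach.
    alpha: hypothetical trade-off weight; it is not given in the source.
    """
    feature_term = mse(enhanced_feat, clean_feat)
    metric_term = mse(predicted_metric, target_metric)
    return feature_term + alpha * metric_term


# Example with made-up numbers: two feature bins and one metric value.
loss = multitask_loss([1.0, 3.0], [0.0, 1.0], [0.5], [1.0], alpha=2.0)
```

Setting `alpha` to zero recovers feature-only training, the conventional criterion that the abstract describes as inconsistent with the evaluation metrics.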