7 research outputs found

    A perceptually-weighted deep neural network for monaural speech enhancement in various background noise conditions

    Deep neural networks (DNNs) have recently been shown to give state-of-the-art performance in monaural speech enhancement. However, the DNN training process does not fully exploit the perceptual difference between different components of the DNN output; equal importance is often assumed. To address this limitation, we propose a new perceptually-weighted objective function within a feedforward DNN framework, aiming to minimize the perceptual difference between the enhanced speech and the target speech. A perceptual weight is integrated into the proposed objective function, which has been tested on two types of output features: spectra and ideal ratio masks. Objective evaluations of both speech quality and speech intelligibility have been performed. Integrating our perceptual weight gives consistent improvement across several noise levels and a variety of noise types.
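
    The idea of weighting output components by perceptual importance, instead of assuming equal importance, can be illustrated with a minimal sketch. The abstract does not give the actual weight, so the function name `weighted_mse`, the three-bin frame, and the weight values below are all hypothetical and are not the authors' formulation.

```python
def weighted_mse(enhanced, target, weights):
    """Mean over bins k of weights[k] * (enhanced[k] - target[k]) ** 2."""
    if not (len(enhanced) == len(target) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total = sum(w * (e - t) ** 2 for e, t, w in zip(enhanced, target, weights))
    return total / len(target)


# Toy three-bin frame; values and weights are made up for illustration only.
target = [1.0, 0.5, 0.2]
enhanced = [0.8, 0.5, 0.6]

uniform = [1.0, 1.0, 1.0]      # conventional MSE: every bin counts equally
perceptual = [2.0, 1.0, 0.5]   # hypothetical weights that emphasise bin 0

plain_loss = weighted_mse(enhanced, target, uniform)
weighted_loss = weighted_mse(enhanced, target, perceptual)
```

    With uniform weights the function reduces to the ordinary mean square error, so the perceptual variant changes only how much each component's error contributes to the total.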

    Deep neural network for speech enhancement using a speech quality measurement model

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2019. Nam Soo Kim.

    This thesis addresses deep neural network (DNN) speech enhancement using a subject quality measurement (SQM) model. Conventional DNN-based speech enhancement is limited because its objective function is only weakly related to the metrics that express intelligibility and quality, so the optimization criterion and the evaluation criterion for the enhanced speech are inconsistent. To address this, the thesis connects two models: a speech enhancement model and a neural network model whose objective is a speech intelligibility or speech quality metric. The metrics are first measured for three cases (clean speech, noisy speech, and enhanced speech), a measurement model is trained to predict these values, and the speech enhancement model is then trained with the measurement model attached. While training to maximize the metric value output by the attached measurement model, the neural network architecture of the measurement model was varied, which improved the speed and accuracy with which the maximum was reached. Two metrics are used: the short-time objective intelligibility measure (STOI) and the perceptual evaluation of speech quality (PESQ). Models that estimate these two metrics are attached to the speech enhancement model, which is trained in a multi-task format that minimizes both the mean square error of the metric and the mean square error of the speech features. Experiments evaluated with PESQ and STOI confirmed higher metric values than the conventional speech enhancement DNN and better performance than the baseline model used by conventional techniques.

    Contents:
        1 Introduction
        2 Conventional Approaches for Speech Enhancement
            2.1 Deep Neural Network-based Speech Enhancement
                2.1.1 Deep Neural Network
                2.1.2 Deep Neural Network-based Speech Enhancement Network
        3 Subject Quality Measurement
            3.1 STOI
            3.2 PESQ
            3.3 DNN-based Speech Enhancement using Subject Quality Measurement Model
                3.3.1 Deep Neural Network-based Model
                3.3.2 Convolutional Neural Network-based Model
        4 Proposed Enhancement Model
        5 Experiment Design
            5.1 Noisy Speech Mixtures
            5.2 SQM Feature Extraction
            5.3 Neural Network Design
        6 Experimental Results
            6.1 Subject Quality Measurement Models Performance
            6.2 Speech Enhancement Models Performance using SQM Model as a Postfilter
        7 Conclusion and Future Work
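
    The multi-task training mentioned in the abstract above combines a feature error with a metric error. The following is a rough sketch under stated assumptions: the weighting factor `alpha`, the choice of a target metric score, and the function names are hypothetical, and the predicted metric is passed in as a plain number where the thesis uses a trained measurement network.

```python
def mse(a, b):
    """Plain mean square error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multitask_loss(enhanced_feat, clean_feat, predicted_metric,
                   target_metric, alpha=0.5):
    """Feature MSE plus alpha times the squared metric error.

    alpha and target_metric are illustrative assumptions, not values
    taken from the thesis.
    """
    feature_term = mse(enhanced_feat, clean_feat)
    metric_term = (predicted_metric - target_metric) ** 2
    return feature_term + alpha * metric_term


# Toy numbers: two feature bins and a metric score normalised to [0, 1].
loss = multitask_loss([0.5, 1.0], [1.0, 1.0],
                      predicted_metric=0.5, target_metric=1.0, alpha=0.5)
```

    Setting `alpha` to zero recovers a conventional feature-only objective, which makes it easy to compare the two training criteria on the same data.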
