
    Optimizing Vocabulary Modeling for Dysarthric Voice User Interface

Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, Interdisciplinary Program in Cognitive Science, February 2016. Advisor: ์ •๋ฏผํ™”.

For speakers with dysarthria using a voice user interface, frequent articulation errors, disfluency, frequent and irregular pauses, and a slow speaking rate cause misrecognition. Previous studies analyzed the acoustic and phonological characteristics of dysarthric speech and, on that basis, compensated for recognition errors through correction of dysarthric speech, acoustic model adaptation, pronunciation variation modeling, and grammar and vocabulary modeling. In this thesis, the acoustic model was optimized to reflect the characteristics of dysarthric speech. In addition, when constructing the vocabulary, the relationship between phoneme-class-based articulatory features and recognition errors was modeled and used as a word selection criterion, and the recognition errors that arise when dysarthric speakers use a voice interface were further reduced by lowering the similarity between words.

To build an acoustic model for dysarthric speakers, first, the feature-extraction window size and the number of states per HMM were adjusted to the slow speaking rate of dysarthric speakers, which lowered the error rate. Second, GMM, Subspace GMM, and DNN models were introduced as the HMM emission probability model and their recognition errors were compared. Third, the effectiveness of adding non-dysarthric speech as a remedy for the shortage of training data was verified through recognition experiments.

In the mixed-model (GLMM) analysis of articulatory features and recognition error rates, the consonant classes correlated with the error rate at the 0.05 significance level were fricatives and nasals, and all vowel classes were correlated with the error rate at the 0.05 significance level. The mixed model also showed a lower AIC when consonants were categorized by manner of articulation rather than by place of articulation, and when vowels were categorized by tongue position rather than by tongue height. With phonemes categorized by manner of articulation and tongue position, fricatives raised the recognition error rate the more often they appeared in a word, whereas nasals raised recognition accuracy. The effects of plosives, affricates, and liquids on the recognition result were close to zero. All vowel classes raised recognition accuracy; central vowels had the largest effect, followed in decreasing order by back vowels, front vowels, and diphthongs.

The likelihood of words triggering recognition errors for one another was modeled as inter-word similarity based on string distances such as the Levenshtein distance and the cosine distance, and a minimum-maximum similarity comparison together with N-best estimation confirmed that inter-word similarity so defined can be a factor affecting speech recognition. In constructing the vocabulary for dysarthric speech recognition, an articulation score was first computed for each word with the articulatory-feature-based mixed model, and the recognition word list was built from the words maximizing this score. The vocabulary was then revised to minimize inter-word similarity; in the resulting experiments, recognition errors decreased by 5.7% absolute and 34.6% relative compared with the conventional call-word (ํ†ตํ™”ํ‘œ) list.
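A minimal sketch of the wider analysis window mentioned above, assuming 16 kHz audio; the file name and the 25 ms / 35 ms window lengths are hypothetical placeholders rather than values reported in the thesis, and the matching HMM state-count change is a recognizer-side setting not shown here.

```python
import librosa

def mfcc_with_window(path, win_ms, hop_ms=10.0, sr=16000, n_mfcc=13):
    """Extract MFCCs with a configurable analysis window length (ms)."""
    y, sr = librosa.load(path, sr=sr)
    win = int(sr * win_ms / 1000)           # window length in samples
    hop = int(sr * hop_ms / 1000)           # frame shift in samples
    n_fft = 1 << (win - 1).bit_length()     # next power of two >= window
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, win_length=win, hop_length=hop)

# 25 ms is a common front-end default; a (hypothetical) 35 ms window gives the
# spectral analysis more temporal context for slow dysarthric speech.
baseline = mfcc_with_window("utterance.wav", win_ms=25.0)
widened  = mfcc_with_window("utterance.wav", win_ms=35.0)
```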
Articulation errors, disfluency, impulsive pauses, and a low speaking rate have been suggested as factors behind recognition errors for dysarthric speakers using a voice user interface. In related work, correction of dysarthric speech, acoustic model (AM) adaptation, pronunciation variation modeling, grammar modeling, and vocabulary modeling, all based on acoustic and phonetic analyses of dysarthric speech, were proposed to compensate for these factors.

In this thesis, the acoustic model was optimized, and the words in the vocabulary were selected with a GLMM that modeled the relationship between recognition errors and the articulatory features of phoneme classes, then further optimized by lowering the similarity between words. Three problems in training an AM for dysarthric speech recognition were addressed: first, the low speaking rate was compensated by varying the FFT window length and the number of HMM states; second, models for the HMM emission probability were compared; third, an AM trained on a large amount of non-dysarthric speech was evaluated. In the analysis of the relation between recognition errors and consonant classes, fricatives and nasals were statistically significant, and all vowel classes were significant. The AIC was lower when consonants were categorized by manner of articulation rather than by place, and when vowels were categorized by tongue position rather than by tongue height. Fricatives increased the WER, whereas nasals increased recognition accuracy; the estimates for plosives, affricates, and liquids were close to zero. All vowel classes increased accuracy, with the estimate for central vowels the largest, followed by back vowels, front vowels, and diphthongs. The triggering of recognition errors by competing words was modeled as inter-word similarity based on the Levenshtein distance and the cosine distance, and the effect of inter-word similarity on recognition results was confirmed by a minimum-maximum similarity contrast and by N-best prediction. Before modeling the vocabulary, an articulation score was calculated for each word: first, the vocabulary was composed of the words with the maximum articulation scores; second, words in the vocabulary with high similarity were replaced by words with lower similarity and large articulation scores.
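A minimal sketch of the two string-based similarity measures named above, assuming words are represented as phoneme sequences; the example sequences are hypothetical romanizations, not items from the thesis vocabulary.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two phoneme sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def cosine_distance(a, b):
    """1 - cosine similarity between the phoneme-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / norm

# Hypothetical romanized phoneme sequences for two candidate command words.
w1 = list("namu")
w2 = list("nabi")
print(levenshtein(w1, w2))                 # -> 2
print(round(cosine_distance(w1, w2), 3))   # -> 0.5
```

In the vocabulary optimization step described above, a word whose similarity to the rest of the list is high would then be swapped for a candidate with lower similarity and a comparably large articulation score.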
In dysarthric speech recognition experiments, the optimized vocabulary lowered the WER by 5.72% absolute (a 34.60% relative error reduction).
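A quick arithmetic check of how the two figures relate, assuming the absolute and relative reductions refer to the same baseline test condition; the implied baseline and optimized WERs below are inferences, not numbers reported in the abstract.

```python
# Relative error reduction = absolute WER drop / baseline WER.
absolute_drop = 5.72         # percentage points
relative_drop = 0.3460       # 34.60 % relative error reduction
baseline_wer = absolute_drop / relative_drop    # implied baseline, ~16.5 %
optimized_wer = baseline_wer - absolute_drop    # implied optimized, ~10.8 %
print(round(baseline_wer, 1), round(optimized_wer, 1))
```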
Contents

Chapter 1  Introduction
Chapter 2  Related Work
    Section 1  Analysis of articulation errors in dysarthric speech
    Section 2  Architecture and models of a speech recognition system
        2.1  Feature extraction
        2.2  Acoustic model
        2.3  Pronunciation model
        2.4  Language model
    Section 3  Application areas of speech recognition systems
    Section 4  Recognition of dysarthric speech
Chapter 3  Building the feature extraction and acoustic model baseline
    Section 1  Approach
    Section 2  Development environment
        2.1  Speech data
        2.2  Recognition environment and acoustic model training procedure
    Section 3  Recognition experiments
        3.1  Speaking-rate modeling
        3.2  Optimization of emission probability model parameters
        3.3  Composition of acoustic model training data
        3.4  Analysis and summary of experimental results
    Section 4  Conclusion
Chapter 4  Word selection criteria based on articulation error characteristics
    Section 1  Definition of Korean phonemes and categorization by articulatory features
    Section 2  Research objectives
    Section 3  Analysis data
        3.1  Speech data
        3.2  Phoneme categories and statistics
        3.3  Speech recognition
        3.4  Data analysis
    Section 4  Analysis results
    Section 5  Conclusion
Chapter 5  Vocabulary optimization by minimizing inter-word similarity
    Section 1  String-distance-based inter-word similarity
    Section 2  Effect of inter-word similarity on the recognition rate
    Section 3  N-best estimation
Chapter 6  Recognition experiments
    Section 1  Construction of the word lists
        1.1  Baseline word list
        1.2  Articulation-score-maximizing word list
        1.3  Inter-word-similarity-minimizing word list
    Section 2  Preliminary experiments
    Section 3  Experimental setup
        3.1  Composition of the speech corpus
    Section 4  Recognition results
        4.1  Baseline model
        4.2  Articulation-score-maximizing model
        4.3  Inter-word-similarity-minimizing model
Chapter 7  Conclusion
    Section 1  Summary and evaluation of results
    Section 2  Summary of contributions
    Section 3  Future work
References
Appendix
Abstract (in English)