4 research outputs found

    Optimizing Vocabulary Modeling for Dysarthric Voice User Interface

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, Interdisciplinary Program in Cognitive Science, February 2016. Advisor: Minhwa Chung.

Frequent articulation errors, disfluency, frequent and irregular pauses, and a low speaking rate have been identified as causes of recognition errors when dysarthric speakers use a voice user interface. Prior work analyzed the acoustic and phonetic characteristics of dysarthric speech and, based on those analyses, proposed correction of dysarthric speech, acoustic model (AM) adaptation, pronunciation variation modeling, grammar modeling, and vocabulary modeling to compensate for these factors.
In this thesis, the acoustic model was optimized to reflect the characteristics of dysarthric speech. In addition, the vocabulary was composed of words selected with a generalized linear mixed model (GLMM) that relates recognition errors to the articulatory features of each phonetic class, and was further optimized by lowering the similarity between words.

Three problems in training an acoustic model for dysarthric speech recognition were addressed. First, the low speaking rate was compensated for by adjusting the FFT window length used for feature extraction and the number of HMM states. Second, GMM, Subspace GMM, and DNN models for the HMM emission probabilities were compared. Third, as a remedy for the shortage of training data, acoustic models trained on a large amount of non-dysarthric speech were evaluated in recognition experiments.

In the analysis of the relation between recognition errors and consonant classes, fricatives and nasals were statistically significant at the 0.05 level, and all vowel classes were significant. The mixed model yielded a lower AIC when consonants were categorized by manner of articulation rather than by place, and when vowels were categorized by tongue position rather than by height. Under this categorization, the more fricatives a word contained, the higher the recognition error, whereas nasals increased recognition accuracy; the estimates for plosives, affricates, and liquids were close to zero. All vowel classes increased recognition accuracy, with central vowels showing the largest estimate, followed by back vowels, front vowels, and diphthongs.

The potential of competing words to trigger recognition errors was modeled as inter-word similarity based on string distances, namely the Levenshtein distance and the cosine distance, and the effect of inter-word similarity on recognition results was confirmed by a minimum-versus-maximum similarity contrast and by N-best prediction. To construct the vocabulary, an articulation score was first calculated for each word with the GLMM, and the vocabulary was composed of the words with the maximum articulation scores. Words in the vocabulary with high mutual similarity were then replaced by words with lower similarity and high articulation scores.
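The abstract does not spell out the exact distance computation, but a minimal sketch of the two string distances it names might look as follows, assuming each word is represented as a sequence of phones and that the cosine distance is taken over bag-of-phones count vectors; the toy phone sequences and the length normalization are illustrative assumptions, not the thesis's implementation.

```python
from collections import Counter
from math import sqrt

def levenshtein(a, b):
    """Edit distance between two phone sequences (unit cost for ins/del/sub)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (pa != pb)))  # substitution
        prev = cur
    return prev[-1]

def cosine_distance(a, b):
    """1 minus the cosine similarity of bag-of-phones count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[p] * cb[p] for p in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / norm if norm else 1.0

def similarity(a, b):
    """Normalized Levenshtein similarity: 1 = identical, 0 = maximally different."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

# Toy phone sequences (illustrative, not the thesis's phone set).
w1 = ["p", "a", "l", "a"]
w2 = ["p", "a", "n", "a"]
print(levenshtein(w1, w2))           # 1
print(round(cosine_distance(w1, w2), 3))
print(round(similarity(w1, w2), 3))  # 0.75
```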
In dysarthric speech recognition experiments, the optimized vocabulary lowered the WER by 5.72% absolute (a 34.60% relative error reduction) compared with the conventional call-list vocabulary.
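To make the two-stage vocabulary construction described above concrete, the sketch below ranks candidate words by an articulation score and then greedily swaps out any word that is too similar to a word already kept. The `CLASS_WEIGHT` values, the `threshold`, and the `sim` callback (e.g., the normalized Levenshtein similarity from the previous sketch) are hypothetical stand-ins; the signs of the weights follow the reported findings, but they are not the thesis's fitted GLMM coefficients or tuned parameters.

```python
# Hypothetical per-class weights (fricatives raise errors; nasals and all
# vowel classes raise accuracy). NOT the thesis's fitted coefficients.
CLASS_WEIGHT = {
    "fricative": -1.0, "nasal": 0.8,
    "plosive": 0.0, "affricate": 0.0, "liquid": 0.0,
    "central_vowel": 1.2, "back_vowel": 0.9, "front_vowel": 0.6, "diphthong": 0.3,
}

def articulation_score(word, phone_class):
    """Score a word (phone sequence) by summing its phones' class weights;
    phone_class maps each phone symbol to an articulatory class name."""
    return sum(CLASS_WEIGHT.get(phone_class.get(p), 0.0) for p in word)

def build_vocabulary(candidates, phone_class, size, sim, threshold=0.7):
    """Stage 1: keep the `size` words with the highest articulation scores.
    Stage 2: greedily replace any word whose similarity to an already-kept
    word reaches `threshold` with the best-scoring dissimilar spare."""
    ranked = sorted(candidates, key=lambda w: articulation_score(w, phone_class),
                    reverse=True)
    chosen, spares = ranked[:size], ranked[size:]
    kept = []
    for word in chosen:
        if all(sim(word, k) < threshold for k in kept):
            kept.append(word)
            continue
        swap = next((s for s in spares
                     if all(sim(s, k) < threshold for k in kept)), None)
        if swap is not None:
            spares.remove(swap)
            kept.append(swap)
    return kept
```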

    Towards Automatic Speech-Language Assessment for Aphasia Rehabilitation

    Speech-based technology has the potential to reinforce traditional aphasia therapy through the development of automatic speech-language assessment systems. Such systems can provide clinicians with supplementary information to assist with progress monitoring and treatment planning, and can provide support for on-demand auxiliary treatment. However, current technology cannot support this type of application due to the difficulties associated with aphasic speech processing. The focus of this dissertation is on the development of computational methods that can accurately assess aphasic speech across a range of clinically relevant dimensions. The first part of the dissertation focuses on novel techniques for assessing aphasic speech intelligibility in constrained contexts. The second part investigates acoustic modeling methods that lead to significant improvement in aphasic speech recognition and allow the system to work with unconstrained speech samples. The final part demonstrates the efficacy of speech recognition-based analysis in automatic paraphasia detection, extraction of clinically motivated quantitative measures, and estimation of aphasia severity. The methods and results presented in this work will enable robust technologies for accurately recognizing and assessing aphasic speech, and will provide insights into the link between computational methods and clinical understanding of aphasia.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/140840/1/ducle_1.pd