57 research outputs found

    Optimizing Vocabulary Modeling for Dysarthric Voice User Interface

Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Interdisciplinary Program in Cognitive Science, February 2016. Advisor: Minhwa Chung.
Articulation errors, disfluency, impulsive pauses, and a low speaking rate have been identified as factors behind recognition errors when dysarthric speakers use a voice user interface. Related work analyzed the acoustic and phonetic characteristics of dysarthric speech and, on that basis, proposed correction of dysarthric speech, acoustic model (AM) adaptation, pronunciation variation modeling, grammar modeling, and vocabulary modeling to compensate for those factors.
In this thesis, the acoustic model was optimized for dysarthric speech, and the recognition vocabulary was composed of words selected by a generalized linear mixed model (GLMM) relating recognition errors to the articulatory features of phonetic classes, then refined by lowering the similarity between words. Three problems in training an AM for dysarthric speech recognition were addressed: first, the low speaking rate was compensated by varying the window length of the FFT and the number of HMM states; second, candidate models for the HMM emission probability (GMM, Subspace GMM, and DNN) were compared; third, an AM trained on a large amount of non-dysarthric speech was evaluated as a remedy for the shortage of training data.
In the analysis relating recognition errors to consonant classes, fricatives and nasals were statistically significant at the 0.05 level, and all vowel classes were significant. The mixed model yielded a lower AIC when consonants were categorized by manner of articulation rather than by place, and when vowels were categorized by tongue position rather than by height. Under that categorization, fricatives in a word increased the word error rate (WER) while nasals increased recognition accuracy; the estimates for plosives, affricates, and liquids were close to zero. All vowel classes increased accuracy, with central vowels having the largest estimate, followed by back vowels, front vowels, and diphthongs.
The potential of competing words to trigger recognition errors was modeled as inter-word similarity based on Levenshtein and cosine string distances, and the effect of this similarity on recognition results was confirmed by a minimum-versus-maximum similarity contrast and by N-best prediction. To build the vocabulary, an articulation score was first computed for each word with the GLMM and the word list was composed of the words with the maximum scores; words in the vocabulary with high inter-word similarity were then replaced by less similar words with large articulation scores. In dysarthric speech recognition, the optimized vocabulary lowered the WER by 5.72% absolute (34.60% relative) compared with the baseline word list.
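The word-similarity criterion at the heart of the vocabulary optimization is easy to illustrate. Below is a minimal Python sketch, not the thesis code: it computes a normalized Levenshtein similarity and greedily builds a word list whose entries are maximally dissimilar, so fewer confusable competitors enter the recognizer. The greedy selection rule and the example words are illustrative assumptions.

    # Normalized Levenshtein similarity and a greedy low-similarity vocabulary.
    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    def similarity(a: str, b: str) -> float:
        """Similarity in [0, 1]; 1.0 means identical strings."""
        return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

    def select_vocabulary(candidates, size):
        """Greedily keep words that minimize the maximum similarity to
        the words already chosen (an illustrative selection rule)."""
        chosen = [candidates[0]]
        while len(chosen) < size:
            rest = [w for w in candidates if w not in chosen]
            chosen.append(min(rest, key=lambda w: max(similarity(w, c)
                                                      for c in chosen)))
        return chosen

    print(select_vocabulary(["call", "tall", "ball", "stop", "play"], 3))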

    Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization

    Automatic speech recognition (ASR) has recently become an important challenge when using deep learning (DL). It requires large-scale training datasets and high computational and storage resources. Moreover, DL techniques and machine learning (ML) approaches in general hypothesize that training and testing data come from the same domain, with the same input feature space and data distribution characteristics. This assumption, however, does not hold in some real-world artificial intelligence (AI) applications. There are also situations where gathering real data is challenging, expensive, or the event of interest is rare, so the data requirements of DL models cannot be met. Deep transfer learning (DTL) has been introduced to overcome these issues; it helps develop high-performing models from real datasets that are small or slightly different from, but related to, the training data. This paper presents a comprehensive survey of DTL-based ASR frameworks to shed light on the latest developments and to help academics and professionals understand current challenges. Specifically, after presenting the DTL background, a well-designed taxonomy is adopted to organize the state of the art. A critical analysis is then conducted to identify the limitations and advantages of each framework. A comparative study then highlights the current challenges before deriving opportunities for future research.
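    To make the surveyed setting concrete, here is a minimal PyTorch sketch of the most common DTL recipe for ASR: reuse feature layers pre-trained on a large source corpus, freeze them, and fine-tune only a small task head on the scarce target data. The model, layer names, and checkpoint path are placeholders, not from any specific framework in the survey.

        # Hypothetical two-part model: frozen pre-trained features + trainable head.
        import torch
        import torch.nn as nn

        class TinyASRModel(nn.Module):
            def __init__(self, n_mels=80, hidden=256, n_tokens=100):
                super().__init__()
                self.features = nn.Sequential(              # pre-trained on source domain
                    nn.Linear(n_mels, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU())
                self.head = nn.Linear(hidden, n_tokens)     # adapted to target domain

        model = TinyASRModel()
        # model.load_state_dict(torch.load("source_ckpt.pt"))  # hypothetical checkpoint

        for p in model.features.parameters():               # transfer: keep source features
            p.requires_grad = False

        opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
        x = torch.randn(8, 80)                              # stand-in target-domain frames
        loss = nn.functional.cross_entropy(model.head(model.features(x)),
                                           torch.randint(0, 100, (8,)))
        loss.backward()
        opt.step()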

    Towards Automatic Speech-Language Assessment for Aphasia Rehabilitation

    Speech-based technology has the potential to reinforce traditional aphasia therapy through the development of automatic speech-language assessment systems. Such systems can provide clinicians with supplementary information to assist with progress monitoring and treatment planning, and can provide support for on-demand auxiliary treatment. However, current technology cannot support this type of application due to the difficulties associated with aphasic speech processing. The focus of this dissertation is on the development of computational methods that can accurately assess aphasic speech across a range of clinically-relevant dimensions. The first part of the dissertation focuses on novel techniques for assessing aphasic speech intelligibility in constrained contexts. The second part investigates acoustic modeling methods that lead to significant improvement in aphasic speech recognition and allow the system to work with unconstrained speech samples. The final part demonstrates the efficacy of speech recognition-based analysis in automatic paraphasia detection, extraction of clinically-motivated quantitative measures, and estimation of aphasia severity. The methods and results presented in this work will enable robust technologies for accurately recognizing and assessing aphasic speech, and will provide insights into the link between computational methods and clinical understanding of aphasia.
    Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/140840/1/ducle_1.pd

    Personalised Dialogue Management for Users with Speech Disorders

    Many electronic devices are beginning to include Voice User Interfaces (VUIs) as an alternative to conventional interfaces. VUIs are especially useful for users with restricted upper limb mobility, because they cannot use keyboards and mice. These users, however, often suffer from speech disorders (e.g. dysarthria), making Automatic Speech Recognition (ASR) challenging and thus degrading the performance of the VUI. Partially Observable Markov Decision Process (POMDP) based Dialogue Management (DM) has been shown to improve interaction performance in challenging ASR environments, but most of the research in this area has focused on Spoken Dialogue Systems (SDSs) developed to provide information, where the users interact with the system only a few times. In contrast, most VUIs are likely to be used by a single speaker over a long period of time, but very little research has been carried out on the adaptation of DM models to specific speakers. This thesis explores methods to adapt DM models (in particular dialogue state tracking models and policy models) to a specific user during a longitudinal interaction. The main differences between personalised VUIs and typical SDSs are identified and studied. Then, state-of-the-art DM models are modified to be used in scenarios which are unique to long-term personalised VUIs, such as personalised models initialised with data from different speakers or scenarios where the dialogue environment (e.g. the ASR) changes over time. In addition, several speaker- and environment-related features are shown to be useful for improving interaction performance. This study is done in the context of homeService, a VUI developed to help users with dysarthria control their home devices. The study shows that personalisation of the POMDP-DM framework can greatly improve the performance of these interfaces.
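    The POMDP machinery behind such dialogue managers reduces to a simple belief update over hidden user goals. The Python sketch below is a toy illustration, not the homeService system: after the system takes action a and receives a possibly misrecognized observation o, the belief is re-weighted by the observation likelihood and the transition model. All probabilities here are made up.

        # Toy POMDP belief update: b'(s') is proportional to O(o|s',a) * sum_s T(s'|s,a) b(s).
        import numpy as np

        def belief_update(b, T, O, a, o):
            """b: belief over states; T[a][s, s']: transition probabilities;
            O[a][s', o]: observation probabilities."""
            predicted = b @ T[a]              # predict next-state distribution
            updated = O[a][:, o] * predicted  # weight by observation likelihood
            return updated / updated.sum()    # renormalize

        # Two hidden user goals, one system action, two possible ASR observations;
        # the observation model is noisy, as with dysarthric ASR.
        T = {0: np.array([[0.9, 0.1], [0.1, 0.9]])}
        O = {0: np.array([[0.8, 0.2], [0.3, 0.7]])}
        print(belief_update(np.array([0.5, 0.5]), T, O, a=0, o=0))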

    Graph-based Estimation of Information Divergence Functions

    Information divergence functions, such as the Kullback-Leibler divergence or the Hellinger distance, play a critical role in statistical signal processing and information theory; however, estimating them can be a challenge. Most often, parametric assumptions are made about the two distributions to estimate the divergence of interest. In cases where no parametric model fits the data, non-parametric density estimation is used. In statistical signal processing applications, Gaussianity is usually assumed, since closed-form expressions for common divergence measures have been derived for this family of distributions. Parametric assumptions are preferred when it is known that the data follow the model; however, this is rarely the case in real-world scenarios. Non-parametric density estimators are characterized by a very large number of parameters that have to be tuned with costly cross-validation. In this dissertation we focus on a specific family of non-parametric estimators, called direct estimators, that bypass density estimation completely and directly estimate the quantity of interest from the data. We introduce a new divergence measure, the D_p-divergence, that can be estimated directly from samples without parametric assumptions on the distribution. We show that the D_p-divergence bounds the binary, cross-domain, and multi-class Bayes error rates and, in certain cases, provides provably tighter bounds than the Hellinger divergence. In addition, we propose a new methodology that allows the experimenter to construct direct estimators for existing divergence measures or to construct new divergence measures with custom properties tailored to the application. To examine the practical efficacy of these new methods, we evaluate them in a statistical learning framework on a series of real-world data science problems involving speech-based monitoring of neuro-motor disorders.
    Doctoral Dissertation, Electrical Engineering, 201
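    The "direct estimation" idea admits a compact sketch. The Python below illustrates the minimal-spanning-tree (Friedman-Rafsky style) construction commonly used for this divergence family: pool the two samples, build a Euclidean MST, and count edges joining points from different samples; few cross-edges indicate well-separated distributions. The normalization follows the standard MST-based estimator form and should be read as illustrative, not as the dissertation's exact derivation.

        # Graph-based direct estimate of divergence between samples X and Y.
        import numpy as np
        from scipy.sparse.csgraph import minimum_spanning_tree
        from scipy.spatial.distance import cdist

        def dp_divergence_estimate(X, Y):
            m, n = len(X), len(Y)
            pooled = np.vstack([X, Y])
            labels = np.array([0] * m + [1] * n)
            mst = minimum_spanning_tree(cdist(pooled, pooled)).tocoo()
            cross = sum(labels[i] != labels[j]           # edges joining X to Y
                        for i, j in zip(mst.row, mst.col))
            return max(0.0, 1.0 - cross * (m + n) / (2.0 * m * n))

        rng = np.random.default_rng(0)
        same = dp_divergence_estimate(rng.normal(0, 1, (200, 2)),
                                      rng.normal(0, 1, (200, 2)))
        apart = dp_divergence_estimate(rng.normal(0, 1, (200, 2)),
                                       rng.normal(4, 1, (200, 2)))
        print(f"overlapping: {same:.2f}  separated: {apart:.2f}")  # ~0 vs ~1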

    Factors influencing the efficacy of delayed auditory feedback in treating dysarthria associated with Parkinson's disease

    Parkinson's disease patients exhibit a high prevalence of speech deficits, including excessive speech rate, reduced intelligibility, and disfluencies. The present study examined the effects of delayed auditory feedback (DAF) as a rate control intervention for dysarthric speakers with Parkinson's disease. Adverse reactions to relatively long delay intervals are commonly observed during clinical use of DAF, and seem to result from improper matching of the delayed signal. To facilitate optimal use of DAF, therefore, clinicians must provide instruction, modeling, and feedback. Clinician instruction is frequently used in speech-language therapy, but has not been evaluated during use of DAF-based interventions. Therefore, the primary purpose of the present study was to evaluate the impact of clinician instruction on the effectiveness of DAF in treating speech deficits. A related purpose was to compare the effects of different delay intervals on speech behaviors. An A-B-A-B single-subject design was utilized. The A phases consisted of a sentence reading task using DAF, while the B phases incorporated clinician instruction into the DAF protocol. During each of the 16 experimental sessions, speakers read with four different delay intervals (0 ms, 50 ms, 100 ms, and 150 ms). During the B phases, the experimenter provided verbal feedback and modeling pertaining to how precisely the speaker matched the delayed signal. The dependent variables measured were speech rate, percent intelligible syllables, and percent disfluencies. Three males with Parkinson's disease and an associated dysarthria participated in the study. Results revealed that for all three speakers, DAF significantly reduced reading rate and produced significant improvements in either intelligibility (Speaker 3) or fluency (Speakers 1 and 2). A delay interval of 150 ms produced the greatest reductions in reading rates for all three speakers, although any of the DAF settings used was sufficient to produce significant improvements in either intelligibility or fluency. In addition, supplementing the DAF intervention with clinician instruction resulted in significantly greater gains from DAF. These findings confirmed the effectiveness of various intervals of DAF in improving speech deficits in Parkinson's disease speakers, particularly when patients are provided with instruction and modeling from the clinician.

    Automatic analysis of pathological speech

    The severity of a speech disorder is often measured in terms of speech intelligibility. In clinical practice this measure is usually obtained with a perceptual test. Such a test is inherently subjective, since the therapist administering it often knows the patient (and the disorder) and is also familiar with the test material. It is therefore worth investigating whether speech recognition can be used to create an objective assessor of intelligibility. This thesis develops a methodology to automate a standardized perceptual test, the Dutch Intelligibility Assessment (Nederlandstalig Spraakverstaanbaarheidsonderzoek, NSVO). Speech recognition is used to characterize the patient phonologically and phonemically, and an intelligibility score is derived from this characterization. Experiments have shown that the computed scores are highly reliable. Because the NSVO uses nonsense words, children in particular can make reading errors. New methods were therefore developed, based on meaningful running speech, that are robust against such errors and can also be used across languages. With these new models it proved possible to compute reliable intelligibility scores for Flemish, Dutch, and German speech. Finally, the research also took important steps towards an automatic characterization of other aspects of the speech disorder, such as articulation and voicing.
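    The core of such an automated test is easy to sketch. The Python below is a toy illustration, not the thesis system: it takes recognizer output for productions of known target items, aligns the recognized phoneme strings against the targets with an edit-distance alignment, and reports the percentage of phonemes recovered as an objective intelligibility score. The actual NSVO automation performs a much richer phonological and phonemic characterization.

        # Toy intelligibility score: percentage of target phonemes recognized.
        from difflib import SequenceMatcher

        def intelligibility_score(targets, recognized):
            """targets / recognized: space-separated phoneme strings per item."""
            correct = total = 0
            for t, r in zip(targets, recognized):
                match = SequenceMatcher(None, t.split(), r.split())
                correct += sum(b.size for b in match.get_matching_blocks())
                total += len(t.split())
            return 100.0 * correct / total

        # Two nonsense items; the hypothetical recognizer output misses one phoneme.
        print(intelligibility_score(["p a t", "k a t"], ["p a t", "t a t"]))  # 83.3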

    Multinomial logistic regression probability ratio-based feature vectors for Malay vowel recognition

    Vowel recognition is a part of automatic speech recognition (ASR) systems that classifies speech signals into groups of vowels. The performance of Malay vowel recognition (MVR), like any multiclass classification problem, depends largely on the feature vectors (FVs). FVs such as Mel-frequency cepstral coefficients (MFCC) have produced high error rates due to poor phoneme information. Classifier-transformed probabilistic features have proved a better alternative for conveying phoneme information. However, the high dimensionality of the probabilistic features introduces additional complexity that deteriorates ASR performance. This study aims to improve MVR performance by proposing an algorithm that transforms MFCC FVs into a new set of features using multinomial logistic regression (MLR) to reduce the dimensionality of the probabilistic features. The study was carried out in four phases: pre-processing and feature extraction, generation of the best regression coefficients, feature transformation, and performance evaluation. The speech corpus consists of 1953 samples of the five Malay vowels /a/, /e/, /i/, /o/ and /u/, recorded from students of two public universities in Malaysia. Two algorithms were developed: DBRC and FELT. The DBRC algorithm determines the best set of regression coefficients (RCs) from the extracted 39-MFCC FVs through a resampling and data-swapping approach. The FELT algorithm transforms the 39-MFCC FVs into FELT FVs using a logistic transformation. Vowel recognition rates of FELT and 39-MFCC FVs were compared using four classification techniques: artificial neural networks, MLR, linear discriminant analysis, and k-nearest neighbour. Classification results showed that FELT FVs surpass 39-MFCC FVs in MVR: depending on the classifier, FELT improved performance over MFCC by 1.48%-11.70%, and it significantly improved the recognition accuracy of the vowels /o/ and /u/ by 5.13% and 8.04% respectively. This study contributes two algorithms, one for determining the best set of RCs and one for generating FELT FVs from MFCC. The FELT FVs eliminate the need for dimensionality reduction while delivering comparable performance, and they improved MVR for all five vowels, especially /o/ and /u/. The improved MVR performance will spur the development of Malay speech-based systems, especially for the Malaysian community.
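    The feature transformation at the center of the study can be sketched in a few lines. The Python below, using scikit-learn on synthetic stand-in data, illustrates the underlying idea rather than the DBRC/FELT algorithms themselves: fit a multinomial logistic regression on 39-dimensional MFCC vectors and use its five class probabilities as a compact, phoneme-aware feature set.

        # 39-D MFCC vectors -> 5-D vowel-probability features via multinomial LR.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        vowels, n_per_class = ["a", "e", "i", "o", "u"], 100
        X = np.vstack([rng.normal(loc=k, scale=1.0, size=(n_per_class, 39))
                       for k in range(len(vowels))])     # synthetic 39-MFCC frames
        y = np.repeat(np.arange(len(vowels)), n_per_class)

        mlr = LogisticRegression(max_iter=1000).fit(X, y)
        prob_features = mlr.predict_proba(X)             # low-dimensional features
        print(prob_features.shape)                       # (500, 5)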

    Exploring and Evaluating Personalized Models for Code Generation

    Large Transformer models have achieved state-of-the-art status for natural language understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off between total compute cost and predictive performance, which we evaluate by code- and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.
    Comment: Accepted to the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022), Industry Track, Singapore, November 14-18, 2022; to appear, 9 pages.
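    Of the three approaches, lightweight fine-tuning is the easiest to show in code. The PyTorch sketch below freezes the Transformer body and leaves only the token embeddings and the output (softmax) projection trainable; the module names and the tiny demo model are illustrative assumptions, not the paper's implementation.

        # Freeze everything except embeddings and the output projection.
        import torch.nn as nn

        def lightweight_finetune(model, trainable=("embed_tokens", "lm_head")):
            """Mark parameters trainable iff their name contains one of the
            given substrings (a hypothetical naming convention)."""
            for name, p in model.named_parameters():
                p.requires_grad = any(key in name for key in trainable)
            n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
            n_total = sum(p.numel() for p in model.parameters())
            print(f"training {n_train:,} of {n_total:,} parameters")
            return model

        class TinyCodeLM(nn.Module):                     # stand-in for a code model
            def __init__(self, vocab=1000, d=64):
                super().__init__()
                self.embed_tokens = nn.Embedding(vocab, d)
                self.body = nn.TransformerEncoder(
                    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)
                self.lm_head = nn.Linear(d, vocab)

        lightweight_finetune(TinyCodeLM())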

    Improving multilingual speech recognition systems

    End-to-end trainable deep neural networks have become the state-of-the-art architecture for automatic speech recognition (ASR), provided that the network is trained with a sufficiently large dataset. However, many existing languages are too sparsely resourced for deep learning networks to achieve as high accuracy as their resource-abundant counterparts. Multilingual recognition systems mitigate data sparsity by training models on data from multiple language resources to learn a speech-to-text or speech-to-phone model universal to all languages. The resulting multilingual ASR models usually have better recognition accuracy than models trained on the individual datasets. In this work, we propose that two limitations exist for multilingual systems and that resolving them can improve recognition accuracy: (1) existing corpora vary considerably in form (spontaneous or read speech), size, noise level, and phoneme distribution, so ASR models trained on the joint multilingual dataset have large performance disparities across languages. We present an optimizable loss function, the equal accuracy ratio (EAR), that measures the sequence-level performance disparity between different user groups, and we show that explicitly optimizing this objective reduces the performance gap and improves multilingual recognition accuracy. (2) While having good accuracy on the seen training languages, multilingual systems do not generalize well to unseen testing languages, which we refer to as cross-lingual recognition accuracy. We introduce language embeddings built from external linguistic typologies and show that such embeddings can significantly increase both multilingual and cross-lingual accuracy. We illustrate the effectiveness of the proposed methods with experiments on multilingual, multi-user, and multi-dialect corpora.
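    To make limitation (1) concrete, here is a toy Python sketch in the spirit of a disparity-aware objective: the usual mean recognition loss is combined with a penalty on the ratio between the worst and best per-language losses, so optimization cannot favor resource-rich languages. The ratio form and weighting are illustrative assumptions, not the paper's exact EAR definition.

        # Mean loss plus a penalty on the max/min ratio of per-language losses.
        import torch

        def disparity_penalized_loss(per_utt_loss, lang_ids, alpha=0.5):
            """per_utt_loss: (B,) loss per utterance; lang_ids: (B,) language index."""
            lang_means = torch.stack([per_utt_loss[lang_ids == l].mean()
                                      for l in lang_ids.unique()])
            disparity = lang_means.max() / lang_means.min()   # 1.0 when perfectly equal
            return per_utt_loss.mean() + alpha * (disparity - 1.0)

        loss = disparity_penalized_loss(torch.tensor([0.9, 1.4, 0.3, 0.5]),
                                        torch.tensor([0, 0, 1, 1]))
        print(loss)   # 0.775 + 0.5 * (1.15 / 0.40 - 1) = about 1.71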