
    ์šด์œจ ์ •๋ณด๋ฅผ ์ด์šฉํ•œ ๋งˆ๋น„๋ง์žฅ์•  ์Œ์„ฑ ์ž๋™ ๊ฒ€์ถœ ๋ฐ ํ‰๊ฐ€

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ธ๋ฌธ๋Œ€ํ•™ ์–ธ์–ดํ•™๊ณผ, 2020. 8. Minhwa Chung.๋ง์žฅ์• ๋Š” ์‹ ๊ฒฝ๊ณ„ ๋˜๋Š” ํ‡ดํ–‰์„ฑ ์งˆํ™˜์—์„œ ๊ฐ€์žฅ ๋นจ๋ฆฌ ๋‚˜ํƒ€๋‚˜๋Š” ์ฆ ์ƒ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋งˆ๋น„๋ง์žฅ์• ๋Š” ํŒŒํ‚จ์Šจ๋ณ‘, ๋‡Œ์„ฑ ๋งˆ๋น„, ๊ทผ์œ„์ถ•์„ฑ ์ธก์‚ญ ๊ฒฝํ™”์ฆ, ๋‹ค๋ฐœ์„ฑ ๊ฒฝํ™”์ฆ ํ™˜์ž ๋“ฑ ๋‹ค์–‘ํ•œ ํ™˜์ž๊ตฐ์—์„œ ๋‚˜ํƒ€๋‚œ๋‹ค. ๋งˆ๋น„๋ง์žฅ์• ๋Š” ์กฐ์Œ๊ธฐ๊ด€ ์‹ ๊ฒฝ์˜ ์†์ƒ์œผ๋กœ ๋ถ€์ •ํ™•ํ•œ ์กฐ์Œ์„ ์ฃผ์š” ํŠน์ง•์œผ๋กœ ๊ฐ€์ง€๊ณ , ์šด์œจ์—๋„ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด๊ณ ๋œ๋‹ค. ์„ ํ–‰ ์—ฐ๊ตฌ์—์„œ๋Š” ์šด์œจ ๊ธฐ๋ฐ˜ ์ธก์ •์น˜๋ฅผ ๋น„์žฅ์•  ๋ฐœํ™”์™€ ๋งˆ๋น„๋ง์žฅ์•  ๋ฐœํ™”๋ฅผ ๊ตฌ๋ณ„ํ•˜๋Š” ๊ฒƒ์— ์‚ฌ์šฉํ–ˆ๋‹ค. ์ž„์ƒ ํ˜„์žฅ์—์„œ๋Š” ๋งˆ๋น„๋ง์žฅ์• ์— ๋Œ€ํ•œ ์šด์œจ ๊ธฐ๋ฐ˜ ๋ถ„์„์ด ๋งˆ๋น„๋ง์žฅ์• ๋ฅผ ์ง„๋‹จํ•˜๊ฑฐ๋‚˜ ์žฅ์•  ์–‘์ƒ์— ๋”ฐ๋ฅธ ์•Œ๋งž์€ ์น˜๋ฃŒ๋ฒ•์„ ์ค€๋น„ํ•˜๋Š” ๊ฒƒ์— ๋„์›€์ด ๋  ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ ๋งˆ๋น„๋ง์žฅ์• ๊ฐ€ ์šด์œจ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์–‘์ƒ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋งˆ๋น„๋ง์žฅ์• ์˜ ์šด์œจ ํŠน์ง•์„ ๊ธด๋ฐ€ํ•˜๊ฒŒ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ์ด ํ•„์š”ํ•˜๋‹ค. ๊ตฌ์ฒด ์ ์œผ๋กœ, ์šด์œจ์ด ์–ด๋–ค ์ธก๋ฉด์—์„œ ๋งˆ๋น„๋ง์žฅ์• ์— ์˜ํ–ฅ์„ ๋ฐ›๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  ์šด์œจ ์• ๊ฐ€ ์žฅ์•  ์ •๋„์— ๋”ฐ๋ผ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚˜๋Š”์ง€์— ๋Œ€ํ•œ ๋ถ„์„์ด ํ•„์š”ํ•˜๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ์Œ๋†’์ด, ์Œ์งˆ, ๋ง์†๋„, ๋ฆฌ๋“ฌ ๋“ฑ ์šด์œจ์„ ๋‹ค์–‘ํ•œ ์ธก๋ฉด์— ์„œ ์‚ดํŽด๋ณด๊ณ , ๋งˆ๋น„๋ง์žฅ์•  ๊ฒ€์ถœ ๋ฐ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ถ”์ถœ๋œ ์šด์œจ ํŠน์ง•๋“ค์€ ๋ช‡ ๊ฐ€์ง€ ํŠน์ง• ์„ ํƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ตํ•ด ์ตœ์ ํ™”๋˜์–ด ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๋ถ„๋ฅ˜๊ธฐ์˜ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ๋ถ„๋ฅ˜๊ธฐ์˜ ์„ฑ๋Šฅ์€ ์ •ํ™•๋„, ์ •๋ฐ€๋„, ์žฌํ˜„์œจ, F1-์ ์ˆ˜๋กœ ํ‰๊ฐ€๋˜์—ˆ๋‹ค. ๋˜ํ•œ, ๋ณธ ๋…ผ๋ฌธ์€ ์žฅ์•  ์ค‘์ฆ๋„(๊ฒฝ๋„, ์ค‘๋“ฑ๋„, ์‹ฌ๋„)์— ๋”ฐ๋ผ ์šด์œจ ์ •๋ณด ์‚ฌ์šฉ์˜ ์œ ์šฉ์„ฑ์„ ๋ถ„์„ํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์žฅ์•  ๋ฐœํ™” ์ˆ˜์ง‘์ด ์–ด๋ ค์šด ๋งŒํผ, ๋ณธ ์—ฐ๊ตฌ๋Š” ๊ต์ฐจ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ํ•œ๊ตญ์–ด์™€ ์˜์–ด ์žฅ์•  ๋ฐœํ™”๊ฐ€ ํ›ˆ๋ จ ์…‹์œผ๋กœ ์‚ฌ์šฉ๋˜์—ˆ์œผ๋ฉฐ, ํ…Œ์ŠคํŠธ์…‹์œผ๋กœ๋Š” ๊ฐ ๋ชฉํ‘œ ์–ธ์–ด๋งŒ์ด ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ธ ๊ฐ€์ง€๋ฅผ ์‹œ์‚ฌํ•œ๋‹ค. ์ฒซ์งธ, ์šด์œจ ์ •๋ณด ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ๋งˆ๋น„๋ง์žฅ์•  ๊ฒ€์ถœ ๋ฐ ํ‰๊ฐ€์— ๋„์›€์ด ๋œ๋‹ค. MFCC ๋งŒ์„ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ, ์šด์œจ ์ •๋ณด๋ฅผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ํ•œ๊ตญ์–ด์™€ ์˜์–ด ๋ฐ์ดํ„ฐ์…‹ ๋ชจ๋‘์—์„œ ๋„์›€์ด ๋˜์—ˆ๋‹ค. ๋‘˜์งธ, ์šด์œจ ์ •๋ณด๋Š” ํ‰๊ฐ€์— ํŠนํžˆ ์œ ์šฉํ•˜๋‹ค. ์˜์–ด์˜ ๊ฒฝ์šฐ ๊ฒ€์ถœ๊ณผ ํ‰๊ฐ€์—์„œ ๊ฐ๊ฐ 1.82%์™€ 20.6%์˜ ์ƒ๋Œ€์  ์ •ํ™•๋„ ํ–ฅ์ƒ์„ ๋ณด์˜€๋‹ค. ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ ๊ฒ€์ถœ์—์„œ๋Š” ํ–ฅ์ƒ์„ ๋ณด์ด์ง€ ์•Š์•˜์ง€๋งŒ, ํ‰๊ฐ€์—์„œ๋Š” 13.6%์˜ ์ƒ๋Œ€์  ํ–ฅ์ƒ์ด ๋‚˜ํƒ€๋‚ฌ๋‹ค. ์…‹์งธ, ๊ต์ฐจ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋Š” ๋‹จ์ผ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋ณด๋‹ค ํ–ฅ์ƒ๋œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ ๊ต์ฐจ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋Š” ๋‹จ์ผ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ ์ƒ๋Œ€์ ์œผ๋กœ 4.12% ๋†’์€ ์ •ํ™•๋„๋ฅผ ๋ณด์˜€๋‹ค. ์ด๊ฒƒ์€ ํŠน์ • ์šด์œจ ์žฅ์• ๋Š” ๋ฒ”์–ธ์–ด์  ํŠน์ง•์„ ๊ฐ€์ง€๋ฉฐ, ๋‹ค๋ฅธ ์–ธ์–ด ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จ์‹œ์ผœ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ€์กฑํ•œ ํ›ˆ๋ จ ์…‹์„ ๋ณด์™„ํ•  ์ˆ˜ ์žˆ ์Œ์„ ์‹œ์‚ฌํ•œ๋‹ค.One of the earliest cues for neurological or degenerative disorders are speech impairments. Individuals with Parkinsons Disease, Cerebral Palsy, Amyotrophic lateral Sclerosis, Multiple Sclerosis among others are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles which eventually leads to severe misarticulation. 
However, impairments in the suprasegmental domain are also present and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosodic-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to not only determine how the prosody of speech is affected by dysarthria, but also what aspects of prosody are more affected and how prosodic impairments change by the severity of dysarthria. In the current study, several prosodic features related to pitch, voice quality, rhythm and speech rate are used as features for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features are optimal for accurate detection. After selecting an optimal set of prosodic features we use them as input to machine learning-based classifiers and assess the performance using the evaluation metrics: accuracy, precision, recall and F1-score. Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g. mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers where both Korean and English data are used for training but only one language used for testing. Results suggest that in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both Korean and English datasets. In particular, large improvements were seen when assessing different severity levels. For English a relative accuracy improvement of 1.82% for detection and 20.6% for assessment was seen. The Korean dataset saw no improvements for detection but a relative improvement of 13.6% for assessment. The results from cross-language experiments showed a relative improvement of up to 4.12% in comparison to only using a single language during training. It was found that certain prosodic impairments such as pitch and duration may be language independent. Therefore, when training sets of individual languages are limited, they may be supplemented by including data from other languages.1. Introduction 1 1.1. Dysarthria 1 1.2. Impaired Speech Detection 3 1.3. Research Goals & Outline 6 2. Background Research 8 2.1. Prosodic Impairments 8 2.1.1. English 8 2.1.2. Korean 10 2.2. Machine Learning Approaches 12 3. Database 18 3.1. English-TORGO 20 3.2. Korean-QoLT 21 4. Methods 23 4.1. Prosodic Features 23 4.1.1. Pitch 23 4.1.2. Voice Quality 26 4.1.3. Speech Rate 29 4.1.3. Rhythm 30 4.2. Feature Selection 34 4.3. Classification Models 38 4.3.1. Random Forest 38 4.3.1. Support Vector Machine 40 4.3.1 Feed-Forward Neural Network 42 4.4. Mel-Frequency Cepstral Coefficients 43 5. Experiment 46 5.1. Model Parameters 47 5.2. Training Procedure 48 5.2.1. Dysarthria Detection 48 5.2.2. Severity Assessment 50 5.2.3. Cross-Language 51 6. Results 52 6.1. TORGO 52 6.1.1. Dysarthria Detection 52 6.1.2. Severity Assessment 56 6.2. QoLT 57 6.2.1. Dysarthria Detection 57 6.2.2. Severity Assessment 58 6.1. Cross-Language 59 7. Discussion 62 7.1. Linguistic Implications 62 7.2. Clinical Applications 65 8. Conclusion 67 References 69 Appendix 76 Abstract in Korean 79Maste
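    The pipeline this abstract describes (prosodic measures alongside MFCCs, a feature-selection step, a machine-learning classifier, and evaluation with accuracy, precision, recall, and F1) can be illustrated with a minimal, hedged sketch. The feature choices, selector, classifier, and the `extract_features` helper below are illustrative stand-ins, not the thesis implementation, and the placeholder data is random.

```python
# Illustrative sketch only: prosodic + MFCC features -> feature selection -> classifier -> metrics.
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def extract_features(wav_path):
    """Hypothetical per-utterance features: MFCC statistics plus simple prosodic measures."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # segmental baseline
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)   # pitch track (prosody)
    f0 = f0[~np.isnan(f0)]
    prosody = [f0.mean() if f0.size else 0.0,                   # mean pitch
               f0.std() if f0.size else 0.0,                    # pitch variability
               float(voiced.mean()),                            # voicing ratio (crude rhythm proxy)
               len(y) / sr]                                     # utterance duration (speech-rate proxy)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1), prosody])

# In practice X would be built by stacking extract_features(path) over the corpus;
# here a random placeholder matrix stands in. y: 0 = healthy, 1 = dysarthric.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 30)), rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = Pipeline([("select", SelectKBest(f_classif, k=15)),  # one possible feature-selection method
                ("svm", SVC(kernel="rbf", C=1.0))])        # one possible classifier
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
prec, rec, f1, _ = precision_recall_fscore_support(y_te, pred, average="binary")
print(f"acc={accuracy_score(y_te, pred):.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```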

    Automatic Severity Classification of Dysarthric Speech based on Pronunciation Accuracy

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ธ๋ฌธ๋Œ€ํ•™ ์–ธ์–ดํ•™๊ณผ, 2021. 2. ์ •๋ฏผํ™”.Dysarthria is a motor speech disorder that occurs when muscles related to speech production are paralyzed or weakened. Dysarthria is diagnosed of its severity levels by trained speech therapists, who use perceptual evaluation on the purpose of providing appropriate treatments to each patient. While the professional diagnosis is important, perceptual evaluation not only takes a lot of time and effort but can also be biased and subjective. The automatic severity classification of dysarthric speech could compensate for these shortcomings and aid the therapists. Pronunciation accuracy, consisted of the percentage of correct phonemes and the degree of vowel distortion, is one of the most commonly used features in a clinical setting to classify the severity levels of dysarthria. However, few previous studies have considered pronunciation accuracy as a feature for automatic severity classification. In this paper, we propose pronunciation accuracy to be beneficial in automatically classifying the severity levels for dysarthric speech. Experiments were designed to confirm the usefulness of these features in contrast to the features used in previous studies: spectral features(MFCCs), voice quality features, and prosody features. Two feature selection methods-Recursive Feature Elimination(RFE) and Extra Trees Classifier(ETC) were used to determine the optimal feature set. Each optimal feature set was used as the input to two classifiers-Support Vector Machine(SVM) and Multiple Layer Perceptron(MLP). The classifiers were trained to determine the severity levels of each utterance into five categories - healthy, mild, mild-to-moderate, moderate-to-severe, severe. The performance of the classifier was evaluated using accuracy, precision, recall, and F1-score metrics. Results from the experiments before and after adding pronunciation accuracy features were compared. For the SVM classifier, the classification accuracy showed a relative increase of 15.83%, 25.42%, 23.39% before feature selection, after applying the RFE algorithm, and applying the ETC algorithm, respectively. For the MLP classifier, the relative increase accuracy of 28.97%, 21.19%, 22.95% were seen. ETC algorithm-SVM classifier experiment showed the best performance with 77.5% accuracy. The optimal feature set included % of voice breaks, speech duration, Percentage of Correct Consonants, Percentage of Correct Vowels, Percentage of Correct Phonemes, Vowel Space Area(VSA), Vowel Articulatory Index(VAI), Formant Centralized Ratio(FCR), and F2-ratio. Furthermore, the selected feature sets from each experiment were compared. When the pronunciation accuracy features were included, many voice quality features and prosody features that were selected in the baseline experiment were replaced by the pronunciation accuracy features. The contribution weight of the features from the optimal feature set showed that all pronunciation accuracy features have higher contribution weight compared to voice quality and prosody features. The results suggest two ways. First, the pronunciation accuracy features are helpful for the automatic severity classification of dysarthria. While pronunciation accuracy features have been generally used by speech pathologists, few studies related to automatic severity classification have looked into their effect. 
This study proves that the pronunciation accuracy features are useful for automatic severity classification as for a clinical setting. Second, the pronunciation accuracy features play a more important role than voice quality features or prosody features. Features related to articulation are proven to have the highest correlation with the speech intelligibility score of dysarthric speech among several features related to speech production. This study indicates that this fact holds the same for automatic severity classification.๋งˆ๋น„๋ง์žฅ์• ๋Š” ์ค‘์ถ” ์‹ ๊ฒฝ๊ณ„ ๋ฐ ์ž์œจ ์‹ ๊ฒฝ๊ณ„์˜ ์†์ƒ์œผ๋กœ ๋ง์†Œ๋ฆฌ ์‚ฐ์ถœ๊ณผ ๊ด€๋ จ๋œ ๊ทผ์œก์ด ๋งˆ๋น„๋˜๊ฑฐ๋‚˜ ์•ฝํ•ด์ง€๋ฉด์„œ ์ƒ๊ธฐ๋Š” ๋ง์šด๋™์žฅ์• ์ด๋‹ค. ์–ธ์–ด์žฌํ™œ์‚ฌ๋Š” ์•Œ๋งž์€ ์ค‘์žฌ๋ฐฉ์•ˆ์„ ๋ชจ์ƒ‰ํ•˜๊ธฐ ์œ„ํ•ด ๋งˆ๋น„๋ง์žฅ์• ์˜ ์ค‘์ฆ๋„๋ฅผ ํŒ๋‹จํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ผ๋ฐ˜์ ์œผ๋กœ ์žฅ์•  ์ค‘์ฆ๋„ ๋ถ„๋ฅ˜์— ์‚ฌ์šฉ๋˜๋Š” ์ฒญ์ง€๊ฐ์  ํ‰๊ฐ€๋Š” ๋งŽ์€ ์‹œ๊ฐ„๊ณผ ๋…ธ๋ ฅ์ด ์†Œ์š”๋  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ํ‰๊ฐ€์ž ๊ฐ„ ๋ฐ ํ‰๊ฐ€์ž ๋‚ด ์‹ ๋ขฐ๋„๋ฅผ ํ™•๋ณดํ•˜๊ธฐ ์–ด๋ ต๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค. ๋งˆ๋น„๋ง์žฅ์•  ์ค‘์ฆ๋„ ์ž๋™ ๋ถ„๋ฅ˜ ๊ธฐ์ˆ ์€ ์ด๋Ÿฌํ•œ ๋‹จ์ ๋“ค์„ ๋ณด์™„ํ•˜๋ฉฐ ์–ธ์–ด์žฌํ™œ์‚ฌ์˜ ์—…๋ฌด๋ฅผ ๋ณด์กฐํ•  ์ˆ˜ ์žˆ๋‹ค. ๋ถ„๋ฅ˜์— ์†Œ์š”๋˜๋Š” ์‹œ๊ฐ„๊ณผ ๋…ธ๋ ฅ์„ ์ ˆ์•ฝํ•˜๊ณ , ๊ฐ๊ด€์ ์ด๊ณ  ์ผ๊ด€๋œ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์„ ํ–‰์—ฐ๊ตฌ์—์„œ๋Š” ๋งˆ๋น„๋ง์žฅ์•  ์ค‘์ฆ๋„ ์ž๋™ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ๋‹ค์–‘ํ•œ ํŠน์ง• ์…‹์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์Œ์„ฑ์˜ ์ „๋ฐ˜์  ํŠน์ง•์„ ๋ฐ˜์˜ํ•˜๋Š” ์ŠคํŽ™ํŠธ๋Ÿผ ํŠน์ง•๋งŒ์„ ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜, ์Œ์„ฑ์  ํŠน์ง•์„ ์„ธ๋ถ„ํ™”ํ•˜์—ฌ ์Œ์งˆ ํŠน์ง•, ์šด์œจ ํŠน์ง• ๋“ฑ์œผ๋กœ ํŠน์ง• ์…‹์„ ๊ตฌ์„ฑํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์„ ํ–‰ ์—ฐ๊ตฌ์˜ ํŠน์ง• ์…‹์€ ์Œ์†Œ ๋‹จ์œ„์˜ ํŠน์ง•์ธ ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์„ ๋ฐ˜์˜ํ•˜๊ณ  ์žˆ์ง€ ์•Š๋‹ค. ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์€ ์–ธ์–ด์น˜๋ฃŒ ๋ถ„์•ผ์—์„œ ๋งˆ๋น„๋ง์žฅ์• ์˜ ์ค‘์ฆ๋„๋ฅผ ๊ตฌ๋ถ„ํ•  ๋•Œ ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ˜๋ฉด, ์žฅ์•  ์ค‘์ฆ๋„ ์ž๋™ ๋ถ„๋ฅ˜ ์—ฐ๊ตฌ์—์„œ๋Š” ์ƒ๋Œ€์ ์œผ๋กœ ๊ฑฐ์˜ ์‚ฌ์šฉ๋˜์ง€ ์•Š์•˜๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์„ ๋งˆ๋น„๋ง์žฅ์•  ์ค‘์ฆ๋„ ์ž๋™ ๋ถ„๋ฅ˜์— ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ์ œ์•ˆํ•œ๋‹ค. ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์€ ์Œ์†Œ์ •ํ™•๋„ ํŠน์ง•๊ณผ ๋ชจ์Œ์™œ๊ณก๋„ ํŠน์ง•์„ ํฌํ•จํ•˜๋Š” ๊ฐœ๋…์œผ๋กœ ์Œ์†Œ์ •ํ™•๋„ ํŠน์ง•์€ ์ž์Œ์ •ํ™•๋„, ๋ชจ์Œ์ •ํ™•๋„, ์Œ์†Œ์ •ํ™•๋„(์ž์Œ+๋ชจ์Œ), ๋ชจ์Œ์™œ๊ณก๋„ ํŠน์ง•์€ ๋ชจ์Œ์‚ฌ๊ฐ๋„ ๋ฉด์ , VAI, FCR, F2-Ratio๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์˜ ์œ ์šฉ์„ฑ์„ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด ์•ž์„œ ์–ธ๊ธ‰ํ•œ ์ŠคํŽ™ํŠธ๋Ÿผ ํŠน์ง•(MFCCs), ์Œ์งˆ ํŠน์ง•, ์šด์œจ ํŠน์ง•์„ ๋ฒ ์ด์Šค๋ผ์ธ ํŠน์ง•์œผ๋กœ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ถ”์ถœ๋œ ํŠน์ง• ์…‹์€ Recursive Feature Elimination(RFE)๊ณผ Extra Trees Classifier(ETC) ๋‘ ๊ฐœ์˜ ํŠน์ง• ์„ ํƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ตํ•ด ์ตœ์ ํ™”๋˜์—ˆ๋‹ค. ์„ ํƒ๋œ ํŠน์ง•๋“ค์€ SVM(Support Vector Machine)๊ณผ MLP(Multiple Layer Perceptron) ๋ถ„๋ฅ˜๊ธฐ์˜ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ์‚ฌ์šฉ๋˜์—ˆ๊ณ , ๋ถ„๋ฅ˜๊ธฐ๋Š” ๊ฐ ์Œ์„ฑ์˜ ์žฅ์•  ์ค‘์ฆ๋„(๋น„์žฅ์• /๊ฒฝ๋„/๊ฒฝ๋„-์ค‘๋“ฑ๋„/์ค‘๋“ฑ๋„-์ค‘๋„/์ค‘๋„)๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋„๋ก ํ•™์Šต๋˜์—ˆ๋‹ค. ๋ถ„๋ฅ˜๊ธฐ์˜ ์„ฑ๋Šฅ์€ ์ •ํ™•๋„, ์ •๋ฐ€๋„, ์žฌํ˜„์œจ, F1-์ ์ˆ˜๋กœ ํ‰๊ฐ€๋˜์—ˆ๋‹ค. ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง• ์ถ”๊ฐ€ ์ „ํ›„์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ์‚ดํŽด๋ณธ ๊ฒฐ๊ณผ, ํŠน์ง• ์„ ํƒ ์ „, RFE ์ ์šฉ ํ›„, ETC ์ ์šฉ ํ›„ SVM์˜ ๋ถ„๋ฅ˜ ์ •ํ™•๋„ ์ƒ๋Œ€์  ์ฆ๊ฐ€์œจ์€ ๊ฐ๊ฐ 15.83%, 25.42%, 23.39%์˜€๊ณ , MLP์˜ ๋ถ„๋ฅ˜ ์ •ํ™•๋„ ์ƒ๋Œ€์  ์ฆ๊ฐ€์œจ์€ ๊ฐ๊ฐ 28.97%, 21.19%, 22.95%๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ์„ ๋ณด์ธ ์‹คํ—˜์€ ETC ํŠน์ง• ์„ ํƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜-SVM ์กฐํ•ฉ ์‹คํ—˜์œผ๋กœ, 77.5%์˜ ๋ถ„๋ฅ˜ ์ •ํ™•๋„๋ฅผ ๋ณด์˜€๋‹ค. 
๋” ๋‚˜์•„๊ฐ€, ๊ฐ ์‹คํ—˜์—์„œ ํŠน์ง• ์„ ํƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์„ ํƒํ•œ ํŠน์ง•๊ณผ ํŠน์ง• ๋ณ„ ๊ฐœ๋ณ„ ๊ธฐ์—ฌ๋„๋ฅผ ์‚ดํŽด๋ณด์•˜๋‹ค. ๊ทธ ๊ฒฐ๊ณผ, ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์„ ์ถ”๊ฐ€ํ–ˆ์„ ๋•Œ ๋ฒ ์ด์Šค๋ผ์ธ์—์„œ ์„ ํƒ๋˜์—ˆ๋˜ ์Œ์งˆ, ์šด์œจ ํŠน์ง• ๋‹ค์ˆ˜๊ฐ€ ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์— ์˜ํ•ด ๋Œ€์ฒด๋˜์—ˆ์œผ๋ฉฐ, ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์€ ์Œ์งˆ, ์šด์œจ ํŠน์ง•๋ณด๋‹ค ๋” ๋†’์€ ๊ธฐ์—ฌ๊ฐ€์ค‘์น˜๋ฅผ ๊ฐ€์กŒ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์„ ์‹œ์‚ฌํ•œ๋‹ค. ์ฒซ์งธ, ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์€ ๋งˆ๋น„๋ง์žฅ์•  ์ค‘์ฆ๋„ ์ž๋™ ๋ถ„๋ฅ˜์— ๋„์›€์ด ๋œ๋‹ค. ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง• ์ถ”๊ฐ€ ์ „ํ›„ ๋ถ„๋ฅ˜ ์ •ํ™•๋„๋ฅผ ๋น„๊ตํ–ˆ์„ ๋•Œ, ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์„ ์ถ”๊ฐ€ํ–ˆ์„ ๋•Œ ๋” ๋†’์€ ๋ถ„๋ฅ˜ ์ •ํ™•๋„๋ฅผ ๋ณด์˜€๋‹ค. ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์€ ์–ธ์–ด์น˜๋ฃŒ ๋ถ„์•ผ์—์„œ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜์–ด์™”์ง€๋งŒ, ์ž๋™ ๋ถ„๋ฅ˜ ์—ฐ๊ตฌ์—์„œ๋Š” ๋ช…์‹œ์ ์œผ๋กœ ์‚ฌ์šฉ๋œ ๊ฒฝ์šฐ๊ฐ€ ์ ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์ด ์ž๋™ ๋ถ„๋ฅ˜์—์„œ๋„ ์‚ฌ์šฉ๋˜์–ด์•ผ ํ•จ์„ ์‹œ์‚ฌํ•œ๋‹ค. ๋‘˜์งธ, ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์€ ๋งˆ๋น„๋ง์žฅ์•  ์ค‘์ฆ๋„ ์ž๋™ ๋ถ„๋ฅ˜์—์„œ ์Œ์งˆ, ์šด์œจ ํŠน์ง•๋ณด๋‹ค ๋” ํฐ ์˜ํ–ฅ๋ ฅ์„ ํ–‰์‚ฌํ•œ๋‹ค. ์‹คํ—˜ ๋ณ„ ์„ ํƒ๋œ ํŠน์ง•์„ ์‚ดํŽด๋ณธ ๊ฒฐ๊ณผ, ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์ด ์Œ์งˆ, ์šด์œจ ํŠน์ง•์— ๋น„ํ•ด ๋Œ€์ฒด๋˜์—ˆ๋‹ค. ํŠน์ง• ๋ณ„ ๊ฐœ๋ณ„ ๊ธฐ์—ฌ๋„๋ฅผ ์‚ดํŽด๋ณธ ๊ฒฐ๊ณผ, ๋ชจ๋“  ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์˜ ๊ฐœ๋ณ„ ๊ธฐ์—ฌ๋„๊ฐ€ ์Œ์งˆ, ์šด์œจ ํŠน์ง•์˜ ๊ฐœ๋ณ„ ๊ธฐ์—ฌ๋„๊ฐ€ ๋†’์•˜๋‹ค. ์ด๋Š” ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง•์ด ๋‹ค๋ฅธ ํŠน์ง•๋ณด๋‹ค ๋งˆ๋น„๋ง์žฅ์•  ์ค‘์ฆ๋„์™€ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ณด์ธ๋‹ค๋Š” ์„ ํ–‰์—ฐ๊ตฌ์™€ ์ผ๋งฅ์ƒํ†ตํ•œ ๊ฒฐ๊ณผ์ด๋‹ค.1. ์„œ๋ก  ............................................................................................... 1 2. ๊ด€๋ จ ์—ฐ๊ตฌ ....................................................................................... 4 2.1 ์žฅ์•  ์ค‘์ฆ๋„ ๋ถ„๋ฅ˜ ๊ธฐ์ค€ .......................................................................... 4 2.1.1 ๋ง๋ช…๋ฃŒ๋„ ........................................................................................ 4 2.1.2 ์ž์Œ์ •ํ™•๋„ ........................................................................................ 5 2.2 ๋งˆ๋น„๋ง์žฅ์•  ์Œ์„ฑ์˜ ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง• .............................................. 6 2.3 ๋งˆ๋น„๋ง์žฅ์•  ์Œ์„ฑ ์žฅ์•  ์ค‘์ฆ๋„ ์ž๋™ ๋ถ„๋ฅ˜ .......................................... 7 3. ์‹คํ—˜ ๋ฐฉ๋ฒ•๋ก  .................................................................................. 11 3.1 ์‹คํ—˜ ์„ค๊ณ„ ................................................................................................. 11 3.2 ํŠน์ง• ์ •์˜ ๋ฐ ํŠน์ง• ์ถ”์ถœ ๋ฐฉ๋ฒ• ............................................................ 11 3.2.1 Mel Frequency Cepstral Coefficients (MFCCs) .............. 12 3.2.2 ์Œ์งˆ ํŠน์ง• ..................................................................................... 14 3.2.3 ์šด์œจ ํŠน์ง• ..................................................................................... 15 3.2.3.1 ๋ฐœํ™” ์†๋„ ............................................................................ 15 3.3.3.2 ์Œ๋†’์ด ................................................................................. 15 3.3.3.3 ๋ฆฌ๋“ฌ ..................................................................................... 16 3.2.4 ๋ฐœ์Œ ์ •ํ™•๋„ ํŠน์ง• ....................................................................... 17 3.4.1 ์Œ์†Œ์ •ํ™•๋„ ............................................................................ 
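    The vowel distortion measures named in this abstract (VSA, VAI, FCR, F2-ratio) are simple functions of the first two formants of the corner vowels. The sketch below shows one common formulation (VSA as the area of the /i/-/a/-/u/ triangle, VAI and FCR in their usual Sapir-style definitions); the formant values are invented for illustration, and the exact definitions used in the thesis may differ.

```python
# Hedged sketch of corner-vowel distortion metrics computed from F1/F2 values (Hz).
# The formant values below are made-up examples, not data from the thesis.
formants = {            # vowel: (F1, F2)
    "i": (300.0, 2300.0),
    "a": (750.0, 1300.0),
    "u": (350.0,  800.0),
}

def vowel_space_area(f):
    """Area of the /i/-/a/-/u/ triangle in the F1-F2 plane (one common VSA definition)."""
    (f1i, f2i), (f1a, f2a), (f1u, f2u) = f["i"], f["a"], f["u"]
    return 0.5 * abs(f1i * (f2a - f2u) + f1a * (f2u - f2i) + f1u * (f2i - f2a))

def vowel_articulatory_index(f):
    """VAI: larger values indicate a more expanded (less centralized) vowel space."""
    (f1i, f2i), (f1a, f2a), (f1u, f2u) = f["i"], f["a"], f["u"]
    return (f2i + f1a) / (f1i + f1u + f2u + f2a)

def formant_centralization_ratio(f):
    """FCR is the reciprocal of VAI: larger values indicate more vowel centralization."""
    return 1.0 / vowel_articulatory_index(f)

def f2_ratio(f):
    """F2(i)/F2(u): shrinks toward 1 as the front-back F2 contrast collapses."""
    return f["i"][1] / f["u"][1]

print(vowel_space_area(formants), vowel_articulatory_index(formants),
      formant_centralization_ratio(formants), f2_ratio(formants))
```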
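    The classification setup described above (ETC-based feature selection feeding an SVM that predicts five severity levels) can be sketched with scikit-learn as follows; the data, feature count, and hyperparameters are placeholders rather than the thesis configuration.

```python
# Illustrative sketch: Extra-Trees-based feature selection followed by a 5-class SVM.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix standing in for MFCC, voice quality, prosody, and
# pronunciation accuracy columns; labels cover the five severity categories.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))
y = rng.integers(0, 5, 300)   # 0=healthy, 1=mild, 2=mild-to-moderate, 3=moderate-to-severe, 4=severe

# ETC ranks features by impurity-based importance; SelectFromModel keeps the most important ones.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=200, random_state=1))
X_sel = selector.fit_transform(X, y)

svm = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(svm, X_sel, y, cv=5, scoring="accuracy")
print(f"kept {X_sel.shape[1]} of {X.shape[1]} features, mean CV accuracy {scores.mean():.3f}")
```

    To mirror the other selection method mentioned in the abstract, sklearn.feature_selection.RFE could be substituted for the SelectFromModel step.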

    Evaluation of STT technologies performance and database design for Spanish dysarthric speech

    Automatic Speech Recognition (ASR) systems have become everyday tools worldwide. Their use has spread in recent years, and they have also been built into Environmental Control Systems (ECS) and Speech Generating Devices (SGD), among others. These systems can be especially beneficial for people with physical disabilities, who can control different devices with voice commands and thereby reduce the physical effort required. However, people with functional diversity often have difficulties with speech articulation as well. One of the most common speech articulation problems is dysarthria, a disorder of the nervous system that causes weakness in the muscles used for speech. Existing commercial ASR systems are not able to correctly understand dysarthric speech, so people with this condition cannot exploit this technology. Some research tackling this issue has been conducted, but an optimal solution has not yet been reached. Moreover, nearly all existing research on the matter is in English; no previous study has approached the problem in other languages. Apart from this, ASR systems require large speech databases, and the few dysarthric speech databases that exist are mostly in English and were not designed for this purpose. Some commercial ASR systems offer a customization interface where users can train a base model with their own speech data and thus improve recognition accuracy. In this thesis, we evaluated the performance of the commercial ASR system Microsoft Azure Speech to Text. First, we reviewed the current state of the art. Then, we created a pilot database in Spanish and recorded it with three heterogeneous speakers with dysarthria and one typical speaker used as a reference. Lastly, we trained the system and conducted different experiments to measure its accuracy. Results show that, overall, the customized models outperform the system's base models. However, the results were not homogeneous but varied by speaker. Even though recognition accuracy improved considerably, the results were still far from those obtained for typical speech.
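    Recognition accuracy in experiments like these is usually reported as word error rate (WER), computed by aligning the recognizer output against a reference transcript. A small, self-contained sketch (not code from the thesis) is shown below; the sample sentences are invented.

```python
# Word error rate via edit distance over words: (substitutions + deletions + insertions) / reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented example: base model vs. customized model output for the same Spanish prompt.
reference = "enciende la luz de la cocina"
print(word_error_rate(reference, "en tiende la luz la cocina"))   # base model (hypothetical output)
print(word_error_rate(reference, "enciende la luz de la cocina")) # customized model (hypothetical output)
```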

    Optimizing Vocabulary Modeling for Dysarthric Voice User Interface

    Thesis (Ph.D.) -- Seoul National University Graduate School: Interdisciplinary Program in Cognitive Science, 2016. 2. Minhwa Chung.
    Articulation errors, disfluency, impulsive pauses, and a low speaking rate have been identified as sources of recognition error for dysarthric speakers using a voice user interface. In related work, methods for correcting dysarthric speech, acoustic model (AM) adaptation, pronunciation variation modeling, grammar modeling, and vocabulary modeling, all based on acoustic and phonetic analyses of dysarthric speech, have been proposed to compensate for those factors. In this thesis, the acoustic model was optimized for dysarthric speech, and the words in the recognition vocabulary were selected by a GLMM (generalized linear mixed model) relating recognition errors to the articulatory features of phonetic classes; the vocabulary was further optimized by lowering the similarity between words. Three problems in training acoustic models for dysarthric speech recognition were addressed: first, the low speaking rate was compensated for by varying the FFT window length used for feature extraction and the number of HMM states; second, different models for the HMM emission probability (GMM, Subspace GMM, DNN) were compared; third, the effectiveness of adding a large amount of non-dysarthric speech to the training data was tested. In the analysis of the relation between recognition errors and consonant classes, fricatives and nasals were statistically significant at the 0.05 level, and all vowel classes were significant. The AIC of the mixed model was lower when consonants were categorized by manner of articulation rather than by place, and when vowels were categorized by tongue position rather than by height. With phonemes categorized in this way, fricatives increased the word error rate and nasals increased recognition accuracy, while the estimates for plosives, affricates, and liquids were close to zero. All vowel classes increased accuracy; the estimate for central vowels was the largest, followed by back vowels, front vowels, and diphthongs. The likelihood that competing words trigger recognition errors was modeled by word-to-word similarity based on Levenshtein and cosine distance, and the effect of this similarity on recognition results was confirmed through a minimum-maximum similarity contrast and N-best prediction. To build the vocabulary, an articulation score was first calculated for each word with the GLMM, and the vocabulary was composed of the words with the maximum articulation scores; then, words with high similarity to other words in the vocabulary were replaced by words with lower similarity and high articulation scores.
In dysarthric speech recognition, the optimized vocabulary lowered the WER by 5.72% absolute, a 34.60% relative error reduction, compared to the baseline vocabulary.
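    The word-similarity criterion described above can be sketched as Levenshtein distance over the symbol strings of two words plus cosine distance over bag-of-symbol count vectors, with vocabulary entries chosen so that pairwise similarity stays low. Everything in the sketch below (the toy word list, the normalization, the threshold) is illustrative, not the thesis implementation.

```python
# Hedged sketch: string-distance-based similarity between candidate vocabulary words.
from collections import Counter
from math import sqrt

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two symbol strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-character count vectors of the two words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    return dot / (sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values())))

def similarity(a: str, b: str) -> float:
    """Combine the two cues: high when words are easily confused (illustrative definition)."""
    lev = 1.0 - levenshtein(a, b) / max(len(a), len(b))
    return 0.5 * (lev + cosine_similarity(a, b))

# Toy example: keep candidate command words whose similarity to every kept word stays low.
candidates = ["open", "oven", "close", "stop", "start"]
kept, threshold = [], 0.6
for word in candidates:
    if all(similarity(word, other) < threshold for other in kept):
        kept.append(word)
print(kept)  # "oven" is dropped because it is too similar to "open"
```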

    Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

    In this thesis, we investigated how to efficiently apply subspace Gaussian mixture modeling techniques to two speech technology problems: automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important such techniques in this thesis is joint factor analysis (JFA). JFA is essentially a Gaussian mixture model in which the mean of each component is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, such as the channel. JFA was investigated both as a final classifier and as a feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, an i-Vector is a low-dimension representation of a single utterance, and i-Vectors are a very powerful feature for different machine learning problems. We investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements of up to 50% with i-Vectors with respect to JFA. We found that the channel subspace of a JFA model also contains language information, whereas i-Vectors do not discard any language information and, moreover, help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution whose covariance matrix was shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information in the i-Vector system. Its performance was below that of the acoustic system, but the two were found to be complementary, and their fusion yielded up to a 20% relative improvement with respect to the acoustic system alone. Given the success in LID, and since i-Vectors in principle capture all the information present in the data, we decided to use them for another task: the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to rate the intelligibility of their patients objectively and consistently. In this case, the input features were extracted from short-term spectral information, and intelligibility was assessed from the i-Vectors calculated for a set of words uttered by the tested speaker. We found that performance was clearly much better when data from the person who would use the application was available for training. This limitation could be relaxed with larger training databases; however, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers that are open to the research community. Finally, the same i-Vector-based architecture used for intelligibility assessment was applied to predicting the accuracy that an automatic speech recognition (ASR) system would obtain with dysarthric speakers; the only difference between the two was the ground-truth label set used for training. Predicting the performance of an ASR system would increase the confidence of speech therapists in these systems and would reduce health-related costs. The results were not as satisfactory as in the previous case, probably because an ASR system is complex and its accuracy is very difficult to predict from acoustic information alone. Nonetheless, we think this work opens the door to an interesting research direction for both problems.
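    The language classifier described above, in which each language's i-Vectors are modeled by a Gaussian with a covariance matrix shared across languages, is exactly the generative model behind linear discriminant analysis, so a minimal sketch can lean on scikit-learn; the random "i-Vectors" below are placeholders for vectors extracted by a real front end.

```python
# Hedged sketch: classify i-Vectors with per-class Gaussian means and one shared covariance matrix,
# which is the generative model assumed by linear discriminant analysis (LDA).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_langs, dim = 4, 100                      # e.g. 4 target languages, 100-dimensional i-Vectors
means = rng.normal(scale=2.0, size=(n_langs, dim))
X = np.vstack([m + rng.normal(size=(150, dim)) for m in means])   # placeholder "i-Vectors"
y = np.repeat(np.arange(n_langs), 150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=2)
clf = LinearDiscriminantAnalysis()         # fits class means plus one pooled covariance
clf.fit(X_tr, y_tr)
print("language ID accuracy on held-out i-Vectors:", clf.score(X_te, y_te))
```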