2,038 research outputs found

    Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech

    Full text link
    The rapid population aging has stimulated the development of assistive devices that provide personalized medical support to the needies suffering from various etiologies. One prominent clinical application is a computer-assisted speech training system which enables personalized speech therapy to patients impaired by communicative disorders in the patient's home environment. Such a system relies on the robust automatic speech recognition (ASR) technology to be able to provide accurate articulation feedback. With the long-term aim of developing off-the-shelf ASR systems that can be incorporated in clinical context without prior speaker information, we compare the ASR performance of speaker-independent bottleneck and articulatory features on dysarthric speech used in conjunction with dedicated neural network-based acoustic models that have been shown to be robust against spectrotemporal deviations. We report ASR performance of these systems on two dysarthric speech datasets of different characteristics to quantify the achieved performance gains. Despite the remaining performance gap between the dysarthric and normal speech, significant improvements have been reported on both datasets using speaker-independent ASR architectures.Comment: to appear in Computer Speech & Language - https://doi.org/10.1016/j.csl.2019.05.002 - arXiv admin note: substantial text overlap with arXiv:1807.1094

    A Few-Shot Approach to Dysarthric Speech Intelligibility Level Classification Using Transformers

    Full text link
    Dysarthria is a speech disorder that hinders communication due to difficulties in articulating words. Detection of dysarthria is important for several reasons as it can be used to develop a treatment plan and help improve a person's quality of life and ability to communicate effectively. Much of the literature focused on improving ASR systems for dysarthric speech. The objective of the current work is to develop models that can accurately classify the presence of dysarthria and also give information about the intelligibility level using limited data by employing a few-shot approach using a transformer model. This work also aims to tackle the data leakage that is present in previous studies. Our whisper-large-v2 transformer model trained on a subset of the UASpeech dataset containing medium intelligibility level patients achieved an accuracy of 85%, precision of 0.92, recall of 0.8 F1-score of 0.85, and specificity of 0.91. Experimental results also demonstrate that the model trained using the 'words' dataset performed better compared to the model trained on the 'letters' and 'digits' dataset. Moreover, the multiclass model achieved an accuracy of 67%.Comment: Paper has been presented at ICCCNT 2023 and the final version will be published in IEEE Digital Library Xplor

    Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

    Full text link
    Automatic detection and severity level classification of dysarthria directly from acoustic speech signals can be used as a tool in medical diagnosis. In this work, the pre-trained wav2vec 2.0 model is studied as a feature extractor to build detection and severity level classification systems for dysarthric speech. The experiments were carried out with the popularly used UA-speech database. In the detection experiments, the results revealed that the best performance was obtained using the embeddings from the first layer of the wav2vec model that yielded an absolute improvement of 1.23% in accuracy compared to the best performing baseline feature (spectrogram). In the studied severity level classification task, the results revealed that the embeddings from the final layer gave an absolute improvement of 10.62% in accuracy compared to the best baseline features (mel-frequency cepstral coefficients)

    ์šด์œจ ์ •๋ณด๋ฅผ ์ด์šฉํ•œ ๋งˆ๋น„๋ง์žฅ์•  ์Œ์„ฑ ์ž๋™ ๊ฒ€์ถœ ๋ฐ ํ‰๊ฐ€

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ธ๋ฌธ๋Œ€ํ•™ ์–ธ์–ดํ•™๊ณผ, 2020. 8. Minhwa Chung.๋ง์žฅ์• ๋Š” ์‹ ๊ฒฝ๊ณ„ ๋˜๋Š” ํ‡ดํ–‰์„ฑ ์งˆํ™˜์—์„œ ๊ฐ€์žฅ ๋นจ๋ฆฌ ๋‚˜ํƒ€๋‚˜๋Š” ์ฆ ์ƒ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋งˆ๋น„๋ง์žฅ์• ๋Š” ํŒŒํ‚จ์Šจ๋ณ‘, ๋‡Œ์„ฑ ๋งˆ๋น„, ๊ทผ์œ„์ถ•์„ฑ ์ธก์‚ญ ๊ฒฝํ™”์ฆ, ๋‹ค๋ฐœ์„ฑ ๊ฒฝํ™”์ฆ ํ™˜์ž ๋“ฑ ๋‹ค์–‘ํ•œ ํ™˜์ž๊ตฐ์—์„œ ๋‚˜ํƒ€๋‚œ๋‹ค. ๋งˆ๋น„๋ง์žฅ์• ๋Š” ์กฐ์Œ๊ธฐ๊ด€ ์‹ ๊ฒฝ์˜ ์†์ƒ์œผ๋กœ ๋ถ€์ •ํ™•ํ•œ ์กฐ์Œ์„ ์ฃผ์š” ํŠน์ง•์œผ๋กœ ๊ฐ€์ง€๊ณ , ์šด์œจ์—๋„ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด๊ณ ๋œ๋‹ค. ์„ ํ–‰ ์—ฐ๊ตฌ์—์„œ๋Š” ์šด์œจ ๊ธฐ๋ฐ˜ ์ธก์ •์น˜๋ฅผ ๋น„์žฅ์•  ๋ฐœํ™”์™€ ๋งˆ๋น„๋ง์žฅ์•  ๋ฐœํ™”๋ฅผ ๊ตฌ๋ณ„ํ•˜๋Š” ๊ฒƒ์— ์‚ฌ์šฉํ–ˆ๋‹ค. ์ž„์ƒ ํ˜„์žฅ์—์„œ๋Š” ๋งˆ๋น„๋ง์žฅ์• ์— ๋Œ€ํ•œ ์šด์œจ ๊ธฐ๋ฐ˜ ๋ถ„์„์ด ๋งˆ๋น„๋ง์žฅ์• ๋ฅผ ์ง„๋‹จํ•˜๊ฑฐ๋‚˜ ์žฅ์•  ์–‘์ƒ์— ๋”ฐ๋ฅธ ์•Œ๋งž์€ ์น˜๋ฃŒ๋ฒ•์„ ์ค€๋น„ํ•˜๋Š” ๊ฒƒ์— ๋„์›€์ด ๋  ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ ๋งˆ๋น„๋ง์žฅ์• ๊ฐ€ ์šด์œจ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์–‘์ƒ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋งˆ๋น„๋ง์žฅ์• ์˜ ์šด์œจ ํŠน์ง•์„ ๊ธด๋ฐ€ํ•˜๊ฒŒ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ์ด ํ•„์š”ํ•˜๋‹ค. ๊ตฌ์ฒด ์ ์œผ๋กœ, ์šด์œจ์ด ์–ด๋–ค ์ธก๋ฉด์—์„œ ๋งˆ๋น„๋ง์žฅ์• ์— ์˜ํ–ฅ์„ ๋ฐ›๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  ์šด์œจ ์• ๊ฐ€ ์žฅ์•  ์ •๋„์— ๋”ฐ๋ผ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚˜๋Š”์ง€์— ๋Œ€ํ•œ ๋ถ„์„์ด ํ•„์š”ํ•˜๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ์Œ๋†’์ด, ์Œ์งˆ, ๋ง์†๋„, ๋ฆฌ๋“ฌ ๋“ฑ ์šด์œจ์„ ๋‹ค์–‘ํ•œ ์ธก๋ฉด์— ์„œ ์‚ดํŽด๋ณด๊ณ , ๋งˆ๋น„๋ง์žฅ์•  ๊ฒ€์ถœ ๋ฐ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ถ”์ถœ๋œ ์šด์œจ ํŠน์ง•๋“ค์€ ๋ช‡ ๊ฐ€์ง€ ํŠน์ง• ์„ ํƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ตํ•ด ์ตœ์ ํ™”๋˜์–ด ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๋ถ„๋ฅ˜๊ธฐ์˜ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ๋ถ„๋ฅ˜๊ธฐ์˜ ์„ฑ๋Šฅ์€ ์ •ํ™•๋„, ์ •๋ฐ€๋„, ์žฌํ˜„์œจ, F1-์ ์ˆ˜๋กœ ํ‰๊ฐ€๋˜์—ˆ๋‹ค. ๋˜ํ•œ, ๋ณธ ๋…ผ๋ฌธ์€ ์žฅ์•  ์ค‘์ฆ๋„(๊ฒฝ๋„, ์ค‘๋“ฑ๋„, ์‹ฌ๋„)์— ๋”ฐ๋ผ ์šด์œจ ์ •๋ณด ์‚ฌ์šฉ์˜ ์œ ์šฉ์„ฑ์„ ๋ถ„์„ํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์žฅ์•  ๋ฐœํ™” ์ˆ˜์ง‘์ด ์–ด๋ ค์šด ๋งŒํผ, ๋ณธ ์—ฐ๊ตฌ๋Š” ๊ต์ฐจ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ํ•œ๊ตญ์–ด์™€ ์˜์–ด ์žฅ์•  ๋ฐœํ™”๊ฐ€ ํ›ˆ๋ จ ์…‹์œผ๋กœ ์‚ฌ์šฉ๋˜์—ˆ์œผ๋ฉฐ, ํ…Œ์ŠคํŠธ์…‹์œผ๋กœ๋Š” ๊ฐ ๋ชฉํ‘œ ์–ธ์–ด๋งŒ์ด ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ธ ๊ฐ€์ง€๋ฅผ ์‹œ์‚ฌํ•œ๋‹ค. ์ฒซ์งธ, ์šด์œจ ์ •๋ณด ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ๋งˆ๋น„๋ง์žฅ์•  ๊ฒ€์ถœ ๋ฐ ํ‰๊ฐ€์— ๋„์›€์ด ๋œ๋‹ค. MFCC ๋งŒ์„ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ, ์šด์œจ ์ •๋ณด๋ฅผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ํ•œ๊ตญ์–ด์™€ ์˜์–ด ๋ฐ์ดํ„ฐ์…‹ ๋ชจ๋‘์—์„œ ๋„์›€์ด ๋˜์—ˆ๋‹ค. ๋‘˜์งธ, ์šด์œจ ์ •๋ณด๋Š” ํ‰๊ฐ€์— ํŠนํžˆ ์œ ์šฉํ•˜๋‹ค. ์˜์–ด์˜ ๊ฒฝ์šฐ ๊ฒ€์ถœ๊ณผ ํ‰๊ฐ€์—์„œ ๊ฐ๊ฐ 1.82%์™€ 20.6%์˜ ์ƒ๋Œ€์  ์ •ํ™•๋„ ํ–ฅ์ƒ์„ ๋ณด์˜€๋‹ค. ํ•œ๊ตญ์–ด์˜ ๊ฒฝ์šฐ ๊ฒ€์ถœ์—์„œ๋Š” ํ–ฅ์ƒ์„ ๋ณด์ด์ง€ ์•Š์•˜์ง€๋งŒ, ํ‰๊ฐ€์—์„œ๋Š” 13.6%์˜ ์ƒ๋Œ€์  ํ–ฅ์ƒ์ด ๋‚˜ํƒ€๋‚ฌ๋‹ค. ์…‹์งธ, ๊ต์ฐจ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋Š” ๋‹จ์ผ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋ณด๋‹ค ํ–ฅ์ƒ๋œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ ๊ต์ฐจ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ๋Š” ๋‹จ์ผ ์–ธ์–ด ๋ถ„๋ฅ˜๊ธฐ์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ ์ƒ๋Œ€์ ์œผ๋กœ 4.12% ๋†’์€ ์ •ํ™•๋„๋ฅผ ๋ณด์˜€๋‹ค. ์ด๊ฒƒ์€ ํŠน์ • ์šด์œจ ์žฅ์• ๋Š” ๋ฒ”์–ธ์–ด์  ํŠน์ง•์„ ๊ฐ€์ง€๋ฉฐ, ๋‹ค๋ฅธ ์–ธ์–ด ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จ์‹œ์ผœ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ€์กฑํ•œ ํ›ˆ๋ จ ์…‹์„ ๋ณด์™„ํ•  ์ˆ˜ ์žˆ ์Œ์„ ์‹œ์‚ฌํ•œ๋‹ค.One of the earliest cues for neurological or degenerative disorders are speech impairments. Individuals with Parkinsons Disease, Cerebral Palsy, Amyotrophic lateral Sclerosis, Multiple Sclerosis among others are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles which eventually leads to severe misarticulation. However, impairments in the suprasegmental domain are also present and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosodic-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to not only determine how the prosody of speech is affected by dysarthria, but also what aspects of prosody are more affected and how prosodic impairments change by the severity of dysarthria. In the current study, several prosodic features related to pitch, voice quality, rhythm and speech rate are used as features for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features are optimal for accurate detection. After selecting an optimal set of prosodic features we use them as input to machine learning-based classifiers and assess the performance using the evaluation metrics: accuracy, precision, recall and F1-score. Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g. mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers where both Korean and English data are used for training but only one language used for testing. Results suggest that in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both Korean and English datasets. In particular, large improvements were seen when assessing different severity levels. For English a relative accuracy improvement of 1.82% for detection and 20.6% for assessment was seen. The Korean dataset saw no improvements for detection but a relative improvement of 13.6% for assessment. The results from cross-language experiments showed a relative improvement of up to 4.12% in comparison to only using a single language during training. It was found that certain prosodic impairments such as pitch and duration may be language independent. Therefore, when training sets of individual languages are limited, they may be supplemented by including data from other languages.1. Introduction 1 1.1. Dysarthria 1 1.2. Impaired Speech Detection 3 1.3. Research Goals & Outline 6 2. Background Research 8 2.1. Prosodic Impairments 8 2.1.1. English 8 2.1.2. Korean 10 2.2. Machine Learning Approaches 12 3. Database 18 3.1. English-TORGO 20 3.2. Korean-QoLT 21 4. Methods 23 4.1. Prosodic Features 23 4.1.1. Pitch 23 4.1.2. Voice Quality 26 4.1.3. Speech Rate 29 4.1.3. Rhythm 30 4.2. Feature Selection 34 4.3. Classification Models 38 4.3.1. Random Forest 38 4.3.1. Support Vector Machine 40 4.3.1 Feed-Forward Neural Network 42 4.4. Mel-Frequency Cepstral Coefficients 43 5. Experiment 46 5.1. Model Parameters 47 5.2. Training Procedure 48 5.2.1. Dysarthria Detection 48 5.2.2. Severity Assessment 50 5.2.3. Cross-Language 51 6. Results 52 6.1. TORGO 52 6.1.1. Dysarthria Detection 52 6.1.2. Severity Assessment 56 6.2. QoLT 57 6.2.1. Dysarthria Detection 57 6.2.2. Severity Assessment 58 6.1. Cross-Language 59 7. Discussion 62 7.1. Linguistic Implications 62 7.2. Clinical Applications 65 8. Conclusion 67 References 69 Appendix 76 Abstract in Korean 79Maste

    A cross-linguistic perspective to classification of healthiness of speech in Parkinson's disease

    Get PDF
    People with Parkinson's disease often experience communication problems. The current cross-linguistic study investigates how listeners' perceptual judgements of speech healthiness are related to the acoustic changes appearing in the speech of people with Parkinson's disease. Accordingly, we report on an online experiment targeting perceived healthiness of speech. We studied the relations between healthiness perceptual judgements and a set of acoustic characteristics of speech in a cross-sectional design. We recruited 169 participants, who performed a classification task judging speech recordings of Dutch speakers with Parkinson's disease and of Dutch control speakers as โ€˜healthyโ€™ or โ€˜unhealthyโ€™. The groups of listeners differed in their training and expertise in speech language therapy as well as in their native languages. Such group separation allowed us to investigate the acoustic correlates of speech healthiness without influence of the content of the recordings. We used a Random Forest method to predict listeners' responses. Our findings demonstrate that, independently of expertise and language background, when classifying speech as healthy or unhealthy listeners are more sensitive to speech rate, presence of phonation deficiency reflected by maximum phonation time measurement, and centralization of the vowels. The results indicate that both specifics of the expertise and language background may lead to listeners relying more on the features from either prosody or phonation domains. Our findings demonstrate that more global perceptual judgements of different listeners classifying speech of people with Parkinson's disease may be predicted with sufficient reliability from conventional acoustic features. This suggests universality of acoustic change in speech of people with Parkinson's disease. Therefore, we concluded that certain aspects of phonation and prosody serve as prominent markers of speech healthiness for listeners independent of their first language or expertise. Our findings have outcomes for the clinical practice and real-life implications for subjective perception of speech of people with Parkinson's disease, while information about particular acoustic changes that trigger listeners to classify speech as โ€˜unhealthyโ€™ can provide specific therapeutic targets in addition to the existing dysarthria treatment in people with Parkinson's disease

    Accurate synthesis of Dysarthric Speech for ASR data augmentation

    Full text link
    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. To evaluate the effectiveness for synthesis of training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decrease WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has significant impact on the dysarthric ASR systems. In addition, we have conducted a subjective evaluation to evaluate the dysarthric-ness and similarity of synthesized speech. Our subjective evaluation shows that the perceived dysartrhic-ness of synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthriaComment: arXiv admin note: text overlap with arXiv:2201.1157
    • โ€ฆ
    corecore