
    Automatic speech recognition with deep neural networks for impaired speech

    The final publication is available at https://link.springer.com/chapter/10.1007%2F978-3-319-49169-1_10. Automatic Speech Recognition has reached almost human performance in some controlled scenarios. However, recognition of impaired speech is a difficult task for two main reasons: data is (i) scarce and (ii) heterogeneous. In this work we train different architectures on a database of dysarthric speech. A comparison between architectures shows that, even with a small database, hybrid DNN-HMM models outperform classical GMM-HMM models according to word error rate measures. A DNN is able to improve the recognition word error rate by 13% for subjects with dysarthria with respect to the best classical architecture. This improvement is higher than the one given by other deep neural networks such as CNNs, TDNNs and LSTMs. All the experiments have been done with the Kaldi speech recognition toolkit, for which we have adapted several recipes to deal with dysarthric speech and work on the TORGO database. These recipes are publicly available. Peer Reviewed. Postprint (author's final draft).
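    The word error rate comparisons above rely on the standard edit-distance definition of WER. As a reference point, here is a minimal Python sketch of that metric; the paper's own numbers come from Kaldi's scoring scripts, which this does not reproduce.

```python
# Minimal sketch of word error rate (WER): edit distance over words,
# normalized by reference length. Not the Kaldi implementation.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the boy went home", "the boy when home"))  # one substitution -> 0.25
```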

    Accurate synthesis of Dysarthric Speech for ASR data augmentation

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. To evaluate the effectiveness of the synthesized training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on dysarthric ASR systems. In addition, we have conducted a subjective evaluation of the dysarthric-ness and similarity of the synthesized speech. Our subjective evaluation shows that the perceived dysarthric-ness of the synthesized speech is similar to that of true dysarthric speech, especially at higher levels of dysarthria. Comment: arXiv admin note: text overlap with arXiv:2201.1157
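    As an illustration of the conditioning idea described above (not the authors' implementation), a severity level coefficient can be injected into a multi-speaker TTS encoder as an extra learned bias on every frame. The module below is a hedged PyTorch sketch; the names, dimensions, and GRU encoder are all illustrative assumptions.

```python
# Hedged sketch: conditioning a multi-speaker TTS encoder on a scalar
# dysarthria severity coefficient. Architecture and sizes are placeholders.
import torch
import torch.nn as nn

class SeverityConditionedEncoder(nn.Module):
    def __init__(self, n_phones=80, d_model=256, n_speakers=10):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        # Project the scalar severity coefficient into the model dimension
        # so it can bias every encoder timestep.
        self.severity_proj = nn.Linear(1, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, phones, speaker_id, severity):
        # phones: (B, T) int; speaker_id: (B,) int; severity: (B,) float in [0, 1]
        x = self.phone_emb(phones)
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)
        x = x + self.severity_proj(severity.unsqueeze(-1)).unsqueeze(1)
        out, _ = self.encoder(x)
        return out  # a downstream decoder would predict mel frames from this

enc = SeverityConditionedEncoder()
phones = torch.randint(0, 80, (2, 12))
out = enc(phones, torch.tensor([0, 3]), torch.tensor([0.2, 0.9]))
print(out.shape)  # torch.Size([2, 12, 256])
```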

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand the differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS, adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate dysarthric speech spanning a broader range. To evaluate the effectiveness of the synthesized training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on dysarthric ASR systems.
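    The pause insertion model mentioned above can be pictured as a step that inserts silence tokens between words with a probability that grows with severity. The sketch below is purely illustrative: the linear severity-to-probability mapping and the rates are made-up placeholders, whereas the dissertation learns pause behaviour from data.

```python
# Illustrative pause insertion step driven by severity; probabilities are
# assumed placeholders, not the dissertation's learned model.
import random

def insert_pauses(words, severity, pause_token="<sil>", base_rate=0.05, seed=None):
    """Insert pause tokens between words; higher severity -> more pauses."""
    rng = random.Random(seed)
    p = min(1.0, base_rate + 0.4 * severity)  # assumed linear mapping
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i < len(words) - 1 and rng.random() < p:
            out.append(pause_token)
    return out

print(insert_pauses("the quick brown fox".split(), severity=0.8, seed=0))
```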

    Simulating dysarthric speech for training data augmentation in clinical speech applications

    Training machine learning algorithms for speech applications requires large, labeled training data sets. This is problematic for clinical applications, where obtaining such data is prohibitively expensive because of privacy concerns or lack of access. As a result, clinical speech applications are typically developed using small data sets with only tens of speakers. In this paper, we propose a method for simulating training data for clinical applications by transforming healthy speech to dysarthric speech using adversarial training. We evaluate the efficacy of our approach using both objective and subjective criteria. We present the transformed samples to five experienced speech-language pathologists (SLPs) and ask them to identify the samples as healthy or dysarthric. The results reveal that the SLPs identify the transformed speech as dysarthric 65% of the time. In a pilot classification experiment, we show that by using the simulated speech samples to balance an existing dataset, the classification accuracy improves by about 10% after data augmentation. Comment: Will appear in Proc. of ICASSP 201
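    The healthy-to-dysarthric transformation above follows the usual adversarial recipe: a generator maps healthy speech features toward dysarthric-sounding ones while a discriminator learns to tell real dysarthric features from transformed ones. Below is a bare-bones PyTorch skeleton of that training loop on stand-in random features; the architectures, feature type, and losses are assumptions, not the paper's actual setup.

```python
# Bare-bones adversarial training skeleton on toy data; stand-in for the
# healthy-to-dysarthric feature transformation idea, not the paper's system.
import torch
import torch.nn as nn

FEAT_DIM = 40  # e.g., mel filterbank features per frame (assumed)

G = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, FEAT_DIM))
D = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):                     # toy loop on random stand-in data
    healthy = torch.randn(32, FEAT_DIM)     # stand-in for healthy features
    dysarthric = torch.randn(32, FEAT_DIM)  # stand-in for dysarthric features

    # Discriminator: score real dysarthric features as 1, transforms as 0.
    fake = G(healthy).detach()
    loss_d = bce(D(dysarthric), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator into scoring its transforms as real.
    loss_g = bce(D(G(healthy)), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```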

    Automatic Detection and Assessment of Dysarthric Speech Using Prosodic Information

    Master's thesis (MA) -- Seoul National University Graduate School: Department of Linguistics, College of Humanities, 2020. 8. Minhwa Chung. Speech impairments are among the earliest cues for neurological or degenerative disorders. Individuals with Parkinson's disease, cerebral palsy, amyotrophic lateral sclerosis, and multiple sclerosis, among others, are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles, which eventually leads to severe misarticulation. However, impairments in the suprasegmental domain are also present, and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosody-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to determine not only how the prosody of speech is affected by dysarthria, but also which aspects of prosody are more affected and how prosodic impairments change with the severity of dysarthria. In the current study, several prosodic features related to pitch, voice quality, rhythm, and speech rate are used as features for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features is optimal for accurate detection. After selecting an optimal set of prosodic features, we use them as input to machine learning-based classifiers and assess the performance using the evaluation metrics accuracy, precision, recall, and F1-score. Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g., mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers, where both Korean and English data are used for training but only one language is used for testing. Results suggest that, in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both the Korean and English datasets. In particular, large improvements were seen when assessing different severity levels. For English, a relative accuracy improvement of 1.82% for detection and 20.6% for assessment was seen. The Korean dataset saw no improvement for detection but a relative improvement of 13.6% for assessment. The cross-language experiments showed a relative improvement of up to 4.12% in comparison to using only a single language during training. It was found that certain prosodic impairments, such as pitch and duration, may be language independent. Therefore, when training sets for individual languages are limited, they may be supplemented by including data from other languages.
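    The detection pipeline in this thesis (prosodic features, feature selection, machine learning classifier, then accuracy/precision/recall/F1) can be sketched end to end with scikit-learn. In the hedged example below, a random matrix stands in for per-utterance prosodic measurements, and SelectKBest plus a random forest stand in for the thesis's feature selection algorithms and classifiers.

```python
# Hedged end-to-end sketch of the detection pipeline; data are random
# stand-ins for prosodic features (pitch, voice quality, rate, rhythm).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))        # 200 utterances x 30 prosodic features
y = rng.integers(0, 2, size=200)      # 0 = healthy, 1 = dysarthric

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(SelectKBest(f_classif, k=10),  # keep 10 most informative features
                    RandomForestClassifier(n_estimators=200, random_state=0))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

prec, rec, f1, _ = precision_recall_fscore_support(y_te, pred, average="binary")
print(f"acc={accuracy_score(y_te, pred):.2f} "
      f"precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```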

    Improving Dysarthric Speech Recognition by Enriching Training Datasets

    Dysarthria is a motor speech disorder that results from disruptions in the neuro-motor interface. It is characterised by poor articulation of phonemes and hypernasality, and is characteristically different from normal speech. Many modern automatic speech recognition systems focus on a narrow range of speech diversity and, as a consequence, exclude groups of speakers who deviate in gender, race, age, or speech impairment when building training datasets. This study attempts to develop an automatic speech recognition system that deals with dysarthric speech using limited dysarthric speech data. Speech utterances collected from the TORGO database are used to conduct experiments on a wav2vec2.0 model trained only on the LibriSpeech 960h dataset, to obtain a baseline word error rate (WER) when recognising dysarthric speech. A version of the LibriSpeech model fine-tuned on multi-language datasets was tested to see if it would improve accuracy, and achieved a top reduction of 24.15% in the WER for one of the male dysarthric speakers in the dataset. Transfer learning with speech recognition models and preprocessing dysarthric speech to improve its intelligibility using generative adversarial networks were limited in their potential due to the lack of a dysarthric speech dataset of adequate size. The main conclusion drawn from this study is that a large, diverse dysarthric speech dataset, comparable in size to the datasets used to train machine learning ASR systems like LibriSpeech, with different types of speech, scripted and unscripted, is required to improve performance.
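    A baseline like the one described, a wav2vec2.0 model trained only on LibriSpeech 960h transcribing a TORGO utterance, can be reproduced in outline with the Hugging Face transformers library. The model name and file path below are illustrative assumptions; the model expects 16 kHz mono audio.

```python
# Hedged baseline sketch: transcribe one utterance with a LibriSpeech-only
# wav2vec2.0 model. "torgo_utterance.wav" is an assumed local file.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("torgo_utterance.wav")
if sr != 16000:  # resample to the 16 kHz rate the model was trained on
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
transcript = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcript)  # compare against the reference prompt to compute WER
```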

    On the Impact of Dysarthric Speech on Contemporary ASR Cloud Platforms

    The spread of voice-driven devices has a positive impact for people with disabilities in smart environments, since such devices allow them to perform a series of daily activities that were difficult or impossible before. As a result, their quality of life and autonomy increase. However, the speech recognition technology employed in such devices becomes limited for people with communication disorders, such as dysarthria. People with dysarthria may be unable to control their smart environments, at least with the needed proficiency; this problem may negatively affect the perceived reliability of the entire environment. By exploiting the TORGO database of speech samples pronounced by people with dysarthria, this paper compares the accuracy of dysarthric speech recognition as achieved by three speech recognition cloud platforms, namely IBM Watson Speech-to-Text, Google Cloud Speech, and Microsoft Azure Bing Speech. Such services, indeed, are used in many virtual assistants deployed in smart environments, such as Google Home. The goal is to investigate whether such cloud platforms are usable to recognize dysarthric speech, and to understand which of them is the most suitable for people with dysarthria. Results suggest that the three platforms have comparable performance in recognizing dysarthric speech, and that the accuracy of the recognition is related to the speech intelligibility of the person. Overall, the platforms are limited when dysarthric speech intelligibility is low (80-90% word error rate), while they improve, reaching a word error rate of 15-25%, for people without abnormality in their speech intelligibility.
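    For context, querying one of the three platforms looks roughly like the following sketch for Google Cloud Speech (the Python client's v1 API). Credentials via GOOGLE_APPLICATION_CREDENTIALS and the WAV file are assumed, and the paper's evaluation pipeline around the calls is not shown.

```python
# Hedged sketch: one synchronous recognition request to Google Cloud Speech.
# Assumes a 16 kHz LINEAR16 WAV file named "sample.wav" and valid credentials.
from google.cloud import speech

client = speech.SpeechClient()
with open("sample.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
response = client.recognize(config=config, audio=audio)
for result in response.results:
    # The top alternative would be scored against the TORGO prompt for WER.
    print(result.alternatives[0].transcript)
```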

    Exploring appropriate acoustic and language modelling choices for continuous dysarthric speech recognition

    There has been much recent interest in building continuous speech recognition systems for people with severe speech impairments, e.g., dysarthria. However, the datasets that are commonly used are typically designed for tasks other than ASR development, or they contain only isolated words. As such, they contain much overlap in the prompts read by the speakers. Previous ASR evaluations have often neglected this, using language models (LMs) trained on non-disjoint training and test data, potentially producing unrealistically optimistic results. In this paper, we investigate the impact of LM design using the widely used TORGO database. We combine state-of-the-art acoustic models with LMs trained on data originating from LibriSpeech. Using LMs with varying vocabulary sizes, we examine the trade-off between the out-of-vocabulary rate and recognition confusions for speakers with varying degrees of dysarthria. It is found that the optimal LM complexity is highly speaker dependent, highlighting the need to design speaker-dependent LMs alongside speaker-dependent acoustic models when considering atypical speech.
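    The out-of-vocabulary side of that trade-off is easy to make concrete: as the LM vocabulary is grown from the most frequent training words, fewer test words fall outside it, at the cost of a larger, more confusable search space. The toy corpora below are placeholders, not TORGO or LibriSpeech data.

```python
# Toy sketch of OOV rate versus vocabulary size; corpora are placeholders.
from collections import Counter

lm_corpus = "the cat sat on the mat the dog sat on the rug".split()
test_words = "the cat lay on a rug".split()

counts = Counter(lm_corpus)
for vocab_size in (2, 4, 7):
    # Vocabulary = the top-N most frequent words in the LM training text.
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    oov = sum(w not in vocab for w in test_words) / len(test_words)
    print(f"vocab={vocab_size}: OOV rate = {oov:.2f}")
```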

    Articulatory Knowledge in the Recognition of Dysarthric Speech
