7 research outputs found

    Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

    Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest that the proposed adversarial data augmentation approach consistently outperformed the baseline speed-perturbation and non-VAE GAN augmentation methods when training hybrid TDNN and end-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility.
    Comment: Submitted to ICASSP 202
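
    A minimal PyTorch sketch of the VAE-GAN idea described above: an encoder maps acoustic feature frames to a latent space, a decoder generates frames back, and a discriminator judges real versus synthesized speech, so the generator-side loss combines reconstruction, KL regularization, and an adversarial term. All layer sizes, the 80-dim filterbank input, and the single-step training function are assumptions for illustration, not the paper's configuration.

```python
# Illustrative VAE-GAN for acoustic feature augmentation (PyTorch).
# Dimensions and losses are placeholders, not the paper's setup.
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM = 80, 32  # e.g. 80-dim filterbank frames (assumed)

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU())
        self.mu = nn.Linear(256, LATENT_DIM)
        self.logvar = nn.Linear(256, LATENT_DIM)
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, FEAT_DIM))
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))
    def forward(self, x):
        return self.net(x)

def generator_step(x, enc, dec, disc, bce=nn.BCEWithLogitsLoss()):
    """One generator-side step: reconstruct, regularize, try to fool D.
    (The discriminator's own update is omitted for brevity.)"""
    mu, logvar = enc(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
    x_hat = dec(z)
    recon = nn.functional.mse_loss(x_hat, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    adv = bce(disc(x_hat), torch.ones(x.size(0), 1))  # want D to say "real"
    return recon + kld + adv, x_hat
```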

    CDSD: Chinese Dysarthria Speech Database

    We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. The database comprises speech data from 24 participants with dysarthria: each participant recorded one hour of speech, and one of them recorded an additional 10 hours, resulting in 34 hours of speech material. To accommodate participants with varying cognitive levels, our text pool primarily consists of content from the AISHELL-1 dataset and speeches by primary and secondary school students. Participants read these texts aloud, recording themselves with either a mobile device or a ZOOM F8n multi-track field recorder. In this paper, we elucidate the data collection and annotation processes and present an approach for establishing a baseline for dysarthric speech recognition. Furthermore, we conducted a speaker-dependent dysarthric speech recognition experiment using the additional 10 hours of speech data from one of our participants. Our findings indicate that, given extensive data-driven pre-training, fine-tuning on limited quantities of an individual's data yields commendable results in speaker-dependent dysarthric speech recognition. However, we observe significant variation in recognition results among different dysarthric speakers. These insights provide valuable reference points for speaker-dependent dysarthric speech recognition.
    Comment: 9 pages, 3 figures
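
    A sketch of the speaker-dependent fine-tuning step the abstract describes: start from a large pre-trained model and update it on a small amount of one speaker's recordings. The checkpoint name, optimizer settings, and single-pair training loop below are assumptions, not the paper's recipe.

```python
# Hedged sketch: fine-tune a pre-trained Chinese wav2vec 2.0 CTC model on
# one target speaker's (audio, transcript) pairs. Checkpoint is assumed.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CKPT = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"  # assumed checkpoint
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT)
model.freeze_feature_encoder()  # keep low-level acoustic layers fixed

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(waveform, transcript):
    """One gradient step on a single pair from the target speaker."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor(text=transcript, return_tensors="pt").input_ids
    loss = model(input_values=inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```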

    Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

    Dysarthria is a disability that disturbs the human speech production system and reduces the quality and intelligibility of a person's speech. As a result, normal speech processing systems cannot work properly on impaired speech. Because this disability is usually associated with physical disabilities, designing a system that can perform tasks in a smart home by receiving voice commands would be a significant achievement. In this work, we introduce the gammatonegram as an effective method to represent audio files with discriminative details, which is used as input to a convolutional neural network. In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed CNN is based on transfer learning from a pre-trained AlexNet. In this research, the efficiency of the proposed system is evaluated for speech recognition, speaker identification, and intelligibility assessment. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a fully automatic multi-network speech recognition system, arranged in a cascade with the two-class intelligibility assessment system, whose output activates one of the speech recognition networks. This architecture achieves a word recognition rate (WRR) of 92.3%. The source code of this paper is available.
    Comment: 12 pages, 8 figures
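
    A rough sketch of the gammatonegram-as-image pipeline: compute a gammatone spectrogram, render it as a 3-channel image, and classify it with a pre-trained AlexNet whose final layer is replaced for transfer learning. The `gammatone` package, window settings, and the placeholder class count are assumptions, not the paper's exact configuration.

```python
# Hedged sketch: speech file -> gammatonegram image -> AlexNet classifier.
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import alexnet, AlexNet_Weights
from gammatone.gtgram import gtgram  # https://github.com/detly/gammatone

def gammatonegram_image(wave, fs=16_000, channels=64):
    """Log-compressed gammatone spectrogram as a 3x224x224 tensor."""
    g = gtgram(wave, fs, window_time=0.025, hop_time=0.010,
               channels=channels, f_min=50)
    g = np.log(g + 1e-8)                      # log-compress band energies
    g = (g - g.min()) / (g.max() - g.min())   # normalize to [0, 1]
    img = torch.tensor(g, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)
    return nn.functional.interpolate(img.unsqueeze(0), size=(224, 224),
                                     mode="bilinear").squeeze(0)

NUM_CLASSES = 10  # e.g. number of voice commands (placeholder)
model = alexnet(weights=AlexNet_Weights.DEFAULT)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)  # new head for transfer learning
```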

    On the Impact of Dysarthric Speech on Contemporary ASR Cloud Platforms

    The spread of voice-driven devices has a positive impact for people with disabilities in smart environments, since such devices allow them to perform a series of daily activities that were difficult or impossible before. As a result, their quality of life and autonomy increase. However, the speech recognition technology employed in such devices performs poorly for people with communication disorders, like dysarthria. People with dysarthria may be unable to control their smart environments, at least with the needed proficiency; this problem may negatively affect the perceived reliability of the entire environment. Using the TORGO database of speech samples pronounced by people with dysarthria, this paper compares the accuracy of dysarthric speech recognition achieved by three speech recognition cloud platforms, namely IBM Watson Speech-to-Text, Google Cloud Speech, and Microsoft Azure Bing Speech. Such services, indeed, are used in many virtual assistants deployed in smart environments, such as Google Home. The goal is to investigate whether such cloud platforms are usable to recognize dysarthric speech, and to understand which of them is the most suitable for people with dysarthria. Results suggest that the three platforms have comparable performance in recognizing dysarthric speech, and that recognition accuracy is related to the speech intelligibility of the person. Overall, the platforms are limited when dysarthric speech intelligibility is low (word error rates of 80-90%), while they improve, reaching word error rates of 15-25%, for people without abnormality in their speech intelligibility.
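
    A minimal sketch of the evaluation step behind such a comparison: given reference prompts and the hypothesis transcripts returned by each cloud service, compute a per-platform word error rate. The transcripts below are illustrative placeholders, and the `jiwer` package is one common WER implementation, not necessarily what the authors used.

```python
# Hedged sketch: compare per-platform WER on the same reference prompts.
from jiwer import wer

references = ["the quick brown fox", "open the window"]  # placeholder prompts
hypotheses = {
    "IBM Watson Speech-to-Text":   ["the quick brown fox", "open a window"],
    "Google Cloud Speech":         ["the quick round fox", "open the window"],
    "Microsoft Azure Bing Speech": ["quick brown fox", "open the window"],
}

for platform, hyps in hypotheses.items():
    # jiwer aggregates errors over the whole list of utterance pairs
    print(f"{platform}: WER = {wer(references, hyps):.2%}")
```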

    Automatic Detection and Assessment of Dysarthric Speech Using Prosodic Information

    Get PDF
    Master's thesis, Seoul National University Graduate School, College of Humanities, Department of Linguistics, August 2020. Minhwa Chung.
    Speech impairments are among the earliest cues of neurological or degenerative disorders. Individuals with Parkinson's Disease, Cerebral Palsy, Amyotrophic Lateral Sclerosis, and Multiple Sclerosis, among others, are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles, which eventually leads to severe misarticulation. However, impairments in the suprasegmental domain are also present, and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosody-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to determine not only how the prosody of speech is affected by dysarthria, but also which aspects of prosody are more affected and how prosodic impairments change with the severity of dysarthria.
    In the current study, several prosodic features related to pitch, voice quality, rhythm, and speech rate are used as features for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features is optimal for accurate detection. After selecting an optimal set of prosodic features, we use them as input to machine learning-based classifiers and assess performance using the evaluation metrics accuracy, precision, recall, and F1-score. Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g., mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers, where both Korean and English data are used for training but only one language is used for testing. Results suggest that, in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both the Korean and English datasets. In particular, large improvements were seen when assessing different severity levels. For English, a relative accuracy improvement of 1.82% for detection and 20.6% for assessment was seen. The Korean dataset saw no improvement for detection but a relative improvement of 13.6% for assessment. The cross-language experiments showed a relative improvement of up to 4.12% in comparison to using only a single language during training. It was found that certain prosodic impairments, such as those in pitch and duration, may be language independent. Therefore, when training sets for individual languages are limited, they may be supplemented by including data from other languages.
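
    A simplified sketch of this kind of pipeline: extract a few prosodic measurements per utterance, run feature selection, and train a classifier. The specific features, pitch range, and model parameters below are illustrative assumptions, not the thesis setup.

```python
# Hedged sketch: prosodic features -> feature selection -> classifier.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

def prosodic_features(path):
    """A few utterance-level prosodic measurements (illustrative set)."""
    y, sr = librosa.load(path, sr=16_000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]                    # keep voiced-frame pitch values
    duration = len(y) / sr
    return np.array([
        f0.mean() if f0.size else 0.0,        # mean pitch
        f0.std() if f0.size else 0.0,         # pitch variability
        voiced.mean(),                        # proportion of voiced frames
        duration,                             # utterance length (rate proxy)
    ])

# X: feature matrix over utterances; y: 0 = healthy, 1 = dysarthric
# X = np.stack([prosodic_features(p) for p in wav_paths])
clf = make_pipeline(SelectKBest(f_classif, k=3),
                    RandomForestClassifier(n_estimators=200, random_state=0))
# clf.fit(X_train, y_train); clf.score(X_test, y_test)
```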

    Synthesizing Dysarthric Speech Using Multi-Speaker TTS for Dysarthric Speech Recognition

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility due to slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking are proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS with a dysarthria severity level coefficient and a pause insertion model, so that dysarthric speech can be synthesized at varying severity levels. In addition, we extend this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity levels. This approach increases the controllability of the system, allowing us to generate dysarthric speech spanning a broader range. To evaluate the effectiveness of the synthesized training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by a further 6.5%, showing the effectiveness of these parameters. Overall, results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems.
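
    A toy sketch of the conditioning idea: combine a multi-speaker TTS encoder state with a speaker embedding plus a continuous dysarthria severity coefficient, so one model can synthesize speech across severity levels. All dimensions and the projection layers are assumptions for illustration, not the dissertation's architecture.

```python
# Hedged sketch: severity-conditioned multi-speaker TTS encoder output.
import torch
import torch.nn as nn

class SeverityConditioner(nn.Module):
    def __init__(self, enc_dim=256, spk_dim=64):
        super().__init__()
        self.severity_proj = nn.Linear(1, spk_dim)  # scalar severity -> embedding
        self.merge = nn.Linear(enc_dim + spk_dim, enc_dim)

    def forward(self, encoder_out, speaker_emb, severity):
        # encoder_out: (batch, time, enc_dim); severity: (batch, 1) in [0, 1]
        cond = speaker_emb + self.severity_proj(severity)         # (batch, spk_dim)
        cond = cond.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        return self.merge(torch.cat([encoder_out, cond], dim=-1))

# Example: sweep the coefficient to broaden the range of generated severities.
cond = SeverityConditioner()
enc, spk = torch.randn(2, 100, 256), torch.randn(2, 64)
mild = cond(enc, spk, torch.tensor([[0.2], [0.2]]))
severe = cond(enc, spk, torch.tensor([[0.9], [0.9]]))
```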