178 research outputs found
Automatic Detection and Assessment of Dysarthric Speech Using Prosodic Information
Thesis (master's) -- Seoul National University Graduate School: College of Humanities, Department of Linguistics, 2020. 8. Minhwa Chung.
Speech impairment is one of the earliest symptoms of neurological or degenerative disorders. Dysarthria appears in a variety of patient groups, including people with Parkinson's disease, cerebral palsy, amyotrophic lateral sclerosis, and multiple sclerosis. Caused by damage to the nerves controlling the articulatory organs, dysarthria is chiefly characterized by imprecise articulation and is also reported to affect prosody. Previous studies have used prosody-based measures to distinguish dysarthric from non-disordered speech. In clinical settings, prosody-based analysis of dysarthric speech can help clinicians diagnose dysarthria or prepare treatment appropriate to the profile of the impairment. It is therefore necessary to examine closely not only how dysarthria affects prosody in general but also the specific prosodic characteristics of dysarthric speech: which aspects of prosody are affected by dysarthria, and how prosodic impairment varies with severity. This thesis examines several aspects of prosody, including pitch, voice quality, speech rate, and rhythm, and applies them to the detection and assessment of dysarthria. The extracted prosodic features were optimized through several feature selection algorithms and used as input to machine learning-based classifiers. Classifier performance was evaluated with accuracy, precision, recall, and F1-score. The thesis also analyzes the usefulness of prosodic information across severity levels (mild, moderate, severe). Finally, because collecting impaired speech is difficult, cross-language classifiers were used: Korean and English dysarthric speech served as the training sets, while each test set contained only the target language. The experimental results suggest three findings. First, using prosodic information helps detect and assess dysarthria: compared with using MFCCs alone, adding prosodic information helped on both the Korean and English datasets. Second, prosodic information is especially useful for assessment: English showed relative accuracy improvements of 1.82% for detection and 20.6% for assessment, while Korean showed no improvement for detection but a 13.6% relative improvement for assessment. Third, cross-language classifiers outperform monolingual classifiers, showing a 4.12% higher relative accuracy. This suggests that certain prosodic impairments are language-independent and that data-scarce training sets can be supplemented with data from other languages.
One of the earliest cues for neurological or degenerative disorders is speech impairment. Individuals with Parkinson's disease, cerebral palsy, amyotrophic lateral sclerosis, and multiple sclerosis, among others, are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles, which eventually leads to severe misarticulation. However, impairments in the suprasegmental domain are also present, and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosody-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to determine not only how the prosody of speech is affected by dysarthria, but also which aspects of prosody are more affected and how prosodic impairments change with the severity of dysarthria.
In the current study, several prosodic features related to pitch, voice quality, rhythm, and speech rate are used as features for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features is optimal for accurate detection. After selecting an optimal set of prosodic features, we use them as input to machine learning-based classifiers and assess the performance using the evaluation metrics accuracy, precision, recall, and F1-score. Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g. mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers in which both Korean and English data are used for training but only one language is used for testing. Results suggest that, in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both the Korean and English datasets. In particular, large improvements were seen when assessing different severity levels. For English, relative accuracy improvements of 1.82% for detection and 20.6% for assessment were seen. The Korean dataset saw no improvement for detection but a relative improvement of 13.6% for assessment. The results from cross-language experiments showed a relative improvement of up to 4.12% in comparison to using only a single language during training. It was found that certain prosodic impairments, such as those in pitch and duration, may be language independent. Therefore, when training sets for individual languages are limited, they may be supplemented by including data from other languages.
1. Introduction 1
1.1. Dysarthria 1
1.2. Impaired Speech Detection 3
1.3. Research Goals & Outline 6
2. Background Research 8
2.1. Prosodic Impairments 8
2.1.1. English 8
2.1.2. Korean 10
2.2. Machine Learning Approaches 12
3. Database 18
3.1. English-TORGO 20
3.2. Korean-QoLT 21
4. Methods 23
4.1. Prosodic Features 23
4.1.1. Pitch 23
4.1.2. Voice Quality 26
4.1.3. Speech Rate 29
4.1.4. Rhythm 30
4.2. Feature Selection 34
4.3. Classification Models 38
4.3.1. Random Forest 38
4.3.2. Support Vector Machine 40
4.3.3. Feed-Forward Neural Network 42
4.4. Mel-Frequency Cepstral Coefficients 43
5. Experiment 46
5.1. Model Parameters 47
5.2. Training Procedure 48
5.2.1. Dysarthria Detection 48
5.2.2. Severity Assessment 50
5.2.3. Cross-Language 51
6. Results 52
6.1. TORGO 52
6.1.1. Dysarthria Detection 52
6.1.2. Severity Assessment 56
6.2. QoLT 57
6.2.1. Dysarthria Detection 57
6.2.2. Severity Assessment 58
6.3. Cross-Language 59
7. Discussion 62
7.1. Linguistic Implications 62
7.2. Clinical Applications 65
8. Conclusion 67
References 69
Appendix 76
Abstract in Korean 79
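The thesis above evaluates its classifiers with accuracy, precision, recall, and F1-score. A minimal sketch of how these four metrics are computed for a binary dysarthria-detection task; the labels and predictions below are invented toy data, not values from the thesis:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = dysarthric)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: five utterances, three truly dysarthric.
acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

F1 balances precision and recall, which matters here because healthy and dysarthric utterances are rarely balanced in clinical datasets.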
SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available from dysarthric talkers.
In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking are proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS, adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, we extend this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that provide only discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate dysarthric speech with a broader range of characteristics.
To evaluate their effectiveness for synthesizing training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall, results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned speech for training has a significant impact on dysarthric ASR systems.
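The WER figures quoted above are improvements over a baseline. Assuming they are relative improvements, the usual convention in ASR papers, the computation is simple; the absolute WER values below are invented for illustration, not numbers from the dissertation:

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER reduction, expressed as a percentage of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0

# Hypothetical numbers: a 40.0% baseline WER dropping to 35.12%
# corresponds to a 12.2% relative improvement.
improvement = relative_wer_improvement(40.0, 35.12)
```

Stating improvements relative to the baseline lets results be compared across systems whose absolute error rates differ.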
Accurate synthesis of Dysarthric Speech for ASR data augmentation
Dysarthria is a motor speech disorder often characterized by reduced speech
intelligibility through slow, uncoordinated control of speech production
muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers
communicate more effectively. However, robust dysarthria-specific ASR requires
a significant amount of training speech, which is not readily available for
dysarthric talkers. This paper presents a new dysarthric speech synthesis
method for the purpose of ASR training data augmentation. Differences in
prosodic and acoustic characteristics of dysarthric spontaneous speech at
varying severity levels are important components for dysarthric speech
modeling, synthesis, and augmentation. For dysarthric speech synthesis, a
modified neural multi-talker TTS is implemented by adding a dysarthria severity
level coefficient and a pause insertion model to synthesize dysarthric speech
for varying severity levels. To evaluate the effectiveness of the synthesized
training data for ASR, dysarthria-specific speech recognition was used. Results
show that a DNN-HMM model trained on additional synthetic dysarthric speech
achieves a WER improvement of 12.2% compared to the baseline, and that the
addition of the severity level and pause insertion controls decreases WER by
6.5%, showing the effectiveness of adding these parameters. Overall, results on
the TORGO database demonstrate that using dysarthric synthetic speech to
increase the amount of dysarthric-patterned speech for training has a
significant impact on dysarthric ASR systems. In addition, we have conducted a
subjective evaluation to evaluate the dysarthric-ness and similarity of
synthesized speech. Our subjective evaluation shows that the perceived
dysarthric-ness of synthesized speech is similar to that of true dysarthric
speech, especially for higher levels of dysarthria.
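The pause insertion model described above is a learned component of the TTS system; as a toy, rule-based sketch of the underlying idea, one can scale the chance of inserting a pause at each word boundary with a severity coefficient. All names, the 0.6 cap, and the rule itself are assumptions for illustration, not the paper's actual model:

```python
import random

def insert_pauses(words, severity, rng):
    """Insert '<pause>' tokens at word boundaries with probability
    proportional to a severity coefficient in [0, 1]."""
    out = [words[0]]
    for w in words[1:]:
        if rng.random() < 0.6 * severity:  # 0.6: assumed maximum pause rate
            out.append("<pause>")
        out.append(w)
    return out

# Same random stream for both severities, so pauses are comparable.
mild = insert_pauses("the quick brown fox jumps".split(), 0.1, random.Random(0))
severe = insert_pauses("the quick brown fox jumps".split(), 0.9, random.Random(0))
```

With a shared seed, every boundary that receives a pause at low severity also receives one at high severity, so pause counts grow monotonically with the coefficient.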
How expertise and language familiarity influence perception of speech of people with Parkinson's disease
Parkinson's disease (PD) is a progressive neurological disorder characterized by several motor and non-motor manifestations. PD frequently leads to hypokinetic dysarthria, which affects speech production and often has a detrimental impact on everyday communication. Among the typical manifestations of hypokinetic dysarthria, speech and language therapists (SLTs) identify prosody as the most affected cluster of speech characteristics. However, less is known about how untrained listeners perceive PD speech and how affected prosody influences their assessments of speech. This study explores the perception of sentence type intonation and healthiness of PD speech by listeners with different levels of familiarity with speech disorders in Dutch. We investigated differences in assessments and classification accuracy between Dutch-speaking SLTs (n = 18) and Dutch- and non-Dutch-speaking untrained listeners (n = 27 and n = 124, respectively). We collected speech data from 30 Dutch speakers diagnosed with PD and 30 Dutch healthy controls. The stimulus set consisted of short phrases from spontaneous and read speech and of phrases produced with different sentence type intonation. Listeners participated in an online experiment targeting classification of sentence type intonation and perceived healthiness of speech. Results indicate that both familiarity with speech disorders and familiarity with the speakers' language are significant and have different effects depending on the task type, as different listener groups demonstrate different classification accuracy. There is evidence that untrained Dutch listeners classify PD speech as unhealthy more accurately than both trained Dutch and untrained non-Dutch listeners, while trained Dutch listeners outperform the other two groups in sentence type classification.
A cross-linguistic perspective to classification of healthiness of speech in Parkinson's disease
People with Parkinson's disease often experience communication problems. The current cross-linguistic study investigates how listeners' perceptual judgements of speech healthiness are related to the acoustic changes appearing in the speech of people with Parkinson's disease. Accordingly, we report on an online experiment targeting perceived healthiness of speech. We studied the relations between healthiness judgements and a set of acoustic characteristics of speech in a cross-sectional design. We recruited 169 participants, who performed a classification task judging speech recordings of Dutch speakers with Parkinson's disease and of Dutch control speakers as "healthy" or "unhealthy". The groups of listeners differed in their training and expertise in speech and language therapy as well as in their native languages. This group separation allowed us to investigate the acoustic correlates of speech healthiness without influence of the content of the recordings. We used a Random Forest method to predict listeners' responses. Our findings demonstrate that, independently of expertise and language background, when classifying speech as healthy or unhealthy, listeners are most sensitive to speech rate, the presence of phonation deficiency reflected by the maximum phonation time measurement, and centralization of the vowels. The results indicate that the specifics of both expertise and language background may lead listeners to rely more on features from either the prosody or the phonation domain. Our findings demonstrate that the more global perceptual judgements of different listeners classifying speech of people with Parkinson's disease can be predicted with sufficient reliability from conventional acoustic features. This suggests universality of acoustic change in the speech of people with Parkinson's disease.
Therefore, we conclude that certain aspects of phonation and prosody serve as prominent markers of speech healthiness for listeners independent of their first language or expertise. Our findings have implications for clinical practice and for real-life subjective perception of the speech of people with Parkinson's disease, while information about the particular acoustic changes that trigger listeners to classify speech as "unhealthy" can provide specific therapeutic targets in addition to existing dysarthria treatment in people with Parkinson's disease.
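One of the acoustic cues named above, centralization of the vowels, is commonly quantified from corner-vowel formant frequencies, for example with the Formant Centralization Ratio (FCR) of Sapir and colleagues. The formant values below are rough illustrative figures, not data from this study:

```python
def formant_centralization_ratio(f1_i, f1_u, f1_a, f2_i, f2_u, f2_a):
    """FCR = (F2u + F2a + F1i + F1u) / (F2i + F1a).
    Higher values indicate a more centralized (compressed) vowel space."""
    return (f2_u + f2_a + f1_i + f1_u) / (f2_i + f1_a)

# Illustrative corner-vowel formants (Hz) for /i/, /u/, /a/:
# a well-dispersed vowel space vs. a centralized one.
healthy = formant_centralization_ratio(
    f1_i=270, f1_u=300, f1_a=730, f2_i=2290, f2_u=870, f2_a=1090)
centralized = formant_centralization_ratio(
    f1_i=350, f1_u=380, f1_a=650, f2_i=1900, f2_u=1100, f2_a=1200)
```

The ratio rises as /i/, /u/, and /a/ drift toward the centre of the formant space, which makes it a convenient scalar summary of vowel centralization.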
More than words: Recognizing speech of people with Parkinson's disease
Parkinson's disease (PD) is the fastest-growing neurological disorder in the world, with approximately 10 million people currently living with the diagnosis. Hypokinetic dysarthria (HD) is one of the symptoms that appear in the early stages of the disease progression. The main aim of this dissertation is to gain insight into listeners' impressions of dysarthric speech and to uncover acoustic correlates of those impressions. We do this by exploring two sides of communication: speech production by people with PD, and listeners' recognition of the speech of people with PD. The studies in this dissertation therefore approach the topic of speech changes in PD from both the speakers' side, via acoustic analysis of speech, and the listeners' side, via experiments exploring the influence of expertise and language background on recognition of the speech of people with PD. Moreover, to obtain a more comprehensive picture of these perspectives, the studies in this dissertation are multifaceted, explore cross-linguistic aspects of dysarthric speech recognition, and include both cross-sectional and longitudinal designs. The results demonstrate that listeners' ability to recognize the speech of people with PD as unhealthy is rooted in the acoustic changes in speech, not in its content. Listeners' experience in the fields of speech and language therapy or the speech sciences affects dysarthric speech recognition. The results also suggest that tracking speech parameters is a useful tool for monitoring the progression and/or development of dysarthria and for objectively evaluating the long-term effects of speech therapy.
Acoustic identification of sentence accent in speakers with dysarthria: cross-population validation and severity related patterns
Dysprosody is a hallmark of dysarthria, and it can affect the intelligibility and naturalness of speech. This includes sentence accent, which helps to draw listeners' attention to important information in the message. Although some studies have investigated this feature, we currently lack properly validated automated procedures that can distinguish between the subtle performance differences observed across speakers with dysarthria. This study aims for cross-population validation of a set of acoustic features that have previously been shown to correlate with sentence accent. In addition, the impact of dysarthria severity level on sentence accent production is investigated. Two groups of adults were analysed (Dutch and English speakers). Fifty-eight participants with dysarthria and 30 healthy control participants (HCP) produced sentences with varying accent positions. All speech samples were evaluated perceptually and analysed acoustically with an algorithm that extracts ten meaningful prosodic features and allows a classification between accented and unaccented syllables based on a linear combination of these parameters. The data were statistically analysed using discriminant analysis. Within the Dutch and English dysarthric populations, the algorithm correctly identified 82.8% and 91.9% of the accented target syllables, respectively, indicating that its capacity to discriminate between accented and unaccented syllables in a sentence is consistent with perceptual impressions. Moreover, different strategies for accent production across dysarthria severity levels could be demonstrated, which is an important step toward a better understanding of the nature of the deficit and the automatic classification of dysarthria severity using prosodic features.
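The classification between accented and unaccented syllables based on a linear combination of prosodic parameters can be sketched as a simple linear discriminant score. The feature names, weights, and bias below are invented for illustration, not the study's fitted coefficients:

```python
def accent_score(features, weights, bias):
    """Linear discriminant score: positive -> classify the syllable as accented."""
    return sum(weights[name] * value for name, value in features.items()) + bias

# Hypothetical z-scored prosodic features for two syllables.
weights = {"f0_peak": 1.2, "duration": 0.8, "intensity": 0.6}
bias = -0.5

accented = accent_score(
    {"f0_peak": 1.1, "duration": 0.9, "intensity": 0.4}, weights, bias)
unaccented = accent_score(
    {"f0_peak": -0.7, "duration": -0.2, "intensity": 0.1}, weights, bias)
```

In the actual study the weights of the ten prosodic features would be estimated by discriminant analysis; the sign of the resulting score then yields the accented/unaccented decision.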
ACOUSTIC SPEECH MARKERS FOR TRACKING CHANGES IN HYPOKINETIC DYSARTHRIA ASSOCIATED WITH PARKINSON'S DISEASE
Previous research has identified certain overarching features of hypokinetic dysarthria
associated with Parkinson's Disease and found that it manifests differently between
individuals. Acoustic analysis has often been used to find correlates of perceptual
features for differential diagnosis. However, acoustic parameters that are robust for
differential diagnosis may not be sensitive to tracking speech changes. Previous
longitudinal studies have had limited sample sizes or variable lengths between data
collection. This study focused on using acoustic correlates of perceptual features to
identify acoustic markers able to track speech changes in people with Parkinson's
Disease (PwPD) over six months. The thesis presents how this study has addressed
limitations of previous studies to make a novel contribution to current knowledge.
Speech data was collected from 63 PwPD and 47 control speakers using an online
podcast software at two time points, six months apart (T1 and T2). Recordings of a
standard reading passage, minimal pairs, sustained phonation, and spontaneous speech
were collected. Perceptual severity ratings were given by two speech and language
therapists for T1 and T2, and acoustic parameters of voice, articulation and prosody
were investigated. Two analyses were conducted: a) to identify which acoustic
parameters can track perceptual speech changes over time and b) to identify which
acoustic parameters can track changes in speech intelligibility over time. An additional
attempt was made to identify if these parameters showed group differences for
differential diagnosis between PwPD and control speakers at T1 and T2.
Results showed that specific acoustic parameters in voice quality, articulation and
prosody could differentiate between PwPD and controls, or detect speech changes
between T1 and T2, but not both factors. However, specific acoustic parameters within
articulation could detect significant group and speech change differences across T1 and
T2. The thesis discusses these results, their implications, and the potential for future
studies.
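Analysis (a) above, testing whether an acoustic parameter tracks speech change between T1 and T2, reduces at its simplest to per-speaker difference scores computed on paired measurements. A minimal sketch with invented values (the real study applied statistical models over 63 PwPD and 47 control speakers):

```python
def per_speaker_change(t1, t2):
    """T2 - T1 difference score for each speaker (paired by index)."""
    return [b - a for a, b in zip(t1, t2)]

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical articulation rates (syllables/sec) for four speakers,
# measured six months apart.
t1_rates = [4.1, 3.8, 4.4, 3.9]
t2_rates = [3.7, 3.6, 4.3, 3.5]
changes = per_speaker_change(t1_rates, t2_rates)
mean_change = mean(changes)
```

Pairing each speaker with themselves removes between-speaker variability, which is what makes longitudinal difference scores more sensitive to change than cross-sectional group comparisons.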