472 research outputs found

    Automatic Detection and Assessment of Dysarthric Speech Using Prosodic Information

    Master's thesis -- Seoul National University Graduate School, Department of Linguistics, College of Humanities, 2020. 8. Minhwa Chung. One of the earliest cues for neurological or degenerative disorders is speech impairment. Individuals with Parkinson's Disease, Cerebral Palsy, Amyotrophic Lateral Sclerosis, and Multiple Sclerosis, among others, are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles, which eventually leads to severe misarticulation.
However, impairments in the suprasegmental domain are also present, and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosody-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to determine not only how the prosody of speech is affected by dysarthria, but also which aspects of prosody are more affected and how prosodic impairments change with the severity of dysarthria. In the current study, several prosodic features related to pitch, voice quality, rhythm and speech rate are used as features for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features is optimal for accurate detection. After selecting an optimal set of prosodic features, we use them as input to machine learning-based classifiers and assess performance using four evaluation metrics: accuracy, precision, recall and F1-score. Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g. mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers where both Korean and English data are used for training but only one language is used for testing. Results suggest that, in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both the Korean and English datasets. In particular, large improvements were seen when assessing different severity levels. For English, relative accuracy improvements of 1.82% for detection and 20.6% for assessment were seen. The Korean dataset saw no improvement for detection but a relative improvement of 13.6% for assessment. The results from cross-language experiments showed a relative improvement of up to 4.12% in comparison to using only a single language during training. It was found that certain prosodic impairments, such as those in pitch and duration, may be language independent. Therefore, when training sets for individual languages are limited, they may be supplemented by including data from other languages.
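The pipeline this thesis describes (prosodic feature extraction, feature selection, a machine-learning classifier, and accuracy/precision/recall/F1 evaluation) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes librosa and scikit-learn are available, uses a handful of hypothetical pitch and timing descriptors in place of the full feature set, and trains on a synthetic toy matrix rather than the TORGO or QoLT corpora.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def prosodic_features(wav_path):
    """A few pitch- and timing-related descriptors for one utterance (illustrative set only)."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    return np.array([
        f0.mean(), f0.std(), f0.max() - f0.min(),  # pitch level, variability, range
        voiced.mean(),                             # proportion of voiced frames
        len(y) / sr,                               # utterance duration (crude rate proxy)
    ])

# Toy feature matrix standing in for features extracted from real corpora;
# labels: 1 = dysarthric, 0 = healthy control (synthetic, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(SelectKBest(f_classif, k=3),          # simple feature selection step
                    RandomForestClassifier(random_state=0))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("f1", f1_score)]:
    print(name, round(fn(y_te, pred), 3))
```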

    A Sound Approach to Language Matters: In Honor of Ocke-Schwen Bohn

    The contributions in this Festschrift were written by Ocke's current and former PhD students, colleagues and research collaborators. The Festschrift is divided into six sections, moving from the smallest building blocks of language, through gradually expanding objects of linguistic inquiry, to the highest levels of description - all of which have formed a part of Ocke's career, in connection with his teaching and/or his academic productions: "Segments", "Perception of Accent", "Between Sounds and Graphemes", "Prosody", "Morphology and Syntax" and "Second Language Acquisition". Each one of these illustrates a sound approach to language matters.

    Prosody generation for text-to-speech synthesis

    The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame by frame. This approach leads to overly smooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, we propose a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. Later, we investigate the benefits of including long-range dependencies in duration prediction at frame level using uni-directional recurrent neural networks. Since prosody is a supra-segmental property, we consider an alternative approach to intonation generation which exploits long-term dependencies of F0 by effective modelling of linguistic features using recurrent neural networks. For this purpose, we propose a hierarchical encoder-decoder and a multi-resolution parallel encoder, where the encoder takes word- and higher-level linguistic features as input and upsamples them to phone level through a series of hidden layers; this is integrated into a hybrid system which was then submitted to the Blizzard Challenge workshop. We then highlight some of the issues in current approaches, and a plan for future directions of investigation is outlined along with ongoing work.
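The per-syllable pitch-contour templates described above can, in the simplest case, be learned by clustering length-normalised syllable F0 contours, so that the RNN only has to predict a template index per syllable. The sketch below is an assumption-laden illustration of that template-learning step (synthetic contours, scikit-learn KMeans, an arbitrary bank of 8 templates), not the system built in the thesis, and it leaves the RNN/CTC prediction stage out entirely.

```python
import numpy as np
from sklearn.cluster import KMeans

def to_template_length(f0_syllable, length=20):
    """Resample one syllable's F0 contour (Hz) to a fixed number of points."""
    x_old = np.linspace(0.0, 1.0, num=len(f0_syllable))
    x_new = np.linspace(0.0, 1.0, num=length)
    return np.interp(x_new, x_old, f0_syllable)

# Toy corpus: synthetic per-syllable contours (rises, falls, flats) standing
# in for F0 tracks extracted from real training speech.
rng = np.random.default_rng(0)
contours = []
for _ in range(300):
    n = rng.integers(8, 40)                      # syllables vary in length
    shape = rng.choice(["rise", "fall", "flat"])
    base = {"rise": np.linspace(-1, 1, n),
            "fall": np.linspace(1, -1, n),
            "flat": np.zeros(n)}[shape]
    contours.append(to_template_length(120 + 30 * base + rng.normal(0, 3, n)))

X = np.vstack(contours)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
templates = kmeans.cluster_centers_              # the learned pitch-contour templates

# At synthesis time a sequence model would predict one template index per
# syllable; the contour is then read off the template bank, e.g.:
predicted_indices = [2, 5, 0]                    # hypothetical predictions
utterance_f0 = np.concatenate([templates[i] for i in predicted_indices])
print(templates.shape, utterance_f0.shape)
```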

    Intonation Modelling for Speech Synthesis and Emphasis Preservation

    Speech-to-speech translation is a framework which recognises speech in an input language, translates it to a target language and synthesises speech in this target language. In such a system, variations in the speech signal which are inherent to natural human speech are lost, as the information goes through the different building blocks of the translation process. The work presented in this thesis addresses aspects of speech synthesis which are lost in traditional speech-to-speech translation approaches. The main research axis of this thesis is the study of prosody for speech synthesis and emphasis preservation. A first investigation of regional accents of spoken French is carried out to understand the sensitivity of native listeners with respect to accented speech synthesis. Listening tests show that standard adaptation methods for speech synthesis are not sufficient for listeners to perceive accentedness. On the other hand, combining adaptation with original prosody allows perception of accents. Addressing the need for a more suitable prosody model, a physiologically plausible intonation model is proposed. Inspired by the command-response model, it has basic components which can be related to muscle responses to nerve impulses. These components are assumed to be a representation of muscle control of the vocal folds. A motivation for such a model is its theoretical language independence, based on the fact that humans share the same vocal apparatus. An automatic parameter extraction method which integrates a perceptually relevant measure is proposed with the model. This approach is evaluated and compared with the standard command-response model. Two corpora including sentences with emphasised words are presented, in the context of the SIWIS project. The first is a multilingual corpus with speech from multiple speakers; the second is a high-quality, speech-synthesis-oriented corpus from a professional speaker. Two broad uses of the model are evaluated. The first shows that it is difficult to predict model parameters; however, the second shows that parameters can be transferred in the context of emphasis synthesis. A relation between model parameters and linguistic features such as stress and accent is demonstrated. Similar observations are made between the parameters and emphasis. Following this, we investigate the extraction of atoms in emphasised speech and their transfer into neutral speech, which turns out to elicit emphasis perception. Using clustering methods, this is extended to the emphasis of other words, using linguistic context. This approach is validated by listening tests for English.
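For reference, the standard command-response (Fujisaki-style) formulation that the proposed model is inspired by superposes phrase and accent responses on a log-F0 baseline: ln F0(t) = ln Fb + sum_i Ap_i Gp(t - T0_i) + sum_j Aa_j [Ga(t - T1_j) - Ga(t - T2_j)]. Below is a minimal numpy sketch of that textbook formulation; the command times, amplitudes, and time constants are purely illustrative and are not parameters from the thesis.

```python
import numpy as np

def phrase_response(t, alpha=2.0):
    """Phrase control: impulse response Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0."""
    tc = np.clip(t, 0, None)
    return np.where(t >= 0, alpha**2 * tc * np.exp(-alpha * tc), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent control: step response Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tc = np.clip(t, 0, None)
    g = 1.0 - (1.0 + beta * tc) * np.exp(-beta * tc)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

# Illustrative parameters: one phrase command and two accent commands.
fs, dur = 100, 3.0                   # 100 frames/s, 3-second utterance
t = np.arange(0, dur, 1.0 / fs)
Fb = 80.0                            # speaker baseline F0 in Hz (assumed value)
phrase_cmds = [(0.0, 0.5)]           # (onset time T0, amplitude Ap)
accent_cmds = [(0.4, 1.0, 0.3),      # (onset T1, offset T2, amplitude Aa)
               (1.6, 2.2, 0.2)]

log_f0 = np.log(Fb) * np.ones_like(t)
for T0, Ap in phrase_cmds:
    log_f0 += Ap * phrase_response(t - T0)
for T1, T2, Aa in accent_cmds:
    log_f0 += Aa * (accent_response(t - T1) - accent_response(t - T2))

f0 = np.exp(log_f0)                  # smooth, physiologically plausible contour
print(f0.min().round(1), f0.max().round(1))
```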

    Synthesising prosody with insufficient context

    Prosody is a key component in human spoken communication, signalling emotion, attitude, information structure, intention, and other communicative functions through perceived variation in intonation, loudness, timing, and voice quality. However, the prosody in text-to-speech (TTS) systems is often monotonous and adds no additional meaning to the text. Synthesising prosody is difficult for several reasons: I focus on three challenges. First, prosody is embedded in the speech signal, making it hard to model with machine learning. Second, there is no clear orthography for prosody, meaning it is underspecified in the input text and making it difficult to directly control. Third, and most importantly, prosody is determined by the context of a speech act, which TTS systems do not, and will never, have complete access to. Without the context, we cannot say if prosody is appropriate or inappropriate. Context is wide ranging, but state-of-the-art TTS acoustic models only have access to phonetic information and limited structural information. Unfortunately, most context is either difficult, expensive, or impossible to collect. Thus, fully specified prosodic context will never exist. Given there is insufficient context, prosody synthesis is a one-to-many generative task: it necessitates the ability to produce multiple renditions. To provide this ability, I propose methods for prosody control in TTS, using either explicit prosody features, such as F0 and duration, or learnt prosody representations disentangled from the acoustics. I demonstrate that without control of the prosodic variability in speech, TTS will produce average prosody, i.e. flat and monotonous prosody. This thesis explores different options for operating these control mechanisms. Random sampling of a learnt distribution of prosody produces more varied and realistic prosody. Alternatively, a human-in-the-loop can operate the control mechanism, using their intuition to choose appropriate prosody. To improve the effectiveness of human-driven control, I design two novel approaches to make control mechanisms more human interpretable. Finally, it is important to take advantage of additional context as it becomes available. I present a novel framework that can incorporate arbitrary additional context, and demonstrate my state-of-the-art context-aware model of prosody using a pre-trained and fine-tuned language model. This thesis demonstrates empirically that appropriate prosody can be synthesised with insufficient context by accounting for unexplained prosodic variation.
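The one-to-many argument can be made concrete with a toy example: if prosody for a sentence is represented by a learnt distribution over a few descriptors, deterministic prediction collapses to the mean (average, monotonous prosody), whereas sampling the distribution yields several distinct renditions. The descriptors, numbers, and Gaussian form below are invented for illustration and are not taken from the thesis's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learnt distribution over two per-utterance prosody descriptors
# (e.g. mean F0 in Hz and speech rate in syllables/s) for one sentence.
mu = np.array([180.0, 4.5])
cov = np.array([[400.0, -5.0],
                [-5.0, 0.25]])

# Deterministic regression collapses to the mean: every rendition is identical
# ("average prosody"), which listeners hear as flat and monotonous.
average_rendition = mu

# Sampling the distribution yields multiple plausible renditions of the same
# text - the one-to-many behaviour the thesis argues prosody synthesis needs.
sampled_renditions = rng.multivariate_normal(mu, cov, size=5)

print("average:", average_rendition)
print("sampled:\n", sampled_renditions.round(1))
```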

    Three-dimensional point-cloud room model in room acoustics simulations


    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation has introduced a modified neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate more dysarthric speech covering a broader range. To evaluate their effectiveness for the synthesis of training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall, results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on dysarthric ASR systems.
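One simple way to add a severity control to a multi-speaker TTS model is to concatenate a scalar severity coefficient (or continuous RLT-style parameters) with the speaker embedding before conditioning the acoustic decoder. The PyTorch module below is a hypothetical sketch of that conditioning idea, not the dissertation's architecture; the module name, dimensions, and the 0-1 severity scale are all assumptions.

```python
import torch
import torch.nn as nn

class SeverityConditionedEncoder(nn.Module):
    """Toy conditioning module: concatenates a speaker embedding with a
    scalar dysarthria-severity coefficient (and optional extra controls)
    before feeding a downstream TTS decoder. Illustrative only."""
    def __init__(self, n_speakers=10, spk_dim=64, n_controls=1, out_dim=128):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.proj = nn.Linear(spk_dim + n_controls, out_dim)

    def forward(self, speaker_id, controls):
        # speaker_id: (B,) long tensor; controls: (B, n_controls) floats in [0, 1]
        cond = torch.cat([self.spk_emb(speaker_id), controls], dim=-1)
        return torch.relu(self.proj(cond))

enc = SeverityConditionedEncoder()
spk = torch.tensor([3])                 # hypothetical speaker index
severity = torch.tensor([[0.7]])        # 0 = healthy, 1 = severe (assumed scale)
print(enc(spk, severity).shape)         # torch.Size([1, 128])
```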

    Investigating the build-up of precedence effect using reflection masking

    The auditory processing level involved in the build-up of precedence [Freyman et al., J. Acoust. Soc. Am. 90, 874-884 (1991)] has been investigated here by employing reflection masked threshold (RMT) techniques. Given that RMT techniques are generally assumed to address lower levels of auditory signal processing, such an approach represents a bottom-up approach to the buildup of precedence. Three conditioner configurations measuring a possible buildup of reflection suppression were compared to the baseline RMT for four reflection delays ranging from 2.5 to 15 ms. No buildup of reflection suppression was observed for any of the conditioner configurations. Buildup of template (a decrease in RMT for two of the conditioners), on the other hand, was found to be delay dependent. For five of six listeners, with reflection delays of 2.5 and 15 ms, RMT decreased relative to the baseline. For 5- and 10-ms delays, no change in threshold was observed. It is concluded that the low-level auditory processing involved in RMT is not sufficient to realize a buildup of reflection suppression. This confirms suggestions that higher-level processing is involved in precedence-effect buildup. The observed enhancement of reflection detection (RMT) may contribute to active suppression at higher processing levels.
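As a rough illustration of the stimuli involved, a test reflection can be modelled as a delayed, attenuated copy of the direct sound, with the delay swept over the 2.5-15 ms range used in the study. The numpy sketch below constructs such a lead-lag pair; the click shape, reflection gain, and sample rate are arbitrary choices, and it does not reproduce the conditioner configurations or the masking procedure itself.

```python
import numpy as np

def click_with_reflection(sr=48000, delay_ms=5.0, reflection_gain_db=-3.0,
                          click_len=32, total_ms=50.0):
    """Build a direct click followed by a single delayed 'reflection' copy,
    the kind of lead-lag stimulus used in precedence-effect experiments.
    All parameter values here are illustrative, not those of the study."""
    n = int(sr * total_ms / 1000)
    sig = np.zeros(n)
    click = np.hanning(click_len)              # simple band-limited click
    sig[:click_len] += click                   # direct sound
    d = int(sr * delay_ms / 1000)              # reflection delay in samples
    g = 10 ** (reflection_gain_db / 20)
    sig[d:d + click_len] += g * click          # delayed, attenuated reflection
    return sig

for delay in (2.5, 5.0, 10.0, 15.0):           # the delay range examined above
    s = click_with_reflection(delay_ms=delay)
    print(delay, "ms ->", len(s), "samples")
```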

    Culture Clubs: Processing Speech by Deriving and Exploiting Linguistic Subcultures

    Spoken language understanding systems are error-prone for several reasons, including individual speech variability. This is manifested in many ways, among which are differences in pronunciation, lexical inventory, grammar and disfluencies. There is, however, a lot of evidence pointing to stable language usage within subgroups of a language population. We call these subgroups linguistic subcultures. Two broad problems are defined and a survey of the work in this space is performed. The two broad problems are: linguistic subculture detection, commonly performed via Language Identification, Accent Identification or Dialect Identification approaches; and speech and language processing tasks which may see increases in performance by modeling each linguistic subculture. The data used in the experiments are drawn from four corpora: Accents of the British Isles (ABI), Intonational Variation in English (IViE), the NIST Language Recognition Evaluation Plan (LRE15) and Switchboard. The speakers in the corpora come from different parts of the United Kingdom and the United States and were provided different stimuli. From the speech samples, two feature sets are used in the experiments. A number of experiments to determine linguistic subcultures are conducted. The set of experiments covers a number of approaches, including traditional machine learning approaches shown to be effective for similar tasks in the past, each with multiple feature sets. State-of-the-art deep learning approaches are also applied to this problem. Two large automatic speech recognition (ASR) experiments are performed against all three corpora: one monolithic experiment for all the speakers in each corpus, and another for the speakers in groups according to their identified linguistic subcultures. For the discourse markers labeled in the Switchboard corpus, there are some interesting trends when examined through the lens of the speakers in their linguistic subcultures. Two large dialogue act experiments are performed against the labeled portion of the Switchboard corpus: one monocultural (or monolithic) experiment for all the speakers in the corpus, and another for the speakers in groups according to their identified linguistic subcultures. We conclude by discussing applications of this work, the changing landscape of natural language processing and suggestions for future research.
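The core experimental contrast (one monolithic model for all speakers versus one model per identified linguistic subculture) can be illustrated with a toy classification task. The sketch below uses synthetic features and scikit-learn in place of the ASR and dialogue-act systems and the corpora named above; it only demonstrates the train-and-evaluate-per-group pattern, with all data, labels, and group structure invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for acoustic feature vectors with a task label y and a
# linguistic-subculture label g (e.g. an identified accent group).
n, dim = 3000, 20
g = rng.integers(0, 3, size=n)                    # 3 hypothetical subcultures
X = rng.normal(size=(n, dim)) + g[:, None] * 0.5  # group-dependent feature shift
w = rng.normal(size=(3, dim))                     # group-specific decision rules
y = ((X * w[g]).sum(axis=1) > 0).astype(int)

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, g, test_size=0.3, random_state=0)

# Monolithic model: one classifier for all speakers.
mono = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_mono = accuracy_score(y_te, mono.predict(X_te))

# Per-subculture models: one classifier per identified group.
preds = np.empty_like(y_te)
for grp in np.unique(g):
    m = LogisticRegression(max_iter=1000).fit(X_tr[g_tr == grp], y_tr[g_tr == grp])
    preds[g_te == grp] = m.predict(X_te[g_te == grp])
acc_group = accuracy_score(y_te, preds)

print(f"monolithic: {acc_mono:.3f}  per-subculture: {acc_group:.3f}")
```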

    Speech Based Machine Learning Models for Emotional State Recognition and PTSD Detection

    Recognition of emotional state and diagnosis of trauma-related illnesses such as post-traumatic stress disorder (PTSD) using speech signals have been active research topics over the past decade. A typical emotion recognition system consists of three components: speech segmentation, feature extraction and emotion identification. Various speech features have been developed for emotional state recognition, which can be divided into three categories, namely excitation, vocal tract and prosodic. However, the capabilities of different feature categories and advanced machine learning techniques have not been fully explored for emotion recognition and PTSD diagnosis. For PTSD assessment, clinical diagnosis through structured interviews is a widely accepted means of diagnosis, but patients are often embarrassed to get diagnosed at clinics. The speech signal based system is a recently developed alternative. Unfortunately, PTSD speech corpora are limited in size, which presents difficulties in training complex diagnostic models. This dissertation proposes sparse coding methods and deep belief network models for emotional state identification and PTSD diagnosis. It also includes an additional transfer learning strategy for PTSD diagnosis. Deep belief networks are complex models that cannot work with small data like the PTSD speech database. Thus, a transfer learning strategy was adopted to mitigate the small data problem. Transfer learning aims to extract knowledge from one or more source tasks and apply the knowledge to a target task with the intention of improving the learning. It has proved to be useful when the target task has limited high-quality training data. We evaluated the proposed methods on the Speech Under Simulated and Actual Stress (SUSAS) database for emotional state recognition and on two PTSD speech databases for PTSD diagnosis. Experimental results and statistical tests showed that the proposed models outperformed most state-of-the-art methods in the literature and are potentially efficient models for emotional state recognition and PTSD diagnosis.
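One way to combine the sparse-coding and transfer-learning ideas mentioned above is to learn a dictionary on a larger source corpus and reuse it to encode the small PTSD set before classification. The sketch below is a hedged illustration using scikit-learn's DictionaryLearning on synthetic data; the feature dimensions, component counts, and labels are all invented, and this is not the dissertation's model (which uses deep belief networks).

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Toy stand-ins: a larger unlabelled "source" emotion corpus and a small
# "target" PTSD corpus sharing the same acoustic feature space.
X_source = rng.normal(size=(600, 24))
X_ptsd = rng.normal(size=(80, 24))
y_ptsd = rng.integers(0, 2, size=80)          # 1 = PTSD, 0 = control (toy labels)

# Transfer step: learn a sparse dictionary on the plentiful source data ...
dico = DictionaryLearning(n_components=32, transform_algorithm="lasso_lars",
                          transform_alpha=0.1, max_iter=20, random_state=0)
dico.fit(X_source)

# ... then encode the small PTSD set with that fixed dictionary and train a
# simple diagnostic classifier on the sparse codes.
codes = dico.transform(X_ptsd)
X_tr, X_te, y_tr, y_te = train_test_split(codes, y_ptsd, test_size=0.3,
                                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1 on held-out toy PTSD data:", round(f1_score(y_te, clf.predict(X_te)), 3))
```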