22 research outputs found

    Pronunciation Variation Analysis and CycleGAN-based Feedback Generation for CAPT

    Thesis (Ph.D.), Seoul National University Graduate School, College of Humanities, Interdisciplinary Program in Cognitive Science, February 2020. Minhwa Chung.

    Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize the linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which makes it difficult to incorporate such knowledge into an automatic system. Moreover, most existing methods rely on features extracted with signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they depend on finding the right features for the task and on extraction accuracy. This thesis presents a new approach to corrective feedback generation in a CAPT system, in which pronunciation variation patterns and the linguistic correlates of accentedness are analyzed and combined with a deep neural network approach, so that feature engineering effort is minimized while the linguistically important factors for corrective feedback generation are retained. Investigations of non-native Korean speech characteristics in contrast with those of native speakers, and of their correlation with accentedness judgments, show that both segmental and prosodic variations are important factors in a Korean CAPT system. The thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers of 27 mother tongue backgrounds. The features are learned automatically and without supervision in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to the native speech distribution. To inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types defined in the first half of the thesis. The proposed approach generates a corrected version of the speech in the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.

    Korean abstract (translated): Interest in Korean as a foreign language is rising and the number of Korean learners is growing sharply, and research on computer-assisted pronunciation training (CAPT) applications that apply spoken language processing is also active. Nevertheless, existing Korean speaking-instruction systems neither make sufficient use of the linguistic characteristics of non-native Korean speech nor apply recent language processing technology. Possible reasons are that Korean as spoken by non-native speakers has not been analyzed sufficiently, and that even where related research exists, more advanced work is needed before it can be reflected in an automated system. Moreover, CAPT technology as a whole depends on feature extraction through signal processing, prosodic analysis, and natural language processing, so finding suitable features and extracting them accurately takes considerable time and effort. This suggests that the process has much room for improvement through recent deep-learning-based language processing.

    This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for CAPT system development. It contrasted the variation patterns in read speech by non-native speakers with those of native Korean speakers, identified the salient variations, and then used correlation analysis to determine how strongly each affects communication. The results confirmed that coda deletion, confusion of the three-way contrast, and suprasegmental errors should be given priority in feedback generation.

    Automatically generating corrective feedback is one of the important tasks of a CAPT system. This study took the view that the task can be interpreted as a speech style transfer problem and proposed modeling it in a cycle-consistent generative adversarial network (CycleGAN) architecture. The generator learns the mapping between the distribution of non-native speech and that of native speech, and the cycle consistency loss preserves the overall structure of the utterance while preventing excessive correction. The necessary features are learned without supervision within the CycleGAN framework, with no separate feature extraction step, which makes the method easy to extend to other languages. The study further proposed modeling the priorities among the salient variations found in the linguistic analysis in an Auxiliary Classifier CycleGAN architecture. This method adds knowledge to the existing CycleGAN so that it generates feedback speech and, at the same time, classifies which type of error the feedback addresses. Its significance is that domain knowledge is retained and remains controllable up to the corrective feedback generation stage.

    To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 meaningful-word utterances by 217 speakers of 27 native languages, and a perceptual evaluation of whether and how much the speech improved was carried out. With the proposed method, speech can be converted to a corrected pronunciation while keeping the learner's own voice, and a relative improvement of 16.67% was confirmed over the conventional Pitch-Synchronous Overlap-and-Add algorithm.

    Table of contents:
    Chapter 1. Introduction: 1.1. Motivation; 1.1.1. An Overview of CAPT Systems; 1.1.2. Survey of existing Korean CAPT Systems; 1.2. Problem Statement; 1.3. Thesis Structure
    Chapter 2. Pronunciation Analysis of Korean Produced by Chinese: 2.1. Comparison between Korean and Chinese; 2.1.1. Phonetic and Syllable Structure Comparisons; 2.1.2. Phonological Comparisons; 2.2. Related Works; 2.3. Proposed Analysis Method; 2.3.1. Corpus; 2.3.2. Transcribers and Agreement Rates; 2.4. Salient Pronunciation Variations; 2.4.1. Segmental Variation Patterns; 2.4.1.1. Discussions; 2.4.2. Phonological Variation Patterns; 2.4.1.2. Discussions; 2.5. Summary
    Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation: 3.1. Related Works; 3.1.1. Criteria used in L2 Speech; 3.1.2. Criteria used in L2 Korean Speech; 3.2. Proposed Human Evaluation Method; 3.2.1. Reading Prompt Design; 3.2.2. Evaluation Criteria Design; 3.2.3. Raters and Agreement Rates; 3.3. Linguistic Factors Affecting L2 Korean Accentedness; 3.3.1. Pearson's Correlation Analysis; 3.3.2. Discussions; 3.3.3. Implications for Automatic Feedback Generation; 3.4. Summary
    Chapter 4. Corrective Feedback Generation for CAPT: 4.1. Related Works; 4.1.1. Prosody Transplantation; 4.1.2. Recent Speech Conversion Methods; 4.1.3. Evaluation of Corrective Feedback; 4.2. Proposed Method: Corrective Feedback as a Style Transfer; 4.2.1. Speech Analysis at Spectral Domain; 4.2.2. Self-imitative Learning; 4.2.3. An Analogy: CAPT System and GAN Architecture; 4.3. Generative Adversarial Networks; 4.3.1. Conditional GAN; 4.3.2. CycleGAN; 4.4. Experiment; 4.4.1. Corpus; 4.4.2. Baseline Implementation; 4.4.3. Adversarial Training Implementation; 4.4.4. Spectrogram-to-Spectrogram Training; 4.5. Results and Evaluation; 4.5.1. Spectrogram Generation Results; 4.5.2. Perceptual Evaluation; 4.5.3. Discussions; 4.6. Summary
    Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation: 5.1. Linguistic Class Selection; 5.2. Auxiliary Classifier CycleGAN Design; 5.3. Experiment and Results; 5.3.1. Corpus; 5.3.2. Feature Annotations; 5.3.3. Experiment Setup; 5.3.4. Results; 5.4. Summary
    Chapter 6. Conclusion: 6.1. Thesis Results; 6.2. Thesis Contributions; 6.3. Recommendations for Future Work
    Bibliography; Appendix; Abstract in Korean; Acknowledgments
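
The thesis abstract above relies on a cycle consistency loss to keep the structure of an utterance while mapping accented speech toward native speech. The following is a minimal sketch of that loss only, assuming the L1 form used in the standard CycleGAN formulation; the toy "generators" (`to_native`, `to_accented`, `to_accented_lossy`) and the random arrays standing in for spectrograms are hypothetical illustrations, not the thesis's networks or data.

```python
import numpy as np


def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle loss: mapping x to the other domain and back should return x,
    and likewise for y. g_xy maps domain X to Y, g_yx maps Y to X."""
    forward = np.mean(np.abs(g_yx(g_xy(x)) - x))
    backward = np.mean(np.abs(g_xy(g_yx(y)) - y))
    return forward + backward


# Hypothetical toy generators (assumptions for illustration only).
def to_native(spec):
    return 2.0 * spec + 1.0


def to_accented(spec):
    # Exact inverse of to_native, so a round trip recovers the input.
    return (spec - 1.0) / 2.0


def to_accented_lossy(spec):
    # Ignores its input, so a round trip cannot recover it.
    return np.zeros_like(spec)


rng = np.random.default_rng(0)
x = rng.random((4, 8))  # stand-in for accented-speech spectrogram patches
y = rng.random((4, 8))  # stand-in for native-speech spectrogram patches

loss_inverse = cycle_consistency_loss(x, y, to_native, to_accented)
loss_lossy = cycle_consistency_loss(x, y, to_native, to_accented_lossy)
```

A generator pair that inverts each other yields a loss near zero, while a generator that discards its input is penalized; in training this term is added to the adversarial losses, which this sketch leaves out.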

    Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors

    Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability. However, the scarcity and unreliable annotation of disordered child speech corpora, along with the high acoustic variability of child speech, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. Therefore, this thesis investigates two types of detection systems that can be built with minimal dependency on annotated mispronounced speech data. First, a novel approach is proposed that adopts paralinguistic features representing the prosodic, spectral, and voice quality characteristics of speech to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal. Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach is introduced to detect the pronunciation errors caused by SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods, where detection and diagnosis are performed at the phoneme level, the proposed method performs MDD at the speech attribute level, namely the manner and place of articulation. The speech attribute features describe the articulators involved and their interactions when making a speech sound, allowing a low-level description of the pronunciation error to be provided.
Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method that leverages the Multi-Task Learning (MTL) criterion and trains a separate model for each attribute, and an alignment-free, jointly learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion. The proposed techniques have been evaluated using standard, publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora.

    Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

    The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods that rely on analysing phonemes can only detect categorical errors of phonemes that have enough training data to be modelled. Given the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is infeasible to model all types of mispronunciations. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback to the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as the core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages. The proposed speech attribute MDD method was further compared to the traditional phoneme-level MDD and achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes than the phoneme-level equivalent.
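
The abstract above reports FAR, FRR, and DER without defining them. As a rough illustration, the sketch below computes the three rates under the definitions commonly used in the MDD literature, which is an assumption here since the abstract does not spell them out. The `MddCounts` class, the `mdd_rates` function, and the counts are all hypothetical; the numbers are made up and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class MddCounts:
    true_accept: int   # correct pronunciation, accepted by the system
    false_reject: int  # correct pronunciation, flagged as an error
    false_accept: int  # mispronunciation, accepted by the system
    correct_diag: int  # mispronunciation detected, error type diagnosed correctly
    diag_error: int    # mispronunciation detected, error type diagnosed wrongly


def mdd_rates(c: MddCounts) -> tuple[float, float, float]:
    """Return (FAR, FRR, DER) under the assumed definitions:
    FAR = missed mispronunciations / all mispronunciations,
    FRR = rejected correct pronunciations / all correct pronunciations,
    DER = wrong diagnoses / all detected mispronunciations."""
    true_reject = c.correct_diag + c.diag_error
    far = c.false_accept / (c.false_accept + true_reject)
    frr = c.false_reject / (c.false_reject + c.true_accept)
    der = c.diag_error / true_reject
    return far, frr, der


# Made-up counts, for illustration only.
counts = MddCounts(true_accept=900, false_reject=100,
                   false_accept=20, correct_diag=60, diag_error=20)
far, frr, der = mdd_rates(counts)
```

With these made-up counts, 20 of 100 mispronunciations are missed, 100 of 1,000 correct pronunciations are rejected, and 20 of 80 detected errors receive the wrong diagnosis.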

    Apraxia World: Deploying a Mobile Game and Automatic Speech Recognition for Independent Child Speech Therapy

    Children with speech sound disorders typically improve pronunciation quality by undergoing speech therapy, which must be delivered frequently and with high intensity to be effective. As such, clinic sessions are supplemented with home practice, often under caregiver supervision. However, traditional home practice can grow boring for children due to monotony. Furthermore, practice frequency is limited by caregiver availability, making it difficult for some children to reach therapy dosage. To address these issues, this dissertation presents a novel speech therapy game to increase engagement, and explores automatic pronunciation evaluation techniques to afford children independent practice. The therapy game, called Apraxia World, delivers customizable, repetition-based speech therapy while children play through platformer-style levels using typical on-screen tablet controls; children complete in-game speech exercises to collect assets required to progress through the levels. Additionally, Apraxia World provides pronunciation feedback according to an automated pronunciation evaluation system running locally on the tablet.
Apraxia World offers two advantages over current commercial and research speech therapy games: first, the game provides extended gameplay to support long therapy treatments; second, it affords some independence in therapy practice via automatic pronunciation evaluation, allowing caregivers to lightly supervise the practice instead of directly administering it. Pilot testing indicated that children enjoyed the game-based therapy much more than traditional practice and that the exercises did not interfere with gameplay. During a longitudinal study, children made clinically significant pronunciation improvements while playing Apraxia World at home. Furthermore, children remained engaged in the game-based therapy over the two-month testing period and some even wanted to continue playing post-study. The second part of the dissertation explores word- and phoneme-level pronunciation verification for child speech therapy applications. Word-level pronunciation verification is accomplished using a child-specific template-matching framework, where an utterance is compared against correctly and incorrectly pronounced examples of the word. This framework identified mispronounced words better than both a standard automated baseline and co-located caregivers. Phoneme-level mispronunciation detection is investigated using a technique from the second-language learning literature: training phoneme-specific classifiers with phonetic posterior features. This method also outperformed the standard baseline and, more notably, identified mispronunciations better than student clinicians.
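
The word-level verification described above compares an utterance against correctly and incorrectly pronounced templates of the same word. The abstract does not say how the comparison is made, so the sketch below assumes dynamic time warping (DTW) over frame-level feature vectors, a common choice for template matching. The functions `dtw_distance` and `verify` and the one-dimensional toy features are hypothetical, not the dissertation's implementation.

```python
import numpy as np


def dtw_distance(a, b):
    """Cumulative DTW cost between feature sequences a (n, d) and b (m, d),
    using Euclidean distance between frames."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def verify(utterance, correct_templates, incorrect_templates):
    """Accept the utterance if its nearest template is a correct one."""
    best_correct = min(dtw_distance(utterance, t) for t in correct_templates)
    best_incorrect = min(dtw_distance(utterance, t) for t in incorrect_templates)
    return best_correct <= best_incorrect


# Toy 1-D "features" (assumption: real systems would use acoustic features).
correct = [np.array([[0.0], [1.0], [2.0], [3.0]])]
incorrect = [np.array([[0.0], [3.0], [0.0], [3.0]])]

slow_but_correct = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]])
mispronounced = np.array([[0.0], [3.0], [0.0], [3.0], [3.0]])

accepted = verify(slow_but_correct, correct, incorrect)
rejected = not verify(mispronounced, correct, incorrect)
```

DTW lets a slower rendition of the correct template still align with zero cost, which is why the stretched utterance is accepted while the one that follows the incorrect template is rejected.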

    Computer analysis of children's non-native English speech for language learning and assessment

    ASR for children's speech is more challenging than for adults' speech, and it is more difficult still for non-native children's speech. This research investigates techniques to compensate for the effects of non-native and child speech on the performance of ASR systems. The study mainly uses hybrid DNN-HMM systems with conventional DNNs, LSTMs, and more advanced TDNN models. This work uses the CALL-ST corpus and the TLT-school corpus to study children's non-native English speech. Initially, data augmentation with the AMI corpus and the PF-STAR German corpus was explored on the CALL-ST corpus to address the lack of data. Feature selection and acoustic model adaptation and selection were also investigated on CALL-ST. More aspects of the ASR system, including pronunciation modelling, acoustic modelling, language modelling, and system fusion, were explored on the TLT-school corpus, as this corpus contains more data. Then, the relationships between the CALL-ST and TLT-school corpora were studied and used to improve ASR performance. The other part of the present work is text processing for non-native children's English speech. We focused on providing accept/reject feedback to learners based on the text generated by the ASR system from learners' spoken responses. A rule-based system and a machine learning-based system were proposed for making the judgement, and several aspects of the systems were evaluated. The influence of the ASR system on the text processing system was also explored.

    Speech Recognition

    Chapters in the first part of the book cover the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and applications that operate in real-world environments, such as mobile communication services and smart homes.

    Max-Planck-Institute for Psycholinguistics: Annual Report 2001
