
    Self-imitating Feedback Generation Using GAN for Computer-Assisted Pronunciation Training

    Self-imitating feedback is an effective and learner-friendly method for non-native learners in Computer-Assisted Pronunciation Training. Acoustic characteristics of native utterances are extracted, transplanted onto the learner's own speech input, and given back to the learner as corrective feedback. Previous work focused on speech conversion using prosodic transplantation techniques based on the PSOLA algorithm. Motivated by the visual differences found between spectrograms of native and non-native speech, we investigated applying a GAN to generate self-imitating feedback, exploiting the generator's ability acquired through adversarial training. Because this mapping is highly under-constrained, we also adopt a cycle consistency loss to encourage the output to preserve the global structure that native and non-native utterances share. Trained on 97,200 spectrogram images of short utterances produced by native and non-native speakers of Korean, the generator successfully transforms a non-native spectrogram input into a spectrogram with the properties of self-imitating feedback. Furthermore, the transformed spectrogram shows segmental corrections that cannot be obtained by prosodic transplantation. A perceptual test comparing the self-imitating and correcting abilities of our method with the baseline PSOLA method shows that the generative approach with cycle consistency loss is promising.
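    The cycle consistency loss mentioned in this abstract can be illustrated with a small sketch. It assumes the standard CycleGAN form, in which a spectrogram mapped to the other domain and back should match the original under an L1 distance; the abstract does not give the paper's exact loss, so this form is an assumption. The toy "generators" `double` and `halve` are hypothetical stand-ins for trained networks.

```python
from typing import Callable, List

Spectrogram = List[List[float]]  # rows: frequency bins, columns: time frames
Generator = Callable[[Spectrogram], Spectrogram]


def l1_distance(a: Spectrogram, b: Spectrogram) -> float:
    """Mean absolute difference between two equally sized spectrograms."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            total += abs(va - vb)
            count += 1
    return total / count


def cycle_consistency_loss(
    x: Spectrogram, y: Spectrogram, g: Generator, f: Generator
) -> float:
    """L1 cycle loss: x -> g -> f should return x, and y -> f -> g should return y.

    g maps non-native to native; f maps native to non-native (assumed roles).
    """
    return l1_distance(f(g(x)), x) + l1_distance(g(f(y)), y)


# Hypothetical toy generators, used only to exercise the loss.
def double(s: Spectrogram) -> Spectrogram:
    return [[2.0 * v for v in row] for row in s]


def halve(s: Spectrogram) -> Spectrogram:
    return [[0.5 * v for v in row] for row in s]


def identity(s: Spectrogram) -> Spectrogram:
    return [list(row) for row in s]


non_native = [[1.0, 2.0], [3.0, 4.0]]
native = [[2.0, 2.0], [2.0, 2.0]]

# Perfectly inverse mappings give zero loss; a mismatched pair is penalized.
print(cycle_consistency_loss(non_native, native, double, halve))
print(cycle_consistency_loss(non_native, native, double, identity))
```

    In this toy setting the loss is zero only when the two mappings undo each other, which is the constraint that keeps the generator from altering the global structure of the utterance.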

    Pronunciation Variation Analysis and CycleGAN-based Feedback Generation for CAPT

    Doctoral thesis, Seoul National University Graduate School, College of Humanities, Interdisciplinary Program in Cognitive Science, February 2020. 정민화.
    Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize the linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which makes it difficult to incorporate such knowledge into an automatic system. Moreover, most existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they depend on finding the right features for the task and on the accuracy of their extraction. This thesis presents a new approach to corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates of accentedness are analyzed and combined with a deep neural network approach, so that feature engineering effort is minimized while the linguistically important factors for the corrective feedback generation task are retained. Investigations of non-native Korean speech characteristics in contrast with those of native speakers, and of their correlation with accentedness judgements, show that both segmental and prosodic variations are important factors in a Korean CAPT system. The thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers of 27 mother tongue backgrounds.
    The features are learned automatically, in an unsupervised way, in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to the native speech distribution. To inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types defined in the first half of the thesis. The proposed approach generates a corrected version of the speech in the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.

    Interest in Korean as a foreign language has grown, the number of Korean learners has increased sharply, and research on Computer-Assisted Pronunciation Training (CAPT) applications that apply spoken language processing technology is being actively pursued. Nevertheless, existing Korean speaking-education systems do not make sufficient use of the linguistic characteristics of Korean as spoken by foreigners, nor do they apply recent language processing technology. Possible causes are that Korean speech produced by foreigners has not been analyzed sufficiently, and that, even where relevant research exists, more advanced research is needed to reflect it in an automated system. In addition, CAPT technology as a whole relies on feature extraction such as signal processing, prosodic analysis, and natural language processing techniques, so finding suitable features and extracting them accurately takes considerable time and effort. This suggests that this process, too, has ample room for improvement through recent deep-learning-based language processing technology. This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for CAPT system development. It contrasted variation patterns in the read speech of foreign speakers with those of native Korean speakers, identified the major variations, and then determined through correlation analysis how strongly each affects communication. The results confirmed that coda deletion, confusion of the three-way contrast, and suprasegmental errors should be given priority in feedback generation when they occur. Automatically generating corrected feedback is one of the important tasks of a CAPT system. This study regarded the task as a problem of changing the style of an utterance, and proposed modeling it in a Cycle-consistent Generative Adversarial Network (CycleGAN) structure. The generator of the GAN learns the mapping between the distribution of non-native speech and that of native speech, and the cycle consistency loss preserves the overall structure between utterances while preventing excessive correction. Because the required features are learned on their own, in an unsupervised manner, within the CycleGAN framework and with no separate feature extraction step, the method extends easily to other languages. The study proposed modeling the priorities among the major variations revealed by the linguistic analysis in an Auxiliary Classifier CycleGAN structure. This method incorporates knowledge into the existing CycleGAN, generating feedback speech while simultaneously classifying which type of error the feedback corresponds to. Its significance lies in the fact that domain knowledge is maintained, and remains controllable, through to the corrective feedback generation stage. To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 meaningful-word utterances from 217 speakers of 27 native languages, and a perceptual evaluation of whether and how much the speech improved was carried out. With the proposed method, speech can be converted to a corrected pronunciation while preserving the learner's own voice, and a relative improvement of 16.67% was confirmed over the traditional Pitch-Synchronous Overlap-and-Add (PSOLA) algorithm.

    Chapter 1. Introduction
    1.1. Motivation
    1.1.1. An Overview of CAPT Systems
    1.1.2. Survey of existing Korean CAPT Systems
    1.2. Problem Statement
    1.3. Thesis Structure
    Chapter 2. Pronunciation Analysis of Korean Produced by Chinese
    2.1. Comparison between Korean and Chinese
    2.1.1. Phonetic and Syllable Structure Comparisons
    2.1.2. Phonological Comparisons
    2.2. Related Works
    2.3. Proposed Analysis Method
    2.3.1. Corpus
    2.3.2. Transcribers and Agreement Rates
    2.4. Salient Pronunciation Variations
    2.4.1. Segmental Variation Patterns
    2.4.1.1. Discussions
    2.4.2. Phonological Variation Patterns
    2.4.1.2. Discussions
    2.5. Summary
    Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation
    3.1. Related Works
    3.1.1. Criteria used in L2 Speech
    3.1.2. Criteria used in L2 Korean Speech
    3.2. Proposed Human Evaluation Method
    3.2.1. Reading Prompt Design
    3.2.2. Evaluation Criteria Design
    3.2.3. Raters and Agreement Rates
    3.3. Linguistic Factors Affecting L2 Korean Accentedness
    3.3.1. Pearson's Correlation Analysis
    3.3.2. Discussions
    3.3.3. Implications for Automatic Feedback Generation
    3.4. Summary
    Chapter 4. Corrective Feedback Generation for CAPT
    4.1. Related Works
    4.1.1. Prosody Transplantation
    4.1.2. Recent Speech Conversion Methods
    4.1.3. Evaluation of Corrective Feedback
    4.2. Proposed Method: Corrective Feedback as a Style Transfer
    4.2.1. Speech Analysis at Spectral Domain
    4.2.2. Self-imitative Learning
    4.2.3. An Analogy: CAPT System and GAN Architecture
    4.3. Generative Adversarial Networks
    4.3.1. Conditional GAN
    4.3.2. CycleGAN
    4.4. Experiment
    4.4.1. Corpus
    4.4.2. Baseline Implementation
    4.4.3. Adversarial Training Implementation
    4.4.4. Spectrogram-to-Spectrogram Training
    4.5. Results and Evaluation
    4.5.1. Spectrogram Generation Results
    4.5.2. Perceptual Evaluation
    4.5.3. Discussions
    4.6. Summary
    Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation
    5.1. Linguistic Class Selection
    5.2. Auxiliary Classifier CycleGAN Design
    5.3. Experiment and Results
    5.3.1. Corpus
    5.3.2. Feature Annotations
    5.3.3. Experiment Setup
    5.3.4. Results
    5.4. Summary
    Chapter 6. Conclusion
    6.1. Thesis Results
    6.2. Thesis Contributions
    6.3. Recommendations for Future Work
    Bibliography
    Appendix
    Abstract in Korean
    Acknowledgments

    Attentive Learning of Sequential Handwriting Movements: A Neural Network Model

    Defense Advanced Research Projects Agency and the Office of Naval Research (N00014-95-1-0409, N00014-92-J-1309); National Science Foundation (IRI-97-20333); National Institutes of Health (I-R29-DC02952-01)

    Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

    Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer-aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches to inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and only a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and builds speaker-adapted articulatory models using weights derived from acoustics. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset of speakers with strong individual speaker-dependent inversion performance, the PRSW method attains kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that, given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data.
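    The idea of weighting reference speakers by acoustic similarity can be sketched roughly as follows. This is not the PRSW algorithm itself: the abstract only says that the weights are derived from acoustics, so the softmax over negative Euclidean distance, the `temperature` parameter, and all function names here are assumptions made for illustration. The Pearson correlation at the end is the kind of measure behind figures such as 0.62 versus 0.63.

```python
import math
from typing import List, Sequence


def acoustic_weights(
    target: Sequence[float],
    references: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Assumed weighting: softmax over negative acoustic distance to the target."""
    scores = [math.exp(-math.dist(target, ref) / temperature) for ref in references]
    total = sum(scores)
    return [s / total for s in scores]


def weighted_trajectory(
    weights: Sequence[float], trajectories: Sequence[Sequence[float]]
) -> List[float]:
    """Combine per-reference-speaker articulatory predictions frame by frame."""
    n_frames = len(trajectories[0])
    return [
        sum(w * traj[i] for w, traj in zip(weights, trajectories))
        for i in range(n_frames)
    ]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between a predicted and a measured trajectory."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical data: mean acoustic features of a target and two reference speakers.
target_features = [0.0, 0.0]
reference_features = [[0.0, 0.1], [5.0, 5.0]]
weights = acoustic_weights(target_features, reference_features)

# Hypothetical articulatory predictions from each reference speaker's model.
reference_predictions = [[0.0, 1.0, 2.0], [2.0, 2.0, 2.0]]
estimate = weighted_trajectory(weights, reference_predictions)
print(weights, estimate)
```

    Under these assumptions the acoustically closer reference speaker dominates the estimate, which reflects the stated hypothesis that acoustic and kinematic similarities are correlated.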