Developing Speech Recognition and Synthesis Technologies to Support Computer-Aided Pronunciation Training for Chinese Learners of English
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Automatic Pronunciation Assessment -- A Review
Pronunciation assessment and its application in computer-aided pronunciation
training (CAPT) have seen impressive progress in recent years. With the rapid
growth in language processing and deep learning over the past few years, there
is a need for an updated review. In this paper, we review methods employed in
pronunciation assessment at both the phonemic and prosodic levels. We categorize the main
challenges observed in prominent research trends and highlight existing
limitations and available resources. This is followed by a discussion of the
remaining challenges and possible directions for future work.
Comment: 9 pages, accepted to EMNLP Finding
MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH
This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. Errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) were comprehensively detected and diagnosed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features both in revealing the significant contributors to mispronunciation and in improving the performance of MDD systems.
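The abstract above reports detection accuracy, diagnostic accuracy, and false rejection rate without saying how they were computed. The sketch below shows one commonly used hierarchical scheme for such metrics, which is an assumption here and not necessarily the authors' definition. The `MDDCounts` class, the `mdd_metrics` function, and all counts are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MDDCounts:
    """Phone-level outcome counts (illustrative names, not from the source)."""
    true_accept: int    # correct phones accepted as correct
    false_reject: int   # correct phones flagged as mispronounced
    false_accept: int   # mispronounced phones accepted as correct
    correct_diag: int   # mispronunciations detected and diagnosed correctly
    diag_error: int     # mispronunciations detected but diagnosed wrongly


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a category is empty.
    return numerator / denominator if denominator else 0.0


def mdd_metrics(c: MDDCounts) -> dict:
    # Every detected mispronunciation is either diagnosed correctly or not.
    true_reject = c.correct_diag + c.diag_error
    total = c.true_accept + c.false_reject + c.false_accept + true_reject
    return {
        "detection_accuracy": _ratio(c.true_accept + true_reject, total),
        "diagnostic_accuracy": _ratio(c.correct_diag, true_reject),
        "false_rejection_rate": _ratio(c.false_reject, c.true_accept + c.false_reject),
        "false_acceptance_rate": _ratio(c.false_accept, c.false_accept + true_reject),
    }


# Made-up counts, chosen only so the arithmetic is easy to follow.
counts = MDDCounts(true_accept=800, false_reject=100, false_accept=20,
                   correct_diag=60, diag_error=20)
metrics = mdd_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this scheme, diagnostic accuracy is measured only over phones that were correctly flagged as mispronounced, which is why it can be lower than detection accuracy.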
Multi-View Multi-Task Representation Learning for Mispronunciation Detection
The disparity in phonology between a learner's native (L1) and target (L2)
languages poses a significant challenge for mispronunciation detection and
diagnosis (MDD) systems. This challenge is further intensified by the lack of
annotated L2 data. This paper proposes a novel MDD architecture that exploits
multiple 'views' of the same input data, assisted by auxiliary tasks, to learn
more distinctive phonetic representations in a low-resource setting. Using
mono- and multilingual encoders, the model learns multiple views of the input
and captures the sound properties across diverse languages and accents. These
encoded representations are further enriched by learning articulatory features
in a multi-task setup. Our reported results using the L2-ARCTIC data
outperformed the SOTA models, with phoneme error rate reductions of 11.13% and
8.60% and absolute F1 score increases of 5.89% and 2.49% compared to the
single-view mono- and multilingual systems, respectively, with a limited L2 dataset.
Comment: 5 page
Pronunciation Variation Analysis and CycleGAN-Based Feedback Generation for CAPT
Thesis (Ph.D.)--Seoul National University Graduate School: College of Humanities, Interdisciplinary Program in Cognitive Science, 2020. 2. Minhwa Chung.
Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, the existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which may make it difficult to incorporate such knowledge into an automatic system. Moreover, most of the existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they necessarily depend on finding the right features for the task and on the accuracy of their extraction.
This thesis presents a new approach for corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates with accentedness are analyzed and combined with a deep neural network approach, so that feature engineering efforts are minimized while maintaining the linguistically important factors for the corrective feedback generation task. Investigations on non-native Korean speech characteristics in contrast with those of native speakers, and their correlation with accentedness judgement show that both segmental and prosodic variations are important factors in a Korean CAPT system.
The present thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers of 27 mother tongue backgrounds. The features are automatically learnt in an unsupervised way in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to native speech distributions. In order to inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types that were defined in the first half of the thesis. The proposed approach generates a corrected version of the speech using the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.
Interest in Korean as a foreign language has grown and the number of learners of Korean is rising sharply, and research on Computer-Assisted Pronunciation Training (CAPT) applications that apply speech and language processing technology is also being actively pursued. Nevertheless, existing Korean speaking-instruction systems neither make sufficient use of the linguistic characteristics of non-native Korean nor apply the latest language processing technology. Possible reasons are that non-native Korean speech has not been analyzed sufficiently, and that even where related studies exist, further advanced research is needed to reflect them in an automated system. In addition, CAPT technology as a whole depends on feature extraction through signal processing, prosodic analysis, and natural language processing techniques, so finding suitable features and extracting them accurately takes considerable time and effort. This suggests that there is much room to improve this process by using recent deep-learning-based language processing technology.
This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for CAPT system development. The read-speech variation patterns of non-native speakers were contrasted with those of native Korean speakers to identify the salient variations, and correlation analysis was then used to determine how strongly each affects communication. The results confirmed that coda deletion, confusion of the three-way contrast, and suprasegmental errors should be given priority in feedback generation.
Automatically generating corrected feedback is one of the important tasks of a CAPT system. This study takes the view that the task can be interpreted as a problem of changing the style of an utterance, and proposes modeling it with a Cycle-consistent Generative Adversarial Network (CycleGAN). The generator of the GAN learns a mapping between the distribution of non-native utterances and that of native utterances, and the cycle consistency loss preserves the overall structure of the utterance while preventing excessive correction. The necessary features are learned by the CycleGAN framework itself in an unsupervised manner, without a separate feature extraction step, which makes the method easy to extend to other languages.
The priorities among the salient variations revealed by the linguistic analysis are modeled in an Auxiliary Classifier CycleGAN architecture. This method incorporates knowledge into the existing CycleGAN so that it generates feedback speech and at the same time classifies which type of error the feedback addresses. Its significance is that domain knowledge is retained, and remains controllable, through to the corrective feedback generation stage.
To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 utterances of meaningful words by 217 speakers of 27 native languages, and a perceptual evaluation of whether and how much the pronunciation improved was carried out. With the proposed method, speech can be converted to a corrected pronunciation while preserving the learner's own voice, and a relative improvement of 16.67% was confirmed over the conventional Pitch-Synchronous Overlap-and-Add algorithm.
Chapter 1. Introduction
1.1. Motivation
1.1.1. An Overview of CAPT Systems
1.1.2. Survey of existing Korean CAPT Systems
1.2. Problem Statement
1.3. Thesis Structure
Chapter 2. Pronunciation Analysis of Korean Produced by Chinese
2.1. Comparison between Korean and Chinese
2.1.1. Phonetic and Syllable Structure Comparisons
2.1.2. Phonological Comparisons
2.2. Related Works
2.3. Proposed Analysis Method
2.3.1. Corpus
2.3.2. Transcribers and Agreement Rates
2.4. Salient Pronunciation Variations
2.4.1. Segmental Variation Patterns
2.4.1.1. Discussions
2.4.2. Phonological Variation Patterns
2.4.1.2. Discussions
2.5. Summary
Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation
3.1. Related Works
3.1.1. Criteria used in L2 Speech
3.1.2. Criteria used in L2 Korean Speech
3.2. Proposed Human Evaluation Method
3.2.1. Reading Prompt Design
3.2.2. Evaluation Criteria Design
3.2.3. Raters and Agreement Rates
3.3. Linguistic Factors Affecting L2 Korean Accentedness
3.3.1. Pearson's Correlation Analysis
3.3.2. Discussions
3.3.3. Implications for Automatic Feedback Generation
3.4. Summary
Chapter 4. Corrective Feedback Generation for CAPT
4.1. Related Works
4.1.1. Prosody Transplantation
4.1.2. Recent Speech Conversion Methods
4.1.3. Evaluation of Corrective Feedback
4.2. Proposed Method: Corrective Feedback as a Style Transfer
4.2.1. Speech Analysis at Spectral Domain
4.2.2. Self-imitative Learning
4.2.3. An Analogy: CAPT System and GAN Architecture
4.3. Generative Adversarial Networks
4.3.1. Conditional GAN
4.3.2. CycleGAN
4.4. Experiment
4.4.1. Corpus
4.4.2. Baseline Implementation
4.4.3. Adversarial Training Implementation
4.4.4. Spectrogram-to-Spectrogram Training
4.5. Results and Evaluation
4.5.1. Spectrogram Generation Results
4.5.2. Perceptual Evaluation
4.5.3. Discussions
4.6. Summary
Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation
5.1. Linguistic Class Selection
5.2. Auxiliary Classifier CycleGAN Design
5.3. Experiment and Results
5.3.1. Corpus
5.3.2. Feature Annotations
5.3.3. Experiment Setup
5.3.4. Results
5.4. Summary
Chapter 6. Conclusion
6.1. Thesis Results
6.2. Thesis Contributions
6.3. Recommendations for Future Work
Bibliography
Appendix
Abstract in Korean
Acknowledgments
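The thesis abstract above describes a generator that maps foreign-accented speech toward native speech distributions in a CycleGAN setting. As a rough illustration of the cycle-consistency idea only, the sketch below computes an L1 round-trip penalty on toy feature frames. It is not the thesis implementation: the generators are hypothetical hand-written functions standing in for trained networks, and the frames are made-up numbers, not spectrograms.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(to_native, to_accented, accented_frame, native_frame):
    """L1 cycle loss: each frame should survive a round trip through both generators."""
    forward = l1_distance(to_accented(to_native(accented_frame)), accented_frame)
    backward = l1_distance(to_native(to_accented(native_frame)), native_frame)
    return forward + backward


# Hypothetical stand-ins for trained generators (assumptions, for illustration).
def shift_up(frame):
    return [v + 0.5 for v in frame]


def shift_down(frame):
    return [v - 0.5 for v in frame]


def shift_down_badly(frame):
    # An inverse mapping that does not fully undo shift_up.
    return [v - 0.25 for v in frame]


accented = [0.0, 1.0, 2.0]
native = [0.5, 1.5, 2.5]

perfect = cycle_consistency_loss(shift_up, shift_down, accented, native)
imperfect = cycle_consistency_loss(shift_up, shift_down_badly, accented, native)
print(perfect, imperfect)
```

A pair of generators that invert each other incurs no penalty, while a pair that drifts on the round trip is penalized. In a full CycleGAN this term is added to the adversarial losses, which discourages the generator from changing more of the utterance than it can map back.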
Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method
The automatic identification and analysis of pronunciation errors, known as
Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer
Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning
or speech therapy applications. Existing MDD methods relying on analysing
phonemes can only detect categorical errors of phonemes that have an adequate
amount of training data to be modelled. With the unpredictable nature of the
pronunciation errors of non-native or disordered speakers and the scarcity of
training datasets, it is unfeasible to model all types of mispronunciations.
Moreover, phoneme-level MDD approaches have a limited ability to provide
detailed diagnostic information about the error made. In this paper, we propose
a low-level MDD approach based on the detection of speech attribute features.
Speech attribute features break down phoneme production into elementary
components that are directly related to the articulatory system leading to more
formative feedback to the learner. We further propose a multi-label variant of
the Connectionist Temporal Classification (CTC) approach to jointly model the
non-mutually exclusive speech attributes using a single model. The pre-trained
wav2vec2 model was employed as a core model for the speech attribute detector.
The proposed method was applied to L2 speech corpora collected from English
learners with different native languages. The proposed speech attribute MDD
method was further compared to the traditional phoneme-level MDD and achieved a
significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR),
and Diagnostic Error Rate (DER) over all speech attributes compared to the
phoneme-level equivalent.
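To illustrate why attribute-level comparison can give more formative feedback than a phoneme-level mismatch, the sketch below compares the attributes of an expected phoneme with those of the phoneme actually produced. It is not the paper's wav2vec2 or multi-label CTC model. The small attribute table and the `diagnose` function are hypothetical, and in a real system the produced attributes would come from a trained detector, not a lookup.

```python
# A tiny, hand-written attribute inventory (illustrative, far from complete).
ATTRIBUTES = {
    "p": {"bilabial", "stop"},
    "b": {"bilabial", "stop", "voiced"},
    "t": {"alveolar", "stop"},
    "d": {"alveolar", "stop", "voiced"},
    "s": {"alveolar", "fricative"},
    "z": {"alveolar", "fricative", "voiced"},
    "th": {"dental", "fricative"},
}


def diagnose(expected: str, produced: str) -> dict:
    """Report which attributes the learner missed and which were added."""
    target = ATTRIBUTES[expected]
    actual = ATTRIBUTES[produced]
    return {
        "missing": sorted(target - actual),
        "unexpected": sorted(actual - target),
    }


# "th" realized as "s": the manner is right, only the place is off.
th_as_s = diagnose("th", "s")
# "z" realized as "s": only voicing is missing.
z_as_s = diagnose("z", "s")
print(th_as_s)
print(z_as_s)
```

A phoneme-level system would report both cases simply as substitutions, whereas the attribute view points at place of articulation in the first case and voicing in the second.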
Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech
Detecting individual pronunciation errors and diagnosing pronunciation error tendencies in a language learner based on their speech are important components of computer-aided language learning (CALL). Both tasks become particularly challenging when the speech in question is spontaneous, and human annotation of pronunciation errors is itself inconsistent. This paper presents an approach to these tasks by distinguishing between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.
Methods for pronunciation assessment in computer aided language learning
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 149-176). Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge including words, grammar, gestures, sounds, etc. Mastering these skills requires extensive practice by the learner, and opportunities may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced wherever and whenever a student desires. These systems often have several technologies to identify the different types of errors made by a student. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system, it is a task- or game-oriented activity, the student should not be interrupted by the pronunciation feedback system, and the goal of the feedback system is to identify severe mispronunciations with high reliability. Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert phonetician to both phonetically transcribe and assign judgements of quality to each phone in a corpus. This is time consuming and expensive. It also places an extra burden on the transcriber. We describe a novel method for obtaining phone level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word level judgements of pronunciation. Foreign language learners typically exhibit high variation and pronunciation shapes distinct from those of native speakers, which makes analysis for mispronunciation difficult.
We detail a simple but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. This vowel space transformation is exploited to train a mispronunciation detector using a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in a more robust and accurate identification of mispronunciations than traditional acoustic models. by Mitchell A. Peabody. Ph.D