15 research outputs found
Automatic Pronunciation Assessment -- A Review
Pronunciation assessment and its application in computer-aided pronunciation
training (CAPT) have seen impressive progress in recent years. With the rapid
growth in language processing and deep learning over the past few years, there
is a need for an updated review. In this paper, we review methods employed in
pronunciation assessment for both phonemic and prosodic aspects. We categorize the main
challenges observed in prominent research trends and highlight existing
limitations and available resources. This is followed by a discussion of the
remaining challenges and possible directions for future work.
Comment: 9 pages, accepted to EMNLP Finding
A comparison-based approach to mispronunciation detection
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 89-92).
This thesis focuses on the problem of detecting word-level mispronunciations in nonnative speech. Conventional automatic speech recognition-based mispronunciation detection systems have the disadvantage of requiring a large amount of language-specific, annotated training data. Some systems even require a speech recognizer in the target language and another one in the students' native language. To reduce human labeling effort and to generalize across all languages, we propose a comparison-based framework which only requires word-level timing information from the native training data. With the assumption that the student is trying to enunciate the given script, dynamic time warping (DTW) is carried out between a student's utterance (nonnative speech) and a teacher's utterance (native speech), and we focus on detecting misalignment in the warping path and the distance matrix. The first stage of the system locates word boundaries in the nonnative utterance. To handle the problem that nonnative speech often contains intra-word pauses, we run DTW with a silence model, which aligns the two utterances while detecting and removing silences at the same time. In order to segment each word into smaller, acoustically similar units for a finer-grained analysis, we develop a phoneme-like unit segmentor which works by segmenting the self-similarity matrix into low-distance regions along the diagonal. Both phone-level and word-level features that describe the degree of misalignment between the two utterances are extracted, and the problem is formulated as a classification task. SVM classifiers are trained, and three voting schemes are considered for the cases where there is more than one matching reference utterance.
The system is evaluated on the Chinese University Chinese Learners of English (CU-CHLOE) corpus, and the TIMIT corpus is used as the native corpus. Experimental results have shown 1) the effectiveness of the silence model in guiding DTW to capture the word boundaries in nonnative speech more accurately, 2) the complementary performance of the word-level and the phone-level features, and 3) the stable performance of the system with or without phonetic unit labeling.
by Ann Lee. S.M
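The DTW step described in this abstract can be illustrated with a minimal sketch. It assumes each utterance is already a list of equal-length feature vectors and uses plain Euclidean frame distance; the function name `dtw` is hypothetical, and the thesis's silence model and segmentor are not reproduced here.

```python
def dtw(student, teacher):
    """Return (total_cost, warping_path) for two sequences of feature vectors."""
    def dist(a, b):
        # Euclidean distance between two equal-length feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(student), len(teacher)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning student[:i] with teacher[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(student[i - 1], teacher[j - 1])
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the end of both sequences to recover the warping path
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(acc[i - 1][j - 1], i - 1, j - 1),
                 (acc[i - 1][j], i - 1, j),
                 (acc[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return acc[n][m], path


# Toy one-dimensional "frames": the student holds the first sound one frame longer
student = [[0.0], [0.0], [1.0], [2.0]]
teacher = [[0.0], [1.0], [2.0]]
cost, path = dtw(student, teacher)
```

A path that stays on one teacher frame for several student frames, as the first two entries do here, is the kind of departure from the diagonal that a misalignment feature could measure.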
Pronunciation Variation Analysis and CycleGAN-based Feedback Generation for CAPT
Thesis (Ph.D.)--Seoul National University Graduate School: College of Humanities, Interdisciplinary Program in Cognitive Science, 2020. 2. Minhwa Chung.
Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, the existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which may make it difficult to incorporate such knowledge into an automatic system. Moreover, most of the existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they depend on finding the right features for the task and on the accuracy of their extraction.
This thesis presents a new approach for corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates with accentedness are analyzed and combined with a deep neural network approach, so that feature engineering efforts are minimized while maintaining the linguistically important factors for the corrective feedback generation task. Investigations on non-native Korean speech characteristics in contrast with those of native speakers, and their correlation with accentedness judgement show that both segmental and prosodic variations are important factors in a Korean CAPT system.
The present thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers of 27 mother tongue backgrounds. The features are automatically learnt in an unsupervised way in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to native speech distributions. In order to inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types that were defined in the first half of the thesis. The proposed approach generates a corrected version of the speech in the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.
As interest in Korean as a foreign language has grown, the number of Korean learners has increased sharply, and computer-assisted pronunciation training (CAPT) applications that use speech and language processing technology are being actively researched. Nevertheless, existing Korean speaking-instruction systems do not make full use of the linguistic characteristics of Korean as spoken by foreigners, nor do they apply the latest language processing technology. Possible causes are that non-native Korean speech has not been analyzed sufficiently and that, even where relevant studies exist, further research is needed before their findings can be reflected in an automated system. Moreover, CAPT technology in general relies on feature extraction through signal processing, prosodic analysis, and natural language processing techniques, so finding suitable features and extracting them accurately takes considerable time and effort. This suggests that recent deep-learning-based language processing leaves substantial room for improving this process.
This study therefore first analyzes pronunciation variation patterns and their linguistic correlates for CAPT system development. It contrasts the variation patterns in read speech by non-native speakers with those in read speech by native Korean speakers, identifies the salient variations, and uses correlation analysis to determine how strongly each affects communication. The results confirm that coda deletion, confusion of the three-way contrast, and suprasegmental errors should be given priority in feedback generation.
Automatically generating corrected feedback is one of the key tasks of a CAPT system. This study regards the task as a problem of changing the style of an utterance and proposes modeling it with a cycle-consistent generative adversarial network (CycleGAN). The generator learns a mapping between the distribution of non-native speech and that of native speech, and the cycle-consistency loss preserves the overall structure of the utterance while preventing excessive correction. Because the necessary features are learned without supervision within the CycleGAN framework, with no separate feature extraction step, the method extends easily to other languages.
The priorities among the salient variations revealed by the linguistic analysis are modeled in an Auxiliary Classifier CycleGAN. This method adds linguistic knowledge to the existing CycleGAN so that it generates the feedback speech and, at the same time, classifies which type of error the feedback addresses. Its significance is that domain knowledge is retained, and remains controllable, through to the corrective feedback generation stage.
To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 utterances by 217 speakers of 27 mother tongues, and a perceptual evaluation of whether and how much the pronunciation improved was carried out. The proposed method can convert speech to corrected pronunciation while preserving the learner's own voice, and it showed a relative improvement of 16.67% over the conventional Pitch-Synchronous Overlap-and-Add algorithm.
Chapter 1. Introduction
1.1. Motivation
1.1.1. An Overview of CAPT Systems
1.1.2. Survey of existing Korean CAPT Systems
1.2. Problem Statement
1.3. Thesis Structure
Chapter 2. Pronunciation Analysis of Korean Produced by Chinese
2.1. Comparison between Korean and Chinese
2.1.1. Phonetic and Syllable Structure Comparisons
2.1.2. Phonological Comparisons
2.2. Related Works
2.3. Proposed Analysis Method
2.3.1. Corpus
2.3.2. Transcribers and Agreement Rates
2.4. Salient Pronunciation Variations
2.4.1. Segmental Variation Patterns
2.4.1.1. Discussions
2.4.2. Phonological Variation Patterns
2.4.2.1. Discussions
2.5. Summary
Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation
3.1. Related Works
3.1.1. Criteria used in L2 Speech
3.1.2. Criteria used in L2 Korean Speech
3.2. Proposed Human Evaluation Method
3.2.1. Reading Prompt Design
3.2.2. Evaluation Criteria Design
3.2.3. Raters and Agreement Rates
3.3. Linguistic Factors Affecting L2 Korean Accentedness
3.3.1. Pearson's Correlation Analysis
3.3.2. Discussions
3.3.3. Implications for Automatic Feedback Generation
3.4. Summary
Chapter 4. Corrective Feedback Generation for CAPT
4.1. Related Works
4.1.1. Prosody Transplantation
4.1.2. Recent Speech Conversion Methods
4.1.3. Evaluation of Corrective Feedback
4.2. Proposed Method: Corrective Feedback as a Style Transfer
4.2.1. Speech Analysis at Spectral Domain
4.2.2. Self-imitative Learning
4.2.3. An Analogy: CAPT System and GAN Architecture
4.3. Generative Adversarial Networks
4.3.1. Conditional GAN
4.3.2. CycleGAN
4.4. Experiment
4.4.1. Corpus
4.4.2. Baseline Implementation
4.4.3. Adversarial Training Implementation
4.4.4. Spectrogram-to-Spectrogram Training
4.5. Results and Evaluation
4.5.1. Spectrogram Generation Results
4.5.2. Perceptual Evaluation
4.5.3. Discussions
4.6. Summary
Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation
5.1. Linguistic Class Selection
5.2. Auxiliary Classifier CycleGAN Design
5.3. Experiment and Results
5.3.1. Corpus
5.3.2. Feature Annotations
5.3.3. Experiment Setup
5.3.4. Results
5.4. Summary
Chapter 6. Conclusion
6.1. Thesis Results
6.2. Thesis Contributions
6.3. Recommendations for Future Work
Bibliography
Appendix
Abstract in Korean
Acknowledgments
Docto
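The CycleGAN setting named in this thesis abstract depends on a cycle-consistency loss, which can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the thesis implementation: the generators `g`, `f`, and `f_bad` are hypothetical stand-ins for neural networks, the inputs are short lists rather than spectrograms, and the L1 distance follows the usual CycleGAN formulation.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x_nonnative, y_native, g, f):
    """L1 cycle loss: |F(G(x)) - x| + |G(F(y)) - y|."""
    forward = l1(f(g(x_nonnative)), x_nonnative)   # nonnative -> native -> nonnative
    backward = l1(g(f(y_native)), y_native)        # native -> nonnative -> native
    return forward + backward


# Hypothetical toy generators (assumptions for illustration only)
def g(x):
    return [v + 1.0 for v in x]      # maps "nonnative" features toward "native"


def f(y):
    return [v - 1.0 for v in y]      # exact inverse of g, so the cycle closes


def f_bad(y):
    return [v - 0.5 for v in y]      # imperfect inverse, so the cycle leaves a residue


x = [0.0, 2.0]
y = [1.0, 3.0]
loss_good = cycle_consistency_loss(x, y, g, f)
loss_bad = cycle_consistency_loss(x, y, g, f_bad)
```

The loss is zero only when each mapping undoes the other, which is how the constraint discourages a generator from changing more of the utterance than the accent.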
Apraxia World: Deploying a Mobile Game and Automatic Speech Recognition for Independent Child Speech Therapy
Children with speech sound disorders typically improve pronunciation quality by undergoing speech therapy, which must be delivered frequently and with high intensity to be effective. As such, clinic sessions are supplemented with home practice, often under caregiver supervision. However, traditional home practice can grow boring for children due to monotony. Furthermore, practice frequency is limited by caregiver availability, making it difficult for some children to reach therapy dosage. To address these issues, this dissertation presents a novel speech therapy game to increase engagement, and explores automatic pronunciation evaluation techniques to afford children independent practice.
The therapy game, called Apraxia World, delivers customizable, repetition-based speech therapy while children play through platformer-style levels using typical on-screen tablet controls; children complete in-game speech exercises to collect assets required to progress through the levels. Additionally, Apraxia World provides pronunciation feedback from an automated pronunciation evaluation system running locally on the tablet. Apraxia World offers two advantages over current commercial and research speech therapy games: first, the game provides extended gameplay to support long therapy treatments; second, it affords some independence in therapy practice via automatic pronunciation evaluation, allowing caregivers to lightly supervise the practice instead of directly administering it. Pilot testing indicated that children enjoyed the game-based therapy much more than traditional practice and that the exercises did not interfere with gameplay. During a longitudinal study, children made clinically significant pronunciation improvements while playing Apraxia World at home. Furthermore, children remained engaged in the game-based therapy over the two-month testing period and some even wanted to continue playing post-study.
The second part of the dissertation explores word- and phoneme-level pronunciation verification for child speech therapy applications. Word-level pronunciation verification is accomplished using a child-specific template-matching framework, where an utterance is compared against correctly and incorrectly pronounced examples of the word. This framework identified mispronounced words better than both a standard automated baseline and co-located caregivers. Phoneme-level mispronunciation detection is investigated using a technique from the second-language learning literature: training phoneme-specific classifiers with phonetic posterior features. This method also outperformed the standard baseline, but more significantly, identified mispronunciations better than student clinicians
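The word-level template-matching idea described above can be sketched as a nearest-template decision. The sketch assumes each utterance has already been reduced to a fixed-length feature vector and uses Euclidean distance; both are simplifications, and the names `distance` and `verify_pronunciation` are hypothetical rather than taken from the dissertation.

```python
def distance(a, b):
    """Euclidean distance between two fixed-length feature vectors (assumed representation)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def verify_pronunciation(utterance, correct_templates, incorrect_templates):
    """Label an utterance by whichever template set contains its nearest example."""
    best_correct = min(distance(utterance, t) for t in correct_templates)
    best_incorrect = min(distance(utterance, t) for t in incorrect_templates)
    return "correct" if best_correct <= best_incorrect else "mispronounced"


# Toy child-specific templates for one word (made-up numbers)
correct_templates = [[1.0, 1.0], [1.2, 0.9]]
incorrect_templates = [[3.0, 3.0]]

label_a = verify_pronunciation([1.1, 1.0], correct_templates, incorrect_templates)
label_b = verify_pronunciation([2.8, 3.1], correct_templates, incorrect_templates)
```

A real system would likely compare variable-length utterances with an alignment-based distance instead of a fixed-length vector distance.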
Automatic Proficiency Evaluation of Spoken English by Japanese Learners for Dialogue-Based Language Learning System Based on Deep Learning
Tohoku University, 伊藤彰則 (Akinori Ito), 課
Dealing with linguistic mismatches for automatic speech recognition
Recent breakthroughs in automatic speech recognition (ASR) have resulted in a word error rate (WER) on par with human transcribers on the English Switchboard benchmark. However, dealing with linguistic mismatches between the training and testing data remains a significant unsolved challenge. In a monolingual setting, it is well known that the performance of ASR systems degrades significantly when they are presented with speech from speakers whose accents, dialects, and speaking styles differ from those encountered during system training. In a multilingual setting, ASR systems trained on a source language perform even worse when tested on another target language because of mismatches in the number of phonemes, lexical ambiguity, and the power of phonotactic constraints provided by phone-level n-grams.
To address linguistic mismatches in current ASR systems, my dissertation investigates both knowledge-gnostic and knowledge-agnostic solutions. In the first part, classic theories of acoustics and articulatory phonetics that can be transferred across a dialect continuum, from local dialects to a standardized language, are revisited. Experiments demonstrate that acoustic correlates in the vicinity of landmarks could help bridge mismatches across different local or global varieties in a dialect continuum. In the second part, we design an end-to-end acoustic modeling approach based on the connectionist temporal classification loss and propose linking the training of acoustics and accent in a manner similar to the learning process in human speech perception. This joint model not only performed well on ASR with multiple accents but also improved accent identification accuracy compared with separately trained models
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to
considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications.
This thesis presents research conducted into the automatic assessment of spontaneous non-native English speech, with a view to providing meaningful feedback to learners. One
of the challenges in automatic spoken language assessment is giving candidates feedback on
particular aspects, or views, of their spoken language proficiency, in addition to the overall
holistic score normally provided. Another is detecting pronunciation and other types of errors
at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can
suffer issues of consistency. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker's ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single
views, specifically each of pronunciation, rhythm, intonation and text. By training each
system on large quantities of candidate data, individual-view information should be possible
to extract. The relationships between the predictions of each system are evaluated to examine
whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict holistic score are investigated, namely averaging their
predictions and concatenating and attending over their intermediate representations. The
combined graders are compared to each other and to baseline approaches.
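The three ways of combining the single-view systems (averaging predictions, concatenating intermediate representations, and attending over them) can be sketched as follows. This is a toy under stated assumptions: representations are short lists, the attention is plain dot-product attention with a given query vector, the function names are hypothetical, and the grading layers that would follow concatenation or attention in a real system are omitted.

```python
import math


def average_predictions(scores):
    """Combine per-view scores by taking their mean."""
    return sum(scores) / len(scores)


def concatenate(representations):
    """Join per-view representation vectors into one long vector."""
    return [v for rep in representations for v in rep]


def attend(representations, query):
    """Dot-product attention: softmax-weighted sum of equal-length representations."""
    logits = [sum(q * v for q, v in zip(query, rep)) for rep in representations]
    peak = max(logits)                                # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(representations[0])
    return [sum(w * rep[k] for w, rep in zip(weights, representations))
            for k in range(dim)]


# Made-up scores from pronunciation, rhythm, intonation and text graders
avg = average_predictions([4.0, 5.0, 3.0, 4.0])
joined = concatenate([[1, 2], [3, 4]])
mixed = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])   # a zero query gives uniform weights
```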
The tasks of error detection and error tendency diagnosis become particularly challenging
when the speech in question is spontaneous and particularly given the challenges posed by
the inconsistency of human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate's speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated and a method presented for detecting individual accent and
lexical errors and diagnosing accent error tendencies at the speaker level
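The phone substitution, deletion and insertion patterns mentioned in this abstract can be made concrete with a small alignment sketch. It assumes a canonical phone sequence and a recognized phone sequence are both available as lists of symbols and aligns them by standard edit distance; the function name `align_phones` and the phone labels are hypothetical, and this is not the thesis's detection method.

```python
def align_phones(canonical, realized):
    """List the substitutions, deletions and insertions that turn canonical into realized."""
    n, m = len(canonical), len(realized)
    # cost[i][j] = edit distance between canonical[:i] and realized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if canonical[i - 1] == realized[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # match or substitution
                             cost[i - 1][j] + 1,         # deletion
                             cost[i][j - 1] + 1)         # insertion

    # Walk back through the table, recording each non-matching step
    errors = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if canonical[i - 1] == realized[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                if sub:
                    errors.append(("substitution", canonical[i - 1], realized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            errors.append(("insertion", None, realized[j - 1]))
            j -= 1
    errors.reverse()
    return errors


subs = align_phones(["t", "r", "iy"], ["t", "l", "iy"])
dels = align_phones(["s", "t", "aa", "p"], ["s", "aa", "p"])
ins = align_phones(["s", "p"], ["s", "uh", "p"])
```

Counting how often the same substitution recurs across a speaker's utterances is one simple way such an alignment could feed a speaker-level tendency diagnosis.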