Self-imitating Feedback Generation Using GAN for Computer-Assisted Pronunciation Training
Self-imitating feedback is an effective and learner-friendly method for
non-native learners in Computer-Assisted Pronunciation Training. Acoustic
characteristics of native utterances are extracted and transplanted onto the
learner's own speech input, then given back to the learner as corrective
feedback. Previous work focused on speech conversion using prosodic
transplantation techniques based on the PSOLA algorithm. Motivated by the visual
differences found in spectrograms of native and non-native speech, we
investigated applying a GAN to generate self-imitating feedback by exploiting
the ability the generator acquires through adversarial training. Because this
mapping is highly under-constrained, we also adopt a cycle consistency loss to
encourage the output to preserve the global structure, which is shared by native
and non-native utterances. Trained on 97,200 spectrogram images of short utterances
produced by native and non-native speakers of Korean, the generator is able to
successfully transform the non-native spectrogram input into a spectrogram with
properties of self-imitating feedback. Furthermore, the transformed spectrogram
shows segmental corrections that cannot be obtained by prosodic
transplantation. A perceptual test comparing the self-imitating and correcting
abilities of our method with the baseline PSOLA method shows that the
generative approach with cycle consistency loss is promising.
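The cycle consistency idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes an L1 cycle loss of the usual CycleGAN form and uses toy arithmetic functions as hypothetical stand-ins for the two generators; the names `G`, `F_exact`, `F_off` and the array shapes are illustrative only and are not taken from the paper.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: mean|F(G(x)) - x| + mean|G(F(y)) - y|."""
    forward = np.mean(np.abs(F(G(x)) - x))   # non-native -> native -> non-native
    backward = np.mean(np.abs(G(F(y)) - y))  # native -> non-native -> native
    return forward + backward

# Toy stand-ins for the generators (assumption: real ones are neural networks
# operating on spectrogram images).
G = lambda s: s + 1.0        # hypothetical "non-native -> native" mapping
F_exact = lambda s: s - 1.0  # exact inverse of G, so the cycle returns the input
F_off = lambda s: s - 0.5    # imperfect inverse, so the cycle drifts by 0.5

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))    # batch of fake "non-native" spectrograms
y = rng.random((4, 8, 8))    # batch of fake "native" spectrograms

loss_exact = cycle_consistency_loss(x, y, G, F_exact)
loss_off = cycle_consistency_loss(x, y, G, F_off)
```

With an exact inverse the loss is numerically zero, while the imperfect inverse is penalised in both directions. This is the sense in which the loss constrains an otherwise under-constrained mapping.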
Towards automatic assessment of spontaneous spoken English
With increasing global demand for learning English as a second language, there has been considerable interest in
methods of automatic assessment of spoken language proficiency for use in interactive electronic learning tools as
well as for grading candidates for formal qualifications. This paper presents an automatic system to address the
assessment of spontaneous spoken language. Prompts or questions requiring spontaneous speech responses elicit
more natural speech which better reflects a learner's proficiency level than read speech. In addition to the challenges
of highly variable non-native learner speech and noisy real-world recording conditions, this requires any automatic
system to handle disfluent, non-grammatical, spontaneous speech with the underlying text unknown. To handle these,
a strong deep learning based speech recognition system is applied in combination with a Gaussian Process (GP)
grader. A range of features derived from the audio using the recognition hypothesis are investigated for their efficacy
in the automatic grader. The proposed system is shown to predict grades at a similar level to the original examiner
graders on real candidate entries. Interpolation with the examiner grades further boosts performance. The ability to
reject poorly estimated grades is also important and measures are proposed to evaluate the performance of rejection
schemes. The GP variance is used to decide which automatic grades should be rejected. Back-off to an expert grader
for the least confident grades gives gains.
Cambridge Assessment Englis
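The rejection scheme described in the abstract above (use the GP variance to decide which automatic grades to reject, then back off to an expert grader) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `back_off_to_expert`, the fixed rejection fraction, and the example numbers are hypothetical, and the predictive variances are taken as given rather than computed by an actual Gaussian Process.

```python
import numpy as np

def back_off_to_expert(auto_grades, variances, expert_grades, reject_fraction):
    """Replace the least confident automatic grades with expert grades.

    Confidence is judged by predictive variance: the larger the variance,
    the less the automatic grade is trusted.
    """
    auto_grades = np.asarray(auto_grades, dtype=float)
    variances = np.asarray(variances, dtype=float)
    expert_grades = np.asarray(expert_grades, dtype=float)

    n_reject = int(round(reject_fraction * len(auto_grades)))
    final = auto_grades.copy()
    if n_reject > 0:
        # Indices of the n_reject candidates with the largest variance.
        rejected = np.argsort(variances)[-n_reject:]
        final[rejected] = expert_grades[rejected]
    return final

# Hypothetical data for four candidates.
auto = [3.0, 4.5, 2.0, 5.0]
var = [0.1, 0.9, 0.2, 0.05]
expert = [3.5, 4.0, 2.5, 5.5]

final = back_off_to_expert(auto, var, expert, reject_fraction=0.25)
```

Here only the second candidate, whose variance is highest, is handed to the expert; the other automatic grades are kept.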
Pronunciation Variation Analysis and CycleGAN-based Feedback Generation for CAPT
Doctoral thesis, Seoul National University Graduate School, College of Humanities, Interdisciplinary Program in Cognitive Science, February 2020. Minhwa Chung.
Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, the existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize the linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which makes it difficult to incorporate such knowledge into an automatic system. Moreover, most of the existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they necessarily depend on finding the right features for the task and on the accuracy with which those features are extracted.
This thesis presents a new approach for corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates with accentedness are analyzed and combined with a deep neural network approach, so that feature engineering efforts are minimized while maintaining the linguistically important factors for the corrective feedback generation task. Investigations on non-native Korean speech characteristics in contrast with those of native speakers, and their correlation with accentedness judgement show that both segmental and prosodic variations are important factors in a Korean CAPT system.
The present thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers of 27 mother tongue backgrounds. The features are automatically learnt in an unsupervised way in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to the native speech distribution. In order to inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types that were defined in the first half of the thesis. The proposed approach generates a corrected version of the speech using the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.
As interest in Korean as a foreign language grows, the number of learners of Korean is increasing considerably, and research on Computer-Assisted Pronunciation Training (CAPT) applications that apply speech and language processing technology is also being actively pursued. Nevertheless, existing Korean speaking education systems do not make sufficient use of the linguistic characteristics of non-native Korean speech, nor do they apply the latest language processing technology. Possible reasons are that the phenomena of non-native Korean speech have not been analyzed sufficiently and that, even where relevant studies exist, more advanced research is needed before their findings can be reflected in an automated system. In addition, CAPT technology as a whole relies on feature extraction through signal processing, prosodic analysis, and natural language processing techniques, so finding suitable features and extracting them accurately requires a great deal of time and effort. This suggests that there is considerable room to improve this process by using recent deep learning based language processing technology.
This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for CAPT system development. The variation patterns in read speech by non-native speakers were contrasted with those of native Korean speakers to identify the major variations, and a correlation analysis was then used to determine how strongly each affects communication. The results confirmed that coda deletion, confusion of the three-way contrast, and suprasegmental errors should be given priority in feedback generation.
Automatically generating corrected feedback is one of the important tasks of a CAPT system. This study takes the view that the task can be interpreted as a problem of changing the style of an utterance, and proposes to model it in a Cycle-consistent Generative Adversarial Network (CycleGAN) architecture. The generator of the GAN learns the mapping between the distribution of non-native utterances and the distribution of native utterances, and the cycle consistency loss preserves the overall structure of the utterance while preventing excessive correction. Because the necessary features are learned by the CycleGAN framework itself in an unsupervised manner, without a separate feature extraction step, the method is easy to extend to other languages.
The priorities among the major variations revealed by the linguistic analysis are modeled in an Auxiliary Classifier CycleGAN architecture. This method incorporates knowledge into the existing CycleGAN so that it generates the feedback speech and at the same time classifies which type of error the feedback addresses. Its significance lies in the advantage that domain knowledge is retained, and remains controllable, up to the corrective feedback generation stage.
To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 utterances by 217 speakers with 27 different native languages, and a perceptual evaluation of whether and how much the pronunciation improved was carried out. With the proposed method it is possible to convert an utterance into corrected pronunciation while keeping the learner's own voice, and a relative improvement of 16.67% was observed over the conventional method based on the Pitch-Synchronous Overlap-and-Add algorithm.
Chapter 1. Introduction
1.1. Motivation
1.1.1. An Overview of CAPT Systems
1.1.2. Survey of existing Korean CAPT Systems
1.2. Problem Statement
1.3. Thesis Structure
Chapter 2. Pronunciation Analysis of Korean Produced by Chinese
2.1. Comparison between Korean and Chinese
2.1.1. Phonetic and Syllable Structure Comparisons
2.1.2. Phonological Comparisons
2.2. Related Works
2.3. Proposed Analysis Method
2.3.1. Corpus
2.3.2. Transcribers and Agreement Rates
2.4. Salient Pronunciation Variations
2.4.1. Segmental Variation Patterns
2.4.1.1. Discussions
2.4.2. Phonological Variation Patterns
2.4.1.2. Discussions
2.5. Summary
Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation
3.1. Related Works
3.1.1. Criteria used in L2 Speech
3.1.2. Criteria used in L2 Korean Speech
3.2. Proposed Human Evaluation Method
3.2.1. Reading Prompt Design
3.2.2. Evaluation Criteria Design
3.2.3. Raters and Agreement Rates
3.3. Linguistic Factors Affecting L2 Korean Accentedness
3.3.1. Pearson's Correlation Analysis
3.3.2. Discussions
3.3.3. Implications for Automatic Feedback Generation
3.4. Summary
Chapter 4. Corrective Feedback Generation for CAPT
4.1. Related Works
4.1.1. Prosody Transplantation
4.1.2. Recent Speech Conversion Methods
4.1.3. Evaluation of Corrective Feedback
4.2. Proposed Method: Corrective Feedback as a Style Transfer
4.2.1. Speech Analysis at Spectral Domain
4.2.2. Self-imitative Learning
4.2.3. An Analogy: CAPT System and GAN Architecture
4.3. Generative Adversarial Networks
4.3.1. Conditional GAN
4.3.2. CycleGAN
4.4. Experiment
4.4.1. Corpus
4.4.2. Baseline Implementation
4.4.3. Adversarial Training Implementation
4.4.4. Spectrogram-to-Spectrogram Training
4.5. Results and Evaluation
4.5.1. Spectrogram Generation Results
4.5.2. Perceptual Evaluation
4.5.3. Discussions
4.6. Summary
Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation
5.1. Linguistic Class Selection
5.2. Auxiliary Classifier CycleGAN Design
5.3. Experiment and Results
5.3.1. Corpus
5.3.2. Feature Annotations
5.3.3. Experiment Setup
5.3.4. Results
5.4. Summary
Chapter 6. Conclusion
6.1. Thesis Results
6.2. Thesis Contributions
6.3. Recommendations for Future Work
Bibliography
Appendix
Abstract in Korean
Acknowledgments
Docto
Impact of ASR performance on free speaking language assessment
In free speaking tests, candidates respond in spontaneous speech to prompts. This form of test allows the spoken language proficiency of a non-native speaker of English to be assessed more fully than read-aloud tests. As the candidate's responses are unscripted, transcription by automatic speech recognition (ASR) is essential for automated assessment. ASR will never be 100% accurate, so any assessment system must seek to minimise and mitigate ASR errors. This paper considers the impact of ASR errors on the performance of free speaking test auto-marking systems. First, rich linguistically related features, based on part-of-speech tags from statistical parse trees, are investigated for assessment. Then, the impact of ASR errors on how well the system can detect whether a learner's answer is relevant to the question asked is evaluated. Finally, the impact that these errors may have on the ability of the system to provide detailed feedback to the learner is analysed. In particular, pronunciation and grammatical errors are considered, as these are important in helping a learner to make progress. As feedback resulting from an ASR error would be highly confusing, an approach to mitigate this problem using confidence scores is also analysed.
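The mitigation idea at the end of the abstract above (use confidence scores so that feedback is not given on words the recogniser probably got wrong) can be sketched as a simple filter. This is an illustration only: the `DetectedError` record, the `filter_feedback` function, and the 0.8 threshold are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectedError:
    word: str              # word as hypothesised by the ASR system
    error_type: str        # e.g. "pronunciation" or "grammar"
    asr_confidence: float  # ASR confidence for this word, in [0, 1]

def filter_feedback(errors: List[DetectedError],
                    min_confidence: float = 0.8) -> List[DetectedError]:
    """Keep only errors on words the recogniser is confident about.

    Low-confidence words are more likely to be ASR mistakes, and feedback
    based on them would confuse the learner.
    """
    return [e for e in errors if e.asr_confidence >= min_confidence]

# Hypothetical detections for one response.
candidates = [
    DetectedError("comfortable", "pronunciation", 0.93),
    DetectedError("goed", "grammar", 0.41),
    DetectedError("thought", "pronunciation", 0.85),
]
feedback = filter_feedback(candidates)
```

Raising the threshold trades recall for precision: fewer errors are reported, but those that are reported are less likely to stem from a recognition mistake.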
Fluency Strategy Training and the L2 Oral Task Performance of Indonesian EFL Classroom Learners
This quasi-experimental study investigated the impacts of two instructional conditions, explicit fluency strategy training and implicit task-based instruction, on university English learners in Indonesia. The results revealed that neither instructional condition significantly improved participants' speech fluency, but the improvement in oral proficiency reached statistical significance. A degree of variability in participants' speech fluency development was also found. Both instructional conditions could be applied with potentially complementary effects in Indonesian EFL classrooms.
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to
considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications.
This thesis presents research conducted into the automatic assessment of spontaneous non-native English speech, with a view to providing meaningful feedback to learners. One
of the challenges in automatic spoken language assessment is giving candidates feedback on
particular aspects, or views, of their spoken language proficiency, in addition to the overall
holistic score normally provided. Another is detecting pronunciation and other types of errors
at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can
suffer issues of consistency. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker's ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single
views, specifically each of pronunciation, rhythm, intonation and text. By training each
system on large quantities of candidate data, individual-view information should be possible
to extract. The relationships between the predictions of each system are evaluated to examine
whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict holistic score are investigated, namely averaging their
predictions and concatenating and attending over their intermediate representations. The
combined graders are compared to each other and to baseline approaches.
The tasks of error detection and error tendency diagnosis become particularly challenging
when the speech in question is spontaneous and particularly given the challenges posed by
the inconsistency of human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate's speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated and a method presented for detecting individual accent and
lexical errors and diagnosing accent error tendencies at the speaker level.
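The three combination methods named in the abstract above (averaging the single-view predictions, concatenating the intermediate representations, and attending over them) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names, the dot-product attention with a fixed `query` vector, and the toy two-dimensional representations are hypothetical, and the sketch stops at the pooled representation rather than adding the trained layers that would map it to a holistic score.

```python
import numpy as np

def average_predictions(scores):
    """Combine single-view graders by averaging their predicted scores."""
    return float(np.mean(scores))

def concatenate_representations(reps):
    """Join the intermediate representations into one long feature vector."""
    return np.concatenate(reps)

def attend_over_representations(reps, query):
    """Pool representations with softmax attention weights.

    Assumption: weights come from a dot product with a query vector, which
    in a real system would be learned.
    """
    stacked = np.stack(reps)                 # shape (n_systems, dim)
    logits = stacked @ query                 # one logit per system
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ stacked, weights

# Hypothetical outputs of four single-view systems
# (pronunciation, rhythm, intonation, text).
scores = [4.0, 5.0, 3.0, 4.0]
reps = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
        np.array([1.0, 1.0]), np.array([0.0, 0.0])]

mean_score = average_predictions(scores)
joined = concatenate_representations(reps)
pooled, weights = attend_over_representations(reps, query=np.zeros(2))
```

With an all-zero query the attention weights are uniform, so the pooled vector equals the plain mean of the representations; a learned query would instead weight the views unequally.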