2,935 research outputs found
Meta-Learning for Phonemic Annotation of Corpora
We apply rule induction, classifier combination and meta-learning (stacked
classifiers) to the problem of bootstrapping high accuracy automatic annotation
of corpora with pronunciation information. The task we address in this paper
consists of generating phonemic representations reflecting the Flemish and
Dutch pronunciations of a word on the basis of its orthographic representation
(which in turn is based on the actual speech recordings). We compare several
possible approaches to achieve the text-to-pronunciation mapping task:
memory-based learning, transformation-based learning, rule induction, maximum
entropy modeling, combination of classifiers in stacked learning, and stacking
of meta-learners. We are interested both in optimal accuracy and in obtaining
insight into the linguistic regularities involved. As far as accuracy is
concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at
word level) for single classifiers is boosted significantly with additional
error reductions of 31% and 38% respectively using combination of classifiers,
and a further 5% using combination of meta-learners, bringing overall word
level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We
also show that the application of machine learning methods indeed leads to
increased insight into the linguistic regularities determining the variation
between the two pronunciation variants studied. Comment: 8 pages
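The classifier-combination step described above can be illustrated with a minimal Python sketch: majority voting over per-letter phoneme predictions from several base classifiers. The predictions below are invented for illustration; the paper's actual base learners (memory-based learning, transformation-based learning, rule induction, maximum entropy modeling) and its stacked meta-learners are far richer.

```python
from collections import Counter

# Hypothetical per-letter phoneme predictions from three base classifiers.
predictions = {
    "mem":    ["s", "x", "o", "l"],
    "rule":   ["s", "k", "o", "l"],
    "maxent": ["s", "x", "o", "l"],
}

def combine_by_vote(preds):
    """Simple classifier combination: majority vote per letter position."""
    n = len(next(iter(preds.values())))
    combined = []
    for i in range(n):
        votes = Counter(clf_output[i] for clf_output in preds.values())
        combined.append(votes.most_common(1)[0][0])
    return combined

print(combine_by_vote(predictions))  # ['s', 'x', 'o', 'l']
```

In stacked learning, the base classifiers' outputs (rather than a fixed vote) would themselves become the input features of a trained meta-classifier.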
Articulation rate as a metric in spoken language assessment
Copyright © 2019 ISCA. Automated evaluation of non-native pronunciation provides a consistent and more cost-efficient alternative to human evaluation. To that end, there is considerable interest in deriving metrics based on the cues human listeners use to judge pronunciation. Previous research reported the use of phonetic features such as vowel characteristics in automated spoken language evaluation. The present study extends this line of work on the significance of phonetic features in automated evaluation of L2 speech (both assessment and feedback). Predictive modelling techniques examined the relationship between various articulation rate metrics on the one hand, and the proficiency and L1 background of non-native English speakers on the other. It was found that the optimal predictive model was one in which the phonetic details of phoneme articulation were factored into the analysis of articulation rate. Model performance also varied according to the L1 background of speakers. The implications for assessment and feedback are discussed. Leverhulme ECF Fellowship; ALTA project
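A basic articulation rate metric of the kind examined here is the number of phones per second of phonation time, i.e. with pauses excluded. A minimal sketch, using hypothetical durations rather than the study's own data:

```python
def articulation_rate(n_phones, total_dur_s, pause_dur_s):
    """Phones per second of phonation time (total duration minus pauses)."""
    return n_phones / (total_dur_s - pause_dur_s)

# Hypothetical utterance: 42 phones in 12 s of speech containing 2 s of pauses.
print(articulation_rate(42, 12.0, 2.0))  # 4.2
```

The study's refined variants additionally factor the phonetic detail of which phonemes were articulated into this rate.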
Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech
Rapid population aging has stimulated the development of assistive
devices that provide personalized medical support to people suffering from
various etiologies. One prominent clinical application is a computer-assisted
speech training system which enables personalized speech therapy to patients
impaired by communicative disorders in the patient's home environment. Such a
system relies on the robust automatic speech recognition (ASR) technology to be
able to provide accurate articulation feedback. With the long-term aim of
developing off-the-shelf ASR systems that can be incorporated in clinical
context without prior speaker information, we compare the ASR performance of
speaker-independent bottleneck and articulatory features on dysarthric speech
used in conjunction with dedicated neural network-based acoustic models that
have been shown to be robust against spectrotemporal deviations. We report ASR
performance of these systems on two dysarthric speech datasets of different
characteristics to quantify the achieved performance gains. Despite the
remaining performance gap between the dysarthric and normal speech, significant
improvements have been reported on both datasets using speaker-independent ASR
architectures. Comment: to appear in Computer Speech & Language -
https://doi.org/10.1016/j.csl.2019.05.002 - arXiv admin note: substantial
text overlap with arXiv:1807.1094
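Bottleneck features of the kind compared above are the activations of a deliberately narrow hidden layer of a trained acoustic network, used as a compact, speaker-independent representation of each frame. The sketch below uses random weights purely to show the mechanics; in practice the weights come from a network trained on a large speech corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights of a trained acoustic network with a narrow
# 13-dimensional "bottleneck" hidden layer.
W1 = rng.normal(size=(40, 128))   # 40 filterbank inputs -> wide hidden layer
W2 = rng.normal(size=(128, 13))   # wide hidden layer -> bottleneck layer
relu = lambda x: np.maximum(0.0, x)

def bottleneck_features(frames):
    """Forward acoustic frames up to the bottleneck layer and return
    its low-dimensional activations as features for a downstream ASR."""
    return relu(relu(frames @ W1) @ W2)

frames = rng.normal(size=(100, 40))   # 100 acoustic frames, 40 dims each
feats = bottleneck_features(frames)
print(feats.shape)  # (100, 13)
```

The resulting 13-dimensional frames would then feed the dedicated neural acoustic models mentioned in the abstract in place of (or alongside) standard spectral features.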
Neural representations for modeling variation in speech
Variation in speech is often quantified by comparing phonetic transcriptions
of the same utterance. However, manually transcribing speech is time-consuming
and error prone. As an alternative, therefore, we investigate the extraction of
acoustic embeddings from several self-supervised neural models. We use these
representations to compute word-based pronunciation differences between
non-native and native speakers of English, and between Norwegian dialect
speakers. For comparison with several earlier studies, we evaluate how well
these differences match human perception by comparing them with available human
judgements of similarity. We show that speech representations extracted from a
specific type of neural model (i.e. Transformers) lead to a better match with
human perception than two earlier approaches on the basis of phonetic
transcriptions and MFCC-based acoustic features. We furthermore find that
features from the neural models are generally best extracted from one of the
middle hidden layers rather than from the final layer. We also demonstrate that neural
speech representations not only capture segmental differences, but also
intonational and durational differences that cannot adequately be represented
by a set of discrete symbols used in phonetic transcriptions. Comment: Submitted to Journal of Phonetics
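One standard way to turn two variable-length sequences of frame embeddings into a single word-based pronunciation-difference score is dynamic time warping (DTW). A minimal sketch with toy one-dimensional "embeddings" (the study's actual features come from self-supervised Transformer layers):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of embedding
    vectors (frames x dims), aligning frames before summing Euclidean
    frame-to-frame costs along the best warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Identical sequences have zero distance; a deviating frame adds cost.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [1.0], [3.0]])
print(dtw_distance(x, x))  # 0.0
print(dtw_distance(x, y))  # 1.0
```

Averaging such distances over many words, and correlating them with human similarity judgements, mirrors the evaluation described in the abstract.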
A computational model for studying L1's effect on L2 speech learning
abstract: Much evidence has shown that the first language (L1) plays an important role in the formation of the L2 phonological system during second language (L2) learning. Together with the fact that different L1s have distinct phonological patterns, this points to diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are also correlated with perceived accentedness, and that the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between L1s and the L2 will manifest themselves in the accented speech produced by speakers from these L1s. To test these hypotheses, this study develops a computational model to analyze the properties of accented speech in both the segmental (short-term speech measurements at the short-segment or phoneme level) and suprasegmental (long-term speech measurements at the word, long-segment, or sentence level) feature space. The benefit of using a computational model is that it enables quantitative analysis of the L1's effect on accent in terms of different phonological properties. The core parts of this computational model are feature extraction schemes that derive pronunciation and prosody representations of accented speech based on existing techniques in the speech processing field. Correlation analysis on both the segmental and suprasegmental feature spaces is conducted to examine the relationship between acoustic measurements related to L1s and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly in the segmental and suprasegmental feature spaces.
Results unveil the potential of the methodology in this study to provide quantitative analysis of accented speech and to extend current studies in L2 speech learning theory to a large scale. Practically, this study further shows that the proposed computational model can benefit automatic accentedness evaluation systems by adding features related to speakers' L1s. Dissertation/Thesis. Doctoral Dissertation, Speech and Hearing Science, 201
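The correlation step of such a model can be sketched as a Pearson correlation between a per-speaker phonological-distance measure and perceived accentedness ratings. All numbers below are invented for illustration, not the dissertation's data:

```python
import numpy as np

# Hypothetical per-speaker data: phonological distance between the accented
# L2 speech and the speaker's L1 speech, and a perceived accentedness rating.
distance     = np.array([0.10, 0.25, 0.40, 0.55, 0.70])
accentedness = np.array([4.5, 4.0, 3.2, 2.6, 2.1])

# Pearson correlation; a negative r is the pattern the dissertation
# hypothesizes for some phonological properties.
r = np.corrcoef(distance, accentedness)[0, 1]
print(round(r, 3))  # strongly negative for this toy data
```

Multiple regression would extend this by predicting accentedness from several segmental and suprasegmental measures jointly.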
Pronunciation Variation Analysis and CycleGAN-based Feedback Generation for CAPT
Thesis (Doctoral), February 2020. Despite the growing popularity in learning Korean as a foreign language and the rapid development of language learning applications, existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize the linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which may pose a difficulty in incorporating such knowledge into an automatic system. Moreover, most existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods entail limitations since they necessarily depend on finding the right features for the task and on the extraction accuracies.
This thesis presents a new approach for corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates with accentedness are analyzed and combined with a deep neural network approach, so that feature engineering efforts are minimized while maintaining the linguistically important factors for the corrective feedback generation task. Investigations on non-native Korean speech characteristics in contrast with those of native speakers, and their correlation with accentedness judgement show that both segmental and prosodic variations are important factors in a Korean CAPT system.
The present thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers from 27 mother tongue backgrounds. The features are learnt automatically in an unsupervised way in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to native speech distributions. In order to inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types that were defined in the first half of the thesis. The proposed approach generates a corrected version of the speech using the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.
(Korean abstract, translated.) With growing interest in Korean as a foreign language, the number of Korean learners has increased sharply, and research on computer-assisted pronunciation training (CAPT) applications using spoken language processing technology is also active. Nevertheless, existing Korean pronunciation training systems make little use of the linguistic characteristics of non-native Korean, and recent language processing techniques have yet to be applied. Possible causes are that analysis of non-native Korean speech has been insufficient, and that even where related findings exist, more advanced research is needed before they can be reflected in an automated system. Moreover, CAPT techniques in general have depended on feature extraction via signal processing, prosodic analysis, and natural language processing, so that finding suitable features and extracting them accurately requires considerable time and effort; this suggests room for improvement through recent deep-learning-based language processing techniques. This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for CAPT system development: read-speech variation patterns of non-native learners were contrasted with those of native Korean speakers, the salient variations were identified, and correlation analysis established their importance for communication. The results confirmed that feedback generation should give priority to errors involving syllable-final consonants, confusion of the three-way stop contrast, and phonological rules. Automatically generating corrective feedback is one of the key tasks of a CAPT system. This study interpreted this task as a problem of style transfer of the utterance and proposed modeling it with a cycle-consistent generative adversarial network (CycleGAN): the generator of the GAN learns a mapping from the non-native speech distribution to the native speech distribution, and the cycle-consistency loss preserves the overall structure of the utterance while preventing over-correction. Without a separate feature extraction step, the necessary features are learned in an unsupervised manner within the CycleGAN framework, which also makes the method easy to extend to other languages. The priorities among the salient variations revealed by the linguistic analysis were modeled in an auxiliary classifier CycleGAN structure. This method grafts linguistic knowledge onto the existing CycleGAN so that feedback generation and classification of the error type are performed jointly; its significance lies in the advantage that domain knowledge is maintained and controllable through the feedback generation stage. To evaluate the proposed method, a feedback generation model was trained on 65,100 read utterances by 217 speakers of 27 mother tongues, and perceptual evaluations of improvement were conducted. With the proposed method, the learner's speech can be converted into corrected pronunciation while preserving the learner's own voice, with an improvement of about 16.67% over the conventional Pitch-Synchronous Overlap-and-Add method.
Chapter 1. Introduction
1.1. Motivation
1.1.1. An Overview of CAPT Systems
1.1.2. Survey of Existing Korean CAPT Systems
1.2. Problem Statement
1.3. Thesis Structure
Chapter 2. Pronunciation Analysis of Korean Produced by Chinese
2.1. Comparison between Korean and Chinese
2.1.1. Phonetic and Syllable Structure Comparisons
2.1.2. Phonological Comparisons
2.2. Related Works
2.3. Proposed Analysis Method
2.3.1. Corpus
2.3.2. Transcribers and Agreement Rates
2.4. Salient Pronunciation Variations
2.4.1. Segmental Variation Patterns
2.4.1.1. Discussions
2.4.2. Phonological Variation Patterns
2.4.2.1. Discussions
2.5. Summary
Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation
3.1. Related Works
3.1.1. Criteria Used in L2 Speech
3.1.2. Criteria Used in L2 Korean Speech
3.2. Proposed Human Evaluation Method
3.2.1. Reading Prompt Design
3.2.2. Evaluation Criteria Design
3.2.3. Raters and Agreement Rates
3.3. Linguistic Factors Affecting L2 Korean Accentedness
3.3.1. Pearson's Correlation Analysis
3.3.2. Discussions
3.3.3. Implications for Automatic Feedback Generation
3.4. Summary
Chapter 4. Corrective Feedback Generation for CAPT
4.1. Related Works
4.1.1. Prosody Transplantation
4.1.2. Recent Speech Conversion Methods
4.1.3. Evaluation of Corrective Feedback
4.2. Proposed Method: Corrective Feedback as a Style Transfer
4.2.1. Speech Analysis at Spectral Domain
4.2.2. Self-imitative Learning
4.2.3. An Analogy: CAPT System and GAN Architecture
4.3. Generative Adversarial Networks
4.3.1. Conditional GAN
4.3.2. CycleGAN
4.4. Experiment
4.4.1. Corpus
4.4.2. Baseline Implementation
4.4.3. Adversarial Training Implementation
4.4.4. Spectrogram-to-Spectrogram Training
4.5. Results and Evaluation
4.5.1. Spectrogram Generation Results
4.5.2. Perceptual Evaluation
4.5.3. Discussions
4.6. Summary
Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation
5.1. Linguistic Class Selection
5.2. Auxiliary Classifier CycleGAN Design
5.3. Experiment and Results
5.3.1. Corpus
5.3.2. Feature Annotations
5.3.3. Experiment Setup
5.3.4. Results
5.4. Summary
Chapter 6. Conclusion
6.1. Thesis Results
6.2. Thesis Contributions
6.3. Recommendations for Future Work
Bibliography
Appendix
Abstract in Korean
Acknowledgments
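The cycle-consistency idea at the heart of the thesis's CycleGAN feedback generator can be sketched numerically: two mappings G (accented to native-like) and F (native-like to accented) are penalized whenever F(G(x)) drifts from x, which preserves the utterance's overall structure while an adversarial loss (omitted here) pulls G's output toward the native distribution. The toy "generators" below are invertible linear maps chosen so the cycle loss is numerically zero; the real generators are neural networks trained adversarially:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy invertible linear stand-ins for the two CycleGAN generators.
A = rng.normal(size=(8, 8))
B = np.linalg.inv(A)
G = lambda x: x @ A   # accented features -> native-like features
F = lambda y: y @ B   # native-like features -> accented features

def cycle_consistency_loss(x, y):
    """Mean-absolute cycle loss |F(G(x)) - x| + |G(F(y)) - y|: a round
    trip through both generators should reproduce the input features."""
    return np.abs(F(G(x)) - x).mean() + np.abs(G(F(y)) - y).mean()

x = rng.normal(size=(5, 8))   # frames of accented-speech features
y = rng.normal(size=(5, 8))   # frames of native-speech features
print(cycle_consistency_loss(x, y) < 1e-6)  # True: F exactly inverts G
```

The auxiliary classifier described in Chapter 5 would add a further loss term on top of this, requiring the generated feedback to be classifiable into the linguistic error types identified in the analysis chapters.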
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies
Directions for the future of technology in pronunciation research and teaching
This paper reports on the role of technology in state-of-the-art pronunciation research and instruction, and makes concrete suggestions for future developments. The point of departure for this contribution is that the goal of second language (L2) pronunciation research and teaching should be enhanced comprehensibility and intelligibility, as opposed to native-likeness. Three main areas are covered here. We begin with a presentation of advanced uses of pronunciation technology in research, with a special focus on the expertise required to carry out even small-scale investigations. Next, we discuss the nature of data in pronunciation research, pointing to ways in which future work can build on advances in corpus research and crowdsourcing. Finally, we consider how these insights pave the way for researchers and developers working to create research-informed, computer-assisted pronunciation teaching resources. We conclude with predictions for future developments.