175 research outputs found

    Pronunciation Modeling of Foreign Words for Mandarin ASR by Considering the Effect of Language Transfer

    One of the challenges in automatic speech recognition is the recognition of foreign words. A speaker's pronunciation of a foreign word is influenced by knowledge of their native language, a phenomenon known as language transfer. This paper examines the phonetic effect of language transfer in automatic speech recognition. A set of lexical rules is proposed to convert an English word into a Mandarin phonetic representation, so that a Mandarin lexicon can be augmented with English words. The Mandarin ASR system thereby becomes capable of recognizing English words without retraining or re-estimating the acoustic model parameters. Using the lexicon derived from the proposed rules, ASR performance on mixed Mandarin-English speech improves without harming accuracy on Mandarin-only speech. The proposed lexical rules are general and can be applied directly to unseen English words. Comment: Published by INTERSPEECH 201
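
The lexicon-augmentation idea in this abstract can be illustrated with a minimal sketch. The phone inventory, the mapping table `EN_TO_ZH`, and the helper names below are hypothetical and chosen only for illustration; they are not the rule set proposed in the paper, which the abstract does not list.

```python
# Hypothetical mapping from English (ARPAbet-style) phones to Mandarin phones.
# Illustrative assumption only; NOT the lexical rules proposed in the paper.
EN_TO_ZH = {
    "TH": ["s"],   # assumed substitution for a phone absent from Mandarin
    "V": ["w"],    # assumed substitution
    "IH": ["i"],
    "AE": ["a"],
    "N": ["n"],
}


def to_mandarin_phones(english_phones):
    """Rewrite an English phone sequence using the Mandarin phone set."""
    mandarin = []
    for phone in english_phones:
        if phone not in EN_TO_ZH:
            raise KeyError(f"no rule for English phone {phone!r}")
        mandarin.extend(EN_TO_ZH[phone])
    return mandarin


def augment_lexicon(lexicon, english_entries):
    """Return a copy of the Mandarin lexicon with English words added.

    Only the lexicon changes; the acoustic model is left untouched.
    """
    augmented = dict(lexicon)
    for word, phones in english_entries.items():
        augmented[word] = to_mandarin_phones(phones)
    return augmented


mandarin_lexicon = {"你好": ["n", "i", "h", "ao"]}
english_entries = {"thin": ["TH", "IH", "N"], "van": ["V", "AE", "N"]}
lexicon = augment_lexicon(mandarin_lexicon, english_entries)
```

Because the rules operate on phones rather than on whole words, any unseen English word with a phone-level transcription could in principle be added the same way.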

    A computational model for studying L1’s effect on L2 speech learning

    Much evidence has shown that the first language (L1) plays an important role in the formation of the L2 phonological system during second language (L2) learning. Combined with the fact that different L1s have distinct phonological patterns, this implies diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are also correlated with perceived accentedness, and that the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between the L1s and the L2 will manifest themselves in the accented speech produced by speakers from these L1s. To test the hypotheses, this study develops a computational model to analyze accented speech properties in both the segmental (short-term speech measurements at the short-segment or phoneme level) and suprasegmental (long-term speech measurements at the word, long-segment, or sentence level) feature spaces. The benefit of a computational model is that it enables quantitative analysis of the L1's effect on accent in terms of different phonological properties. The core parts of the model are feature extraction schemes that derive pronunciation and prosody representations of accented speech, based on existing techniques in the speech processing field. Correlation analysis in both the segmental and suprasegmental feature spaces is conducted to examine the relationship between L1-related acoustic measurements and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly in the segmental and suprasegmental feature spaces.
The results show the potential of the methodology for quantitative analysis of accented speech and for extending current studies in L2 speech learning theory to a large scale. Practically, the study further shows that the proposed computational model can benefit automatic accentedness evaluation systems by adding features related to speakers' L1s. Dissertation/Thesis, Doctoral Dissertation, Speech and Hearing Science 201
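
The correlation analysis described in this abstract can be sketched with a plain Pearson coefficient. The feature name `distance_to_l1` and the toy numbers are made-up assumptions for illustration; they are not data or features from the dissertation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Made-up toy values: one phonological distance per speaker (accented speech
# vs. the speaker's L1) and one perceived accentedness rating per speaker.
distance_to_l1 = [0.2, 0.4, 0.6, 0.8]
accentedness = [8.0, 6.0, 4.0, 2.0]

r = pearson_r(distance_to_l1, accentedness)
```

In this toy case the ratings fall linearly as the distance grows, so `r` comes out at (approximately) -1, the kind of negative correlation the abstract hypothesizes for some phonological properties.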

    Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

    This paper gives an overview of our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. Because little prior work exists on these regional languages, several difficulties must be addressed before building the systems, including limited speech and text resources and a lack of linguistic knowledge. This work takes Bahasa Indonesia and Thai as examples to illustrate strategies for collecting the various resources required to build ASR systems. Comment: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)

    MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH

    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features in revealing the significant contributors to mispronunciation as well as in improving the performance of MDD systems.
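
The three figures reported in this abstract are commonly computed from phone-level decision counts. The sketch below assumes one widely used set of definitions (true acceptance, false rejection, false acceptance, and true rejection split into correct and incorrect diagnoses); the abstract does not state which definitions the work itself uses, and the counts are invented for illustration.

```python
def mdd_metrics(ta, fr, fa, cd, de):
    """Compute MDD summary metrics from phone-level counts.

    ta: correct phones accepted as correct (true acceptance)
    fr: correct phones flagged as mispronounced (false rejection)
    fa: mispronounced phones accepted as correct (false acceptance)
    cd: mispronunciations detected and diagnosed correctly
    de: mispronunciations detected but diagnosed as the wrong error
    """
    tr = cd + de  # all correctly detected mispronunciations (true rejection)
    total = ta + fr + fa + tr
    if total == 0 or (ta + fr) == 0 or tr == 0:
        raise ValueError("counts are too sparse to compute the metrics")
    return {
        "detection_accuracy": (ta + tr) / total,
        "false_rejection_rate": fr / (ta + fr),
        "diagnostic_accuracy": cd / tr,
    }


# Invented counts, for illustration only.
metrics = mdd_metrics(ta=800, fr=100, fa=50, cd=40, de=10)
```

Under these assumed definitions, diagnostic accuracy is measured only over mispronunciations that were detected in the first place, which is why it can differ substantially from detection accuracy.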

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods for rapidly bootstrapping automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling, and language modeling. On the application side, the thesis also includes research on non-native and code-switching speech.

    Automatic Pronunciation Assessment -- A Review

    Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work. Comment: 9 pages, accepted to EMNLP Finding

    Teaching Lexical Stress: Effective Practice in a Mandarin ELL Context

    Current trends in teaching pronunciation to ELLs (English Language Learners) point towards a top-down approach, which emphasizes the overarching prosodic features of English rather than the proper pronunciation of consonants and vowels. One of the most integral prosodic features in English is stress. Both lexical stress (stressed syllables within a word) and sentence stress (stressed words within a sentence) play an important role in the prosodic pronunciation of English. However, some languages, such as Mandarin, lack stress in their prosodic systems, instead employing features such as tonality. Mandarin and English overlap in their fundamental prosodic structures: pitch changes are integral both to tonality in Mandarin and to stress in English. I propose that ESL instructors will instill prosodic skills, and thus make better communicators of their students, by drawing attention to this positive transfer between the two systems.

    Pronunciation Variation Analysis and CycleGAN-based Feedback Generation for CAPT

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Humanities, Interdisciplinary Program in Cognitive Science, 2020. 2. Minhwa Chung.
    Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize the linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which makes it difficult to incorporate such knowledge into an automatic system. Moreover, most existing methods rely on features extracted with signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they depend on finding the right features for the task and on the accuracy of their extraction. This thesis presents a new approach to corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates of accentedness are analyzed and combined with a deep neural network approach, so that feature engineering effort is minimized while the linguistically important factors for corrective feedback generation are retained. Investigations of non-native Korean speech characteristics in contrast with those of native speakers, and of their correlation with accentedness judgements, show that both segmental and prosodic variations are important factors in a Korean CAPT system. The thesis argues that feedback generation can be interpreted as a style transfer problem and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers from 27 mother-tongue backgrounds. The features are learnt automatically and without supervision in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to the native speech distribution. To inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types defined in the first half of the thesis. The proposed approach generates a corrected version of the speech in the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.
    Korean abstract (translated): Interest in teaching Korean as a foreign language is rising and the number of Korean learners is growing rapidly, and research on Computer-Assisted Pronunciation Training (CAPT) applications that apply speech and language processing technology is also active. Nevertheless, existing Korean speaking-education systems neither make sufficient use of the linguistic characteristics of non-native Korean speech nor apply recent language processing technology. Possible reasons are that non-native Korean speech has not been analyzed sufficiently and that, even where related studies exist, more advanced research is needed to reflect them in an automated system. In addition, CAPT technology as a whole depends on feature extraction through signal processing, prosodic analysis, and natural language processing, so finding suitable features and extracting them accurately takes considerable time and effort. This suggests that the process leaves much room for improvement through recent deep-learning-based language processing. This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for CAPT system development. Variation patterns in the read speech of non-native speakers were contrasted with those of native Korean speakers, the major variations were identified, and correlation analysis was used to determine how strongly each affects communication. The results confirmed that coda deletion, confusion of the three-way contrast, and suprasegmental errors should be given priority in feedback generation. Automatically generating corrective feedback is one of the important tasks of a CAPT system. This study takes the view that the task can be interpreted as a problem of speech style transfer and proposes to model it in a Cycle-consistent Generative Adversarial Network (CycleGAN) architecture. The generator of the GAN learns the mapping between the distribution of non-native speech and that of native speech, and the cycle consistency loss preserves the overall structure of the utterance while preventing excessive correction. The required features are learned by the CycleGAN framework itself, without supervision and without a separate feature extraction step, which makes the method easy to extend to other languages. The priorities among the major variations revealed by the linguistic analysis are modeled in an Auxiliary Classifier CycleGAN architecture. This method adds knowledge to the existing CycleGAN so that it generates feedback speech and at the same time classifies which type of error the feedback addresses. Its significance is that domain knowledge is retained, and remains controllable, up to the corrective feedback generation stage. To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 meaningful-word utterances by 217 speakers of 27 native languages, and a perceptual evaluation of whether and how much the speech improved was carried out. With the proposed method, speech can be converted to a corrected pronunciation while keeping the learner's own voice, and a relative improvement of 16.67% was observed over the traditional Pitch-Synchronous Overlap-and-Add algorithm.
    Contents:
    Chapter 1. Introduction: 1.1. Motivation (1.1.1. An Overview of CAPT Systems; 1.1.2. Survey of existing Korean CAPT Systems); 1.2. Problem Statement; 1.3. Thesis Structure
    Chapter 2. Pronunciation Analysis of Korean Produced by Chinese: 2.1. Comparison between Korean and Chinese (2.1.1. Phonetic and Syllable Structure Comparisons; 2.1.2. Phonological Comparisons); 2.2. Related Works; 2.3. Proposed Analysis Method (2.3.1. Corpus; 2.3.2. Transcribers and Agreement Rates); 2.4. Salient Pronunciation Variations (2.4.1. Segmental Variation Patterns; 2.4.1.1. Discussions; 2.4.2. Phonological Variation Patterns; 2.4.1.2. Discussions); 2.5. Summary
    Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation: 3.1. Related Works (3.1.1. Criteria used in L2 Speech; 3.1.2. Criteria used in L2 Korean Speech); 3.2. Proposed Human Evaluation Method (3.2.1. Reading Prompt Design; 3.2.2. Evaluation Criteria Design; 3.2.3. Raters and Agreement Rates); 3.3. Linguistic Factors Affecting L2 Korean Accentedness (3.3.1. Pearson's Correlation Analysis; 3.3.2. Discussions; 3.3.3. Implications for Automatic Feedback Generation); 3.4. Summary
    Chapter 4. Corrective Feedback Generation for CAPT: 4.1. Related Works (4.1.1. Prosody Transplantation; 4.1.2. Recent Speech Conversion Methods; 4.1.3. Evaluation of Corrective Feedback); 4.2. Proposed Method: Corrective Feedback as a Style Transfer (4.2.1. Speech Analysis at Spectral Domain; 4.2.2. Self-imitative Learning; 4.2.3. An Analogy: CAPT System and GAN Architecture); 4.3. Generative Adversarial Networks (4.3.1. Conditional GAN; 4.3.2. CycleGAN); 4.4. Experiment (4.4.1. Corpus; 4.4.2. Baseline Implementation; 4.4.3. Adversarial Training Implementation; 4.4.4. Spectrogram-to-Spectrogram Training); 4.5. Results and Evaluation (4.5.1. Spectrogram Generation Results; 4.5.2. Perceptual Evaluation; 4.5.3. Discussions); 4.6. Summary
    Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation: 5.1. Linguistic Class Selection; 5.2. Auxiliary Classifier CycleGAN Design; 5.3. Experiment and Results (5.3.1. Corpus; 5.3.2. Feature Annotations; 5.3.3. Experiment Setup; 5.3.4. Results); 5.4. Summary
    Chapter 6. Conclusion: 6.1. Thesis Results; 6.2. Thesis Contributions; 6.3. Recommendations for Future Work
    Bibliography; Appendix; Abstract in Korean; Acknowledgments
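
The CycleGAN setting named in this entry relies on a cycle-consistency term: speech mapped from the non-native domain to the native domain and back should return close to where it started. The sketch below shows that term in its standard L1 form on toy vectors. The generators `g`, `f`, and `f_bad` are hypothetical stand-ins (simple shifts), not the thesis's networks, and the "frames" are invented numbers rather than spectrogram data.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x_batch, y_batch, g, f):
    """Standard CycleGAN cycle term: |f(g(x)) - x| + |g(f(y)) - y|.

    g maps domain X (non-native) to domain Y (native); f maps Y back to X.
    """
    forward = sum(l1_distance(f(g(x)), x) for x in x_batch) / len(x_batch)
    backward = sum(l1_distance(g(f(y)), y) for y in y_batch) / len(y_batch)
    return forward + backward


def g(frame):
    """Toy 'non-native to native' generator: shift every value up by 1."""
    return [v + 1.0 for v in frame]


def f(frame):
    """Toy inverse generator: shift every value down by 1."""
    return [v - 1.0 for v in frame]


def f_bad(frame):
    """Toy imperfect inverse: only undoes half of the shift."""
    return [v - 0.5 for v in frame]


nonnative = [[0.0, 1.0], [2.0, 3.0]]  # invented frames, domain X
native = [[1.0, 2.0], [3.0, 4.0]]     # invented frames, domain Y

loss_perfect = cycle_consistency_loss(nonnative, native, g, f)
loss_imperfect = cycle_consistency_loss(nonnative, native, g, f_bad)
```

When the two mappings invert each other exactly the term is zero, and it grows as the round trip drifts away from the input. In training, this term is added to the adversarial losses so that the generator cannot change the utterance arbitrarily while making it sound native.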

    Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech

    Detecting individual pronunciation errors and diagnosing pronunciation error tendencies in a language learner based on their speech are important components of computer-aided language learning (CALL). The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous, especially given the inconsistency of human annotation of pronunciation errors. This paper presents an approach to these tasks that distinguishes between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion, and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and for diagnosing accent error tendencies at the speaker level.