305 research outputs found

    A computational model for studying L1's effect on L2 speech learning

    Get PDF
    abstract: Much evidence has shown that the first language (L1) plays an important role in the formation of the L2 phonological system during second language (L2) learning. Combined with the fact that different L1s have distinct phonological patterns, this suggests diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are correlated with perceived accentedness, and that the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between L1s and the L2 will manifest themselves in the accented speech produced by speakers from these L1s. To test these hypotheses, this study proposes a computational model to analyze accented speech properties in both the segmental (short-term speech measurements at the short-segment or phoneme level) and suprasegmental (long-term speech measurements at the word, long-segment, or sentence level) feature spaces. The benefit of using a computational model is that it enables quantitative analysis of the L1's effect on accent in terms of different phonological properties. The core parts of this computational model are feature extraction schemes that extract pronunciation and prosody representations of accented speech based on existing techniques in the speech processing field. Correlation analysis on both the segmental and suprasegmental feature spaces is conducted to examine the relationship between acoustic measurements related to L1s and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly in the segmental and suprasegmental feature spaces.
Results reveal the potential of the methodology to provide quantitative analysis of accented speech and to extend current studies in L2 speech learning theory to a large scale. Practically, this study further shows that the proposed computational model can benefit automatic accentedness evaluation systems by adding features related to speakers' L1s. Doctoral Dissertation, Speech and Hearing Science 201
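The correlation analysis between L1-related distance measures and perceived accentedness can be sketched in miniature. The per-speaker values below are invented for illustration; the dissertation's actual features and ratings are not reproduced here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical per-speaker measurements: distance of the accented speech
# from the speaker's own L1 in some suprasegmental feature space, and
# listener-rated accentedness (higher = stronger perceived accent).
l1_distance  = [0.82, 0.74, 0.55, 0.48, 0.31, 0.20]
accentedness = [2.1,  2.6,  3.4,  3.9,  4.6,  5.2]

r = pearson_r(l1_distance, accentedness)  # negative, matching the hypothesis
```

With real data, the same coefficient would be computed per phonological property, and the regression step would stack several such distance features as predictors of the accentedness rating.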

    Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

    Full text link
    The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods relying on analysing phonemes can only detect categorical errors of phonemes that have an adequate amount of training data to be modelled. Given the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is infeasible to model all types of mispronunciations. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback for the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as the core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages. The proposed speech attribute MDD method was further compared to traditional phoneme-level MDD and achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes compared to the phoneme-level equivalent.
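The core idea of attribute-level modelling — decomposing each phoneme into articulatory attributes so that one detector head can be trained per attribute group — can be sketched as follows. The attribute inventory and values here are illustrative, not the paper's exact feature set:

```python
# Hypothetical mapping from a few English phonemes to articulatory
# speech attributes (illustrative inventory, not the paper's).
PHONE_ATTRIBUTES = {
    "p": {"voicing": "voiceless", "manner": "stop",      "place": "bilabial"},
    "b": {"voicing": "voiced",    "manner": "stop",      "place": "bilabial"},
    "s": {"voicing": "voiceless", "manner": "fricative", "place": "alveolar"},
    "z": {"voicing": "voiced",    "manner": "fricative", "place": "alveolar"},
}

def attribute_targets(phone_seq):
    """Turn a phoneme sequence into one label sequence per attribute group;
    a multi-label CTC setup would train one output head per group on these."""
    groups = ["voicing", "manner", "place"]
    return {g: [PHONE_ATTRIBUTES[p][g] for p in phone_seq] for g in groups}

targets = attribute_targets(["b", "s"])
# targets["voicing"] is ["voiced", "voiceless"]
```

Because the groups are non-mutually exclusive views of the same phone sequence, each head can flag an error in its own dimension (e.g. a voicing error with correct place), which is the source of the finer-grained diagnostic feedback.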

    Automatic Pronunciation Assessment -- A Review

    Full text link
    Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.Comment: 9 pages, accepted to EMNLP Findings

    ์ž๋™๋ฐœ์Œํ‰๊ฐ€-๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ํ†ตํ•ฉ ๋ชจ๋ธ

    Get PDF
    Master's thesis, Seoul National University Graduate School: Department of Linguistics, College of Humanities, August 2023. Advisor: Minhwa Chung. Empirical studies report a strong correlation between pronunciation scores and mispronunciations in non-native speech assessments by human evaluators. However, existing computer-assisted pronunciation training (CAPT) systems regard automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as independent tasks and focus on improving each model's performance individually. Motivated by the correlation between the two tasks, this study proposes a novel architecture that jointly tackles APA and MDD with a multi-task learning scheme to benefit both tasks. Specifically, the APA loss is examined between cross-entropy and root mean square error (RMSE) criteria, while the MDD loss is fixed to the Connectionist Temporal Classification (CTC) criterion. For the backbone acoustic model, a pre-trained self-supervised model is used, with auxiliary fine-tuning on phone recognition before multi-task learning to leverage extra knowledge transfer. The Goodness-of-Pronunciation (GOP) measure is given as an additional input along with the acoustic model.
The joint model significantly outperformed its single-task learning counterparts, with a mean increase of 0.041 in Pearson correlation coefficient (PCC) across four multi-aspect scores for the APA task and a 0.003 F1 increase for the MDD task on the Speechocean762 dataset. Among the joint model architectures tried, multi-task learning with RMSE and CTC criteria using the Robust Wav2vec2.0 acoustic model and the GOP measure achieved the best performance. Analysis indicates that the joint model learned to distinguish scores with low distribution, and to better recognize mispronunciations as mispronunciations, compared to single-task learning models. Interestingly, the degree of performance increase in each subtask of the joint model was proportional to the strength of the correlation between the respective pronunciation score and mispronunciation labels, and the correlation between the model's predictions also increased as the joint model achieved higher performance. The findings reveal that the joint model leveraged the linguistic correlation between pronunciation scores and mispronunciations to improve performance on the APA and MDD tasks, and to exhibit behavior that follows the assessments of human experts.
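The multi-task objective described above — an RMSE regression loss for APA combined with a CTC loss for MDD — reduces to a weighted sum. A minimal sketch, with the CTC term passed in as a precomputed scalar and the weights as illustrative assumptions:

```python
import numpy as np

def rmse(pred, target):
    """Root mean square error between predicted and reference scores."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def joint_loss(score_pred, score_target, ctc_loss, w_apa=1.0, w_mdd=1.0):
    """Weighted sum of the APA regression loss and the MDD CTC loss.
    `ctc_loss` is assumed to come from a separate CTC computation."""
    return w_apa * rmse(score_pred, score_target) + w_mdd * ctc_loss

# Hypothetical utterance-level scores (0-10 scale) and a CTC loss value.
loss = joint_loss([7.0, 8.5], [8.0, 8.5], ctc_loss=0.30)
```

In the actual training setup both terms would be differentiable and backpropagated through the shared acoustic backbone; the sketch only shows how the two criteria combine into one objective.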

    Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring

    Full text link
    Recent studies on pronunciation scoring have explored the effect of introducing phone embeddings as reference pronunciation, but mostly in an implicit manner, i.e., addition or concatenation of the reference phone embedding and the actual pronunciation of the target phone as the phone-level pronunciation quality representation. In this paper, we propose to use linguistic-acoustic similarity to explicitly measure the deviation of non-native production from its native reference for pronunciation assessment. Specifically, the deviation is first estimated by the cosine similarity between the reference phone embedding and the corresponding acoustic embedding. Next, a phone-level Goodness of Pronunciation (GOP) pre-training stage is introduced to guide this similarity-based learning for better initialization of the two embeddings. Finally, a transformer-based hierarchical pronunciation scorer maps the sequence of phone embeddings and acoustic embeddings, along with their similarity measures, to the final utterance-level score. Experimental results on non-native databases suggest that the proposed system significantly outperforms baselines in which the acoustic and phone embeddings are simply added or concatenated. A further examination shows that the phone embeddings learned in the proposed approach are able to capture linguistic-acoustic attributes of native pronunciation as reference.Comment: Accepted by ICASSP 202
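The explicit deviation measure at the heart of this approach is a cosine similarity between a reference phone embedding and the corresponding acoustic embedding. A minimal sketch with invented 3-dimensional embeddings (real systems would use learned, much higher-dimensional vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: the reference phone embedding and acoustic
# embeddings from a native-like and an accented realization of that phone.
ref_phone  = np.array([0.9, 0.1, 0.0])
native_act = np.array([0.8, 0.2, 0.1])   # close to the reference
accented   = np.array([0.1, 0.9, 0.3])   # far from the reference

sim_native = cosine_similarity(ref_phone, native_act)
sim_accent = cosine_similarity(ref_phone, accented)
```

A lower similarity signals a larger deviation from the native reference; the hierarchical scorer then consumes these per-phone similarities alongside the embeddings themselves.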

    Methods for pronunciation assessment in computer aided language learning

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 149-176). Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge including words, grammar, gestures, sounds, etc. Mastering these skills requires extensive practice by the learner, and opportunities may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced wherever and whenever a student desires. These systems often include several technologies to identify the different types of errors made by a student. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system, it is a task- or game-oriented activity, the student should not be interrupted by the pronunciation feedback system, and the goal of the feedback system is to identify severe mispronunciations with high reliability. Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert phonetician to both phonetically transcribe and assign judgements of quality to each phone in a corpus. This is time consuming and expensive, and it places an extra burden on the transcriber. We describe a novel method for obtaining phone-level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word-level judgements of pronunciation. Foreign language learners typically exhibit high variation and pronunciation patterns distinct from those of native speakers, which make analysis for mispronunciation difficult.
We detail a simple but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. The transformed vowel space is then exploited to train a mispronunciation detector using a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in more robust and accurate identification of mispronunciations than traditional acoustic models. by Mitchell A. Peabody. Ph.D
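As a rough stand-in for the vowel-space transformation (the thesis's actual method is not reproduced here), a per-speaker normalization that maps formant measurements onto reference native-speaker statistics might look like the following; the formant values and reference statistics are invented:

```python
import numpy as np

def normalize_vowel_space(formants, ref_mean, ref_std):
    """Map a speaker's formant measurements into a reference (native)
    vowel space by z-scoring against the speaker's own statistics and
    re-scaling with the reference statistics. A simple illustrative
    stand-in, assuming per-speaker mean/variance shifts only."""
    f = np.asarray(formants, float)
    z = (f - f.mean(axis=0)) / f.std(axis=0)
    return z * ref_std + ref_mean

# Hypothetical F1/F2 values (Hz) for one non-native speaker's vowels,
# and reference statistics from a native-speaker population.
speaker  = np.array([[350.0, 2300.0], [600.0, 1200.0], [500.0, 1700.0]])
ref_mean = np.array([480.0, 1750.0])
ref_std  = np.array([110.0, 450.0])

mapped = normalize_vowel_space(speaker, ref_mean, ref_std)
```

After such a mapping, the speaker's vowel tokens share first- and second-moment statistics with the reference space, so deviations of individual tokens are more comparable across speakers.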

    Artificial Neural Network (ANN) in a Small Dataset to determine Neutrality in the Pronunciation of English as a Foreign Language in Filipino Call Center Agents

    Get PDF
    Artificial Neural Networks (ANNs) have continued to be efficient models for solving classification problems. In this paper, we explore the use of an ANN with a small dataset to accurately classify whether Filipino call center agents' pronunciations are neutral or not based on their employer's standards. Isolated utterances of the ten most commonly used words in the call center were recorded from eleven agents, creating a dataset of 110 utterances. Two learning specialists were consulted to establish ground truths, and Cohen's kappa was computed as 0.82, validating the reliability of the dataset. The first thirteen Mel-Frequency Cepstral Coefficients (MFCCs) were then extracted from each word, and an ANN was trained with ten-fold stratified cross-validation. Experimental results on the model recorded a classification accuracy of 89.60%, supported by an overall F-score of 0.92.
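The inter-rater reliability check described above uses Cohen's kappa. A minimal implementation, with invented binary judgements from the two specialists (1 = neutral, 0 = non-neutral):

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    labels = np.union1d(a, b)
    p_obs = float(np.mean(a == b))
    p_exp = float(sum(np.mean(a == l) * np.mean(b == l) for l in labels))
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical neutral(1)/non-neutral(0) judgements from two specialists.
rater_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
```

Values above roughly 0.8, like the paper's 0.82, are conventionally read as "almost perfect" agreement, which is what justifies using the specialists' labels as ground truth.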

    MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH

    Get PDF
    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features both in revealing the significant contributors to mispronunciation and in improving the performance of MDD systems.
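The reported detection accuracy and false rejection rate can be computed from mispronunciation confusion counts. One common convention (definitions vary across MDD papers) is sketched below with invented counts, where "positive" means a mispronounced phone:

```python
def mdd_metrics(tp, fp, tn, fn):
    """Detection metrics from mispronunciation confusion counts.
    FAR: mispronunciations accepted as correct (misses);
    FRR: correct phones rejected as mispronounced (false alarms)."""
    far = fn / (tp + fn)                       # missed mispronunciations
    frr = fp / (tn + fp)                       # wrongly flagged phones
    acc = (tp + tn) / (tp + fp + tn + fn)      # overall detection accuracy
    return {"FAR": far, "FRR": frr, "detection_accuracy": acc}

# Hypothetical phone-level counts from an evaluation run.
m = mdd_metrics(tp=400, fp=120, tn=580, fn=100)
```

Diagnostic accuracy is then computed only over the correctly detected mispronunciations, as the fraction whose predicted error type matches the expert transcript.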