
    L2-ARCTIC: A Non-Native English Speech Corpus

    In this paper, we introduce L2-ARCTIC, a speech corpus of non-native English intended for research in voice conversion, accent conversion, and mispronunciation detection. This initial release includes recordings from ten non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, with one male and one female speaker per L1. Each speaker recorded approximately one hour of read speech from the Carnegie Mellon University ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation errors: substitutions, deletions, and additions, making the corpus a valuable resource not only for voice conversion and accent conversion research but also for computer-assisted pronunciation training. The corpus is publicly accessible at https://psi.engr.tamu.edu/l2-arctic-corpus/
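    The manual annotations are what make the corpus useful for mispronunciation detection work. Below is a minimal sketch of how one might tally the three error types from a single annotation file, assuming the annotations are Praat TextGrids in which a mispronounced phone carries a comma-separated "canonical,perceived,type" label (type being s, d, or a); the `textgrid` package, tier name, and file path are illustrative assumptions, not specifics from the paper.

```python
# Hedged sketch: tally substitution/deletion/addition annotations from an
# L2-ARCTIC-style Praat TextGrid. Assumes the pip package `textgrid` and
# that annotated phones are labeled "canonical,perceived,type".
from collections import Counter

import textgrid  # pip install textgrid

def count_errors(tg_path, tier_name="phones"):
    tg = textgrid.TextGrid.fromFile(tg_path)
    tier = next(t for t in tg.tiers if t.name == tier_name)
    counts = Counter()
    for interval in tier:
        parts = interval.mark.split(",")
        if len(parts) == 3:  # annotated mispronunciation
            canonical, perceived, err_type = (p.strip() for p in parts)
            counts[err_type] += 1
    return counts

# Hypothetical usage:
# count_errors("speaker/annotation/arctic_a0001.TextGrid")
# -> Counter({'s': 3, 'd': 1})
```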

    Microsoft Reading Progress as a CAPT Tool

    The paper explores the accuracy of feedback provided to non-native learners of English by the pronunciation module included in Microsoft Reading Progress. We compared the pronunciation assessment offered by Reading Progress against that of two university pronunciation teachers. Recordings from students of English who aim for native-like pronunciation were assessed independently by Reading Progress and the human raters, and the output was standardized as negative binary feedback assigned to orthographic words, matching the Microsoft format. Our results indicate that Reading Progress is not yet ready to be used as a CAPT tool. Inter-rater reliability analysis showed a moderate level of agreement across all raters, rising to a good level of agreement once feedback from Reading Progress was eliminated. Meanwhile, qualitative analysis revealed certain problems, notably false positives, i.e., words pronounced within the boundaries of academic pronunciation standards but still marked as incorrect by the digital rater. We recommend that EFL teachers and researchers approach the current version of Reading Progress with caution, especially as regards automated feedback; however, its design may still be useful for manual feedback. Given Microsoft's declarations that Reading Progress will be developed to include more accents, it has the potential to evolve into a fully functional CAPT tool for EFL pedagogy and research.
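    Because every rater's output was standardized to negative binary feedback on orthographic words, agreement can be quantified directly on aligned 0/1 vectors. The sketch below computes pairwise Cohen's kappa for that setup; the data and the choice of pairwise kappa are illustrative assumptions, not the paper's exact statistical procedure.

```python
# Illustrative sketch: pairwise Cohen's kappa between raters who each give
# binary feedback (1 = word flagged as mispronounced) on the same words.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# Hypothetical aligned judgments over the same eight words.
ratings = {
    "reading_progress": [1, 0, 0, 1, 0, 1, 1, 0],
    "teacher_1":        [0, 0, 0, 1, 0, 1, 0, 0],
    "teacher_2":        [0, 0, 1, 1, 0, 1, 0, 0],
}

for a, b in combinations(ratings, 2):
    kappa = cohen_kappa_score(ratings[a], ratings[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```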

    ์ž๋™๋ฐœ์Œํ‰๊ฐ€-๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ํ†ตํ•ฉ ๋ชจ๋ธ

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์ธ๋ฌธ๋Œ€ํ•™ ์–ธ์–ดํ•™๊ณผ, 2023. 8. ์ •๋ฏผํ™”.์‹ค์ฆ ์—ฐ๊ตฌ์— ์˜ํ•˜๋ฉด ๋น„์›์–ด๋ฏผ ๋ฐœ์Œ ํ‰๊ฐ€์— ์žˆ์–ด ์ „๋ฌธ ํ‰๊ฐ€์ž๊ฐ€ ์ฑ„์ ํ•˜๋Š” ๋ฐœ์Œ ์ ์ˆ˜์™€ ์Œ์†Œ ์˜ค๋ฅ˜ ์‚ฌ์ด์˜ ์ƒ๊ด€๊ด€๊ณ„๋Š” ๋งค์šฐ ๋†’๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด์˜ ์ปดํ“จํ„ฐ๊ธฐ๋ฐ˜๋ฐœ์Œํ›ˆ๋ จ (Computer-assisted Pronunciation Training; CAPT) ์‹œ์Šคํ…œ์€ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ (Automatic Pronunciation Assessment; APA) ๊ณผ์ œ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ (Mispronunciation Detection and Diagnosis; MDD) ๊ณผ์ œ๋ฅผ ๋…๋ฆฝ์ ์ธ ๊ณผ์ œ๋กœ ์ทจ๊ธ‰ํ•˜๋ฉฐ ๊ฐ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์—๋งŒ ์ดˆ์ ์„ ๋‘์—ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋‘ ๊ณผ์ œ ์‚ฌ์ด์˜ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„์— ์ฃผ๋ชฉ, ๋‹ค์ค‘์ž‘์—…ํ•™์Šต ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ์ž๋™๋ฐœ์Œํ‰๊ฐ€์™€ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๊ณผ์ œ๋ฅผ ๋™์‹œ์— ํ›ˆ๋ จํ•˜๋Š” ์ƒˆ๋กœ์šด ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” APA ๊ณผ์ œ๋ฅผ ์œ„ํ•ด ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹คํ•จ์ˆ˜ ๋ฐ RMSE ์†์‹คํ•จ์ˆ˜๋ฅผ ์‹คํ—˜ํ•˜๋ฉฐ, MDD ์†์‹คํ•จ์ˆ˜๋Š” CTC ์†์‹คํ•จ์ˆ˜๋กœ ๊ณ ์ •๋œ๋‹ค. ๊ทผ๊ฐ„ ์Œํ–ฅ ๋ชจ๋ธ์€ ์‚ฌ์ „ํ›ˆ๋ จ๋œ ์ž๊ธฐ์ง€๋„ํ•™์Šต๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋กœ ํ•˜๋ฉฐ, ์ด๋•Œ ๋”์šฑ ํ’๋ถ€ํ•œ ์Œํ–ฅ ์ •๋ณด๋ฅผ ์œ„ํ•ด ๋‹ค์ค‘์ž‘์—…ํ•™์Šต์„ ๊ฑฐ์น˜๊ธฐ ์ „์— ๋ถ€์ˆ˜์ ์œผ๋กœ ์Œ์†Œ์ธ์‹์— ๋Œ€ํ•˜์—ฌ ๋ฏธ์„ธ์กฐ์ •๋˜๊ธฐ๋„ ํ•œ๋‹ค. ์Œํ–ฅ ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ๋ฐœ์Œ์ ํ•ฉ์ ์ˆ˜(Goodness-of-Pronunciation; GOP)๊ฐ€ ์ถ”๊ฐ€์ ์ธ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉ๋œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ํ†ตํ•ฉ ๋ชจ๋ธ์ด ๋‹จ์ผ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๋ชจ๋ธ๋ณด๋‹ค ๋งค์šฐ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” Speechocean762 ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ ๊ณผ์ œ์— ์‚ฌ์šฉ๋œ ๋„ค ํ•ญ๋ชฉ์˜ ์ ์ˆ˜๋“ค์˜ ํ‰๊ท  ํ”ผ์–ด์Šจ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ 0.041 ์ฆ๊ฐ€ํ•˜์˜€์œผ๋ฉฐ, ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๊ณผ์ œ์— ๋Œ€ํ•ด F1 ์ ์ˆ˜๊ฐ€ 0.003 ์ฆ๊ฐ€ํ•˜์˜€๋‹ค. ํ†ตํ•ฉ ๋ชจ๋ธ์— ๋Œ€ํ•ด ์‹œ๋„๋œ ์•„ํ‚คํ…์ฒ˜ ์ค‘์—์„œ๋Š”, Robust Wav2vec2.0 ์Œํ–ฅ๋ชจ๋ธ๊ณผ ๋ฐœ์Œ์ ํ•ฉ์ ์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ RMSE/CTC ์†์‹คํ•จ์ˆ˜๋กœ ํ›ˆ๋ จํ•œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ์ข‹์•˜๋‹ค. ๋ชจ๋ธ์„ ๋ถ„์„ํ•œ ๊ฒฐ๊ณผ, ํ†ตํ•ฉ ๋ชจ๋ธ์ด ๊ฐœ๋ณ„ ๋ชจ๋ธ์— ๋น„ํ•ด ๋ถ„ํฌ๊ฐ€ ๋‚ฎ์€ ์ ์ˆ˜ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๋ฅผ ๋” ์ •ํ™•ํ•˜๊ฒŒ ๊ตฌ๋ถ„ํ•˜์˜€์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ํฅ๋ฏธ๋กญ๊ฒŒ๋„ ํ†ตํ•ฉ ๋ชจ๋ธ์— ์žˆ์–ด ๊ฐ ํ•˜์œ„ ๊ณผ์ œ๋“ค์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ ์ •๋„๋Š” ๊ฐ ๋ฐœ์Œ ์ ์ˆ˜์™€ ๋ฐœ์Œ ์˜ค๋ฅ˜ ๋ ˆ์ด๋ธ” ์‚ฌ์ด์˜ ์ƒ๊ด€๊ณ„์ˆ˜ ํฌ๊ธฐ์— ๋น„๋ก€ํ•˜์˜€๋‹ค. ๋˜ ํ†ตํ•ฉ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋ ์ˆ˜๋ก ๋ชจ๋ธ์˜ ์˜ˆ์ธก ๋ฐœ์Œ์ ์ˆ˜, ๊ทธ๋ฆฌ๊ณ  ๋ชจ๋ธ์˜ ์˜ˆ์ธก ๋ฐœ์Œ์˜ค๋ฅ˜์— ๋Œ€ํ•œ ์ƒ๊ด€์„ฑ์ด ๋†’์•„์กŒ๋‹ค. ๋ณธ ์—ฐ๊ตฌ ๊ฒฐ๊ณผ๋Š” ํ†ตํ•ฉ ๋ชจ๋ธ์ด ๋ฐœ์Œ ์ ์ˆ˜ ๋ฐ ์Œ์†Œ ์˜ค๋ฅ˜ ์‚ฌ์ด์˜ ์–ธ์–ดํ•™์  ์ƒ๊ด€์„ฑ์„ ํ™œ์šฉํ•˜์—ฌ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๊ณผ์ œ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œ์ผฐ์œผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ ํ†ตํ•ฉ ๋ชจ๋ธ์ด ์ „๋ฌธ ํ‰๊ฐ€์ž๋“ค์˜ ์‹ค์ œ ๋น„์›์–ด๋ฏผ ํ‰๊ฐ€์™€ ๋น„์Šทํ•œ ์–‘์ƒ์„ ๋ค๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.Empirical studies report a strong correlation between pronunciation scores and mispronunciations in non-native speech assessments of human evaluators. However, the existing system of computer-assisted pronunciation training (CAPT) regards automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as independent and focuses on individual performance improvement. Motivated by the correlation between two tasks, this study proposes a novel architecture that jointly tackles APA and MDD with a multi-task learning scheme to benefit both tasks. 
Specifically, APA loss is examined between cross-entropy and root mean square error (RMSE) criteria, and MDD loss is fixed to Connectionist Temporal Classification (CTC) criteria. For the backbone acoustic model, self-supervised model is used with an auxiliary fine-tuning on phone recognition before multi-task learning to leverage extra knowledge transfer. Goodness-of-Pronunciation (GOP) measure is given as an additional input along with the acoustic model. The joint model significantly outperformed single-task learning counterparts, with a mean of 0.041 PCC increase for APA task on four multi-aspect scores and 0.003 F1 increase for MDD task on Speechocean762 dataset. For the joint model architecture, multi-task learning with RMSE and CTC criteria with raw Robust Wav2vec2.0 and GOP measure achieved the best performance. Analysis indicates that the joint model learned to distinguish scores with low distribution, and to better recognize mispronunciations as mispronunciations compared to single-task learning models. Interestingly, the degree of the performance increase in each subtask for the joint model was proportional to the strength of the correlation between respective pronunciation score and mispronunciation labels, and the strength of the correlation between the model predictions also increased as the joint model achieved higher performances. The findings reveal that the joint model leveraged the linguistic correlation between pronunciation scores and mispronunciations to improve performances for APA and MDD tasks, and to show behaviors that follow the assessments of human experts.Chapter 1, Introduction 1 Chapter 2. Related work 5 Chapter 3. Methodology 17 Chapter 4. Results 28 Chapter 5. Discussion 47 Chapter 6. Conclusion 52 References 53 Appendix 60 ๊ตญ๋ฌธ ์ดˆ๋ก 65์„
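    Since the thesis's contribution is the shared training objective, a minimal PyTorch sketch of that multi-task setup may help: a shared self-supervised encoder (e.g., wav2vec 2.0) whose frame features, concatenated with GOP features, feed an utterance-level score head trained with RMSE for APA and a frame-level phone head trained with CTC for MDD. The dimensions, head designs, and loss weighting below are placeholder assumptions, not the thesis's exact architecture.

```python
# Hedged sketch of a joint APA+MDD objective: RMSE on utterance-level
# aspect scores plus CTC on frame-level phone logits, over a shared
# self-supervised encoder. All shapes and weights are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAPAMDD(nn.Module):
    def __init__(self, encoder, feat_dim=768, gop_dim=84, n_phones=42):
        super().__init__()
        self.encoder = encoder  # e.g., a pretrained wav2vec 2.0 model
        self.score_head = nn.Linear(feat_dim + gop_dim, 4)  # 4 aspect scores
        self.phone_head = nn.Linear(feat_dim + gop_dim, n_phones + 1)  # +1 CTC blank

    def forward(self, wav, gop):
        h = self.encoder(wav)                    # (B, T, feat_dim) frame features
        h = torch.cat([h, gop], dim=-1)          # append frame-level GOP features
        scores = self.score_head(h.mean(dim=1))  # utterance-level APA scores
        logits = self.phone_head(h)              # frame-level phone logits for MDD
        return scores, logits

def joint_loss(scores, logits, target_scores, phones, in_lens, tgt_lens, alpha=0.5):
    apa = torch.sqrt(F.mse_loss(scores, target_scores))     # RMSE criterion (APA)
    log_probs = logits.log_softmax(-1).transpose(0, 1)      # CTC expects (T, B, C)
    mdd = F.ctc_loss(log_probs, phones, in_lens, tgt_lens)  # CTC criterion (MDD)
    return alpha * apa + (1 - alpha) * mdd
```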

    Mispronunciation Detection and Diagnosis in Mandarin-Accented English Speech

    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of articulatory features both in revealing the significant contributors to mispronunciation and in improving the performance of MDD systems.
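    The three reported figures follow the usual hierarchical MDD evaluation, which can be computed from per-phone decisions. The sketch below derives detection accuracy, diagnostic accuracy, and false rejection rate from (canonical, perceived, recognized) phone triples; the triple format is an assumption for illustration, not the paper's exact bookkeeping.

```python
# Illustrative computation of standard MDD metrics from per-phone triples
# (canonical, perceived, recognized). Assumed format, for illustration only.
def mdd_metrics(triples):
    ta = tr = fa = fr = correct_diag = 0
    for canonical, perceived, recognized in triples:
        mispronounced = perceived != canonical   # ground truth
        flagged = recognized != canonical        # system decision
        if mispronounced and flagged:
            tr += 1                                  # true rejection
            correct_diag += recognized == perceived  # diagnosis also correct
        elif mispronounced:
            fa += 1                                  # false acceptance (miss)
        elif flagged:
            fr += 1                                  # false rejection
        else:
            ta += 1                                  # true acceptance
    total = ta + tr + fa + fr
    return {
        "detection_accuracy": (ta + tr) / total,
        "diagnostic_accuracy": correct_diag / tr if tr else 0.0,
        "false_rejection_rate": fr / (ta + fr) if (ta + fr) else 0.0,
    }

# mdd_metrics([("ae", "ae", "ae"), ("ae", "eh", "eh"), ("th", "s", "f")])
# -> detection_accuracy 1.0, diagnostic_accuracy 0.5, false_rejection_rate 0.0
```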

    Towards Automatic Speech-Language Assessment for Aphasia Rehabilitation

    Speech-based technology has the potential to reinforce traditional aphasia therapy through the development of automatic speech-language assessment systems. Such systems can provide clinicians with supplementary information to assist with progress monitoring and treatment planning, and can provide support for on-demand auxiliary treatment. However, current technology cannot support this type of application due to the difficulties associated with aphasic speech processing. The focus of this dissertation is on the development of computational methods that can accurately assess aphasic speech across a range of clinically-relevant dimensions. The first part of the dissertation focuses on novel techniques for assessing aphasic speech intelligibility in constrained contexts. The second part investigates acoustic modeling methods that lead to significant improvement in aphasic speech recognition and allow the system to work with unconstrained speech samples. The final part demonstrates the efficacy of speech recognition-based analysis in automatic paraphasia detection, extraction of clinically-motivated quantitative measures, and estimation of aphasia severity. The methods and results presented in this work will enable robust technologies for accurately recognizing and assessing aphasic speech, and will provide insights into the link between computational methods and clinical understanding of aphasia.
    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/140840/1/ducle_1.pd

    Apraxia World: Deploying a Mobile Game and Automatic Speech Recognition for Independent Child Speech Therapy

    Children with speech sound disorders typically improve pronunciation quality by undergoing speech therapy, which must be delivered frequently and with high intensity to be effective. As such, clinic sessions are supplemented with home practice, often under caregiver supervision. However, traditional home practice can grow boring for children due to monotony. Furthermore, practice frequency is limited by caregiver availability, making it difficult for some children to reach therapy dosage. To address these issues, this dissertation presents a novel speech therapy game to increase engagement, and explores automatic pronunciation evaluation techniques to afford children independent practice. The therapy game, called Apraxia World, delivers customizable, repetition-based speech therapy while children play through platformer-style levels using typical on-screen tablet controls; children complete in-game speech exercises to collect assets required to progress through the levels. Additionally, Apraxia World provides pronunciation feedback according to an automated pronunciation evaluation system running locally on the tablet. Apraxia World offers two advantages over current commercial and research speech therapy games: first, the game provides extended gameplay to support long therapy treatments; second, it affords some therapy practice independence via automatic pronunciation evaluation, allowing caregivers to lightly supervise instead of directly administer the practice. Pilot testing indicated that children enjoyed the game-based therapy much more than traditional practice and that the exercises did not interfere with gameplay. During a longitudinal study, children made clinically-significant pronunciation improvements while playing Apraxia World at home. Furthermore, children remained engaged in the game-based therapy over the two-month testing period and some even wanted to continue playing post-study. The second part of the dissertation explores word- and phoneme-level pronunciation verification for child speech therapy applications. Word-level pronunciation verification is accomplished using a child-specific template-matching framework, where an utterance is compared against correctly and incorrectly pronounced examples of the word; this framework identified mispronounced words better than both a standard automated baseline and co-located caregivers. Phoneme-level mispronunciation detection is investigated using a technique from the second-language learning literature: training phoneme-specific classifiers with phonetic posterior features. This method also outperformed the standard baseline and, more significantly, identified mispronunciations better than student clinicians.
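    The word-level verifier scores a child's utterance by how closely it matches stored correct and incorrect productions of the same word. A compact sketch of that template-matching idea, using MFCCs and dynamic time warping, is below; the librosa calls and the nearest-template decision rule are illustrative assumptions, not the dissertation's exact system.

```python
# Minimal template-matching sketch: classify a word utterance as correct or
# mispronounced by its DTW distance to labeled example recordings.
# librosa usage and the decision rule are illustrative assumptions.
import librosa

def mfcc(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def dtw_cost(a, b):
    # Length-normalized cumulative cost of the optimal alignment path.
    D, wp = librosa.sequence.dtw(X=a, Y=b, metric="euclidean")
    return D[-1, -1] / len(wp)

def verify(utt_path, correct_paths, incorrect_paths):
    utt = mfcc(utt_path)
    d_correct = min(dtw_cost(utt, mfcc(p)) for p in correct_paths)
    d_incorrect = min(dtw_cost(utt, mfcc(p)) for p in incorrect_paths)
    return "correct" if d_correct < d_incorrect else "mispronounced"

# Hypothetical usage:
# verify("kid_rabbit.wav", ["rabbit_ok_1.wav"], ["rabbit_bad_1.wav"])
```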