457 research outputs found

    ์ž๋™๋ฐœ์Œํ‰๊ฐ€-๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ํ†ตํ•ฉ ๋ชจ๋ธ

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์ธ๋ฌธ๋Œ€ํ•™ ์–ธ์–ดํ•™๊ณผ, 2023. 8. ์ •๋ฏผํ™”.์‹ค์ฆ ์—ฐ๊ตฌ์— ์˜ํ•˜๋ฉด ๋น„์›์–ด๋ฏผ ๋ฐœ์Œ ํ‰๊ฐ€์— ์žˆ์–ด ์ „๋ฌธ ํ‰๊ฐ€์ž๊ฐ€ ์ฑ„์ ํ•˜๋Š” ๋ฐœ์Œ ์ ์ˆ˜์™€ ์Œ์†Œ ์˜ค๋ฅ˜ ์‚ฌ์ด์˜ ์ƒ๊ด€๊ด€๊ณ„๋Š” ๋งค์šฐ ๋†’๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด์˜ ์ปดํ“จํ„ฐ๊ธฐ๋ฐ˜๋ฐœ์Œํ›ˆ๋ จ (Computer-assisted Pronunciation Training; CAPT) ์‹œ์Šคํ…œ์€ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ (Automatic Pronunciation Assessment; APA) ๊ณผ์ œ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ (Mispronunciation Detection and Diagnosis; MDD) ๊ณผ์ œ๋ฅผ ๋…๋ฆฝ์ ์ธ ๊ณผ์ œ๋กœ ์ทจ๊ธ‰ํ•˜๋ฉฐ ๊ฐ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์—๋งŒ ์ดˆ์ ์„ ๋‘์—ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋‘ ๊ณผ์ œ ์‚ฌ์ด์˜ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„์— ์ฃผ๋ชฉ, ๋‹ค์ค‘์ž‘์—…ํ•™์Šต ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ์ž๋™๋ฐœ์Œํ‰๊ฐ€์™€ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๊ณผ์ œ๋ฅผ ๋™์‹œ์— ํ›ˆ๋ จํ•˜๋Š” ์ƒˆ๋กœ์šด ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” APA ๊ณผ์ œ๋ฅผ ์œ„ํ•ด ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹คํ•จ์ˆ˜ ๋ฐ RMSE ์†์‹คํ•จ์ˆ˜๋ฅผ ์‹คํ—˜ํ•˜๋ฉฐ, MDD ์†์‹คํ•จ์ˆ˜๋Š” CTC ์†์‹คํ•จ์ˆ˜๋กœ ๊ณ ์ •๋œ๋‹ค. ๊ทผ๊ฐ„ ์Œํ–ฅ ๋ชจ๋ธ์€ ์‚ฌ์ „ํ›ˆ๋ จ๋œ ์ž๊ธฐ์ง€๋„ํ•™์Šต๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋กœ ํ•˜๋ฉฐ, ์ด๋•Œ ๋”์šฑ ํ’๋ถ€ํ•œ ์Œํ–ฅ ์ •๋ณด๋ฅผ ์œ„ํ•ด ๋‹ค์ค‘์ž‘์—…ํ•™์Šต์„ ๊ฑฐ์น˜๊ธฐ ์ „์— ๋ถ€์ˆ˜์ ์œผ๋กœ ์Œ์†Œ์ธ์‹์— ๋Œ€ํ•˜์—ฌ ๋ฏธ์„ธ์กฐ์ •๋˜๊ธฐ๋„ ํ•œ๋‹ค. ์Œํ–ฅ ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ๋ฐœ์Œ์ ํ•ฉ์ ์ˆ˜(Goodness-of-Pronunciation; GOP)๊ฐ€ ์ถ”๊ฐ€์ ์ธ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉ๋œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ํ†ตํ•ฉ ๋ชจ๋ธ์ด ๋‹จ์ผ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๋ชจ๋ธ๋ณด๋‹ค ๋งค์šฐ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” Speechocean762 ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ ๊ณผ์ œ์— ์‚ฌ์šฉ๋œ ๋„ค ํ•ญ๋ชฉ์˜ ์ ์ˆ˜๋“ค์˜ ํ‰๊ท  ํ”ผ์–ด์Šจ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ 0.041 ์ฆ๊ฐ€ํ•˜์˜€์œผ๋ฉฐ, ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๊ณผ์ œ์— ๋Œ€ํ•ด F1 ์ ์ˆ˜๊ฐ€ 0.003 ์ฆ๊ฐ€ํ•˜์˜€๋‹ค. ํ†ตํ•ฉ ๋ชจ๋ธ์— ๋Œ€ํ•ด ์‹œ๋„๋œ ์•„ํ‚คํ…์ฒ˜ ์ค‘์—์„œ๋Š”, Robust Wav2vec2.0 ์Œํ–ฅ๋ชจ๋ธ๊ณผ ๋ฐœ์Œ์ ํ•ฉ์ ์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ RMSE/CTC ์†์‹คํ•จ์ˆ˜๋กœ ํ›ˆ๋ จํ•œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ์ข‹์•˜๋‹ค. 
๋ชจ๋ธ์„ ๋ถ„์„ํ•œ ๊ฒฐ๊ณผ, ํ†ตํ•ฉ ๋ชจ๋ธ์ด ๊ฐœ๋ณ„ ๋ชจ๋ธ์— ๋น„ํ•ด ๋ถ„ํฌ๊ฐ€ ๋‚ฎ์€ ์ ์ˆ˜ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๋ฅผ ๋” ์ •ํ™•ํ•˜๊ฒŒ ๊ตฌ๋ถ„ํ•˜์˜€์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ํฅ๋ฏธ๋กญ๊ฒŒ๋„ ํ†ตํ•ฉ ๋ชจ๋ธ์— ์žˆ์–ด ๊ฐ ํ•˜์œ„ ๊ณผ์ œ๋“ค์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ ์ •๋„๋Š” ๊ฐ ๋ฐœ์Œ ์ ์ˆ˜์™€ ๋ฐœ์Œ ์˜ค๋ฅ˜ ๋ ˆ์ด๋ธ” ์‚ฌ์ด์˜ ์ƒ๊ด€๊ณ„์ˆ˜ ํฌ๊ธฐ์— ๋น„๋ก€ํ•˜์˜€๋‹ค. ๋˜ ํ†ตํ•ฉ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋ ์ˆ˜๋ก ๋ชจ๋ธ์˜ ์˜ˆ์ธก ๋ฐœ์Œ์ ์ˆ˜, ๊ทธ๋ฆฌ๊ณ  ๋ชจ๋ธ์˜ ์˜ˆ์ธก ๋ฐœ์Œ์˜ค๋ฅ˜์— ๋Œ€ํ•œ ์ƒ๊ด€์„ฑ์ด ๋†’์•„์กŒ๋‹ค. ๋ณธ ์—ฐ๊ตฌ ๊ฒฐ๊ณผ๋Š” ํ†ตํ•ฉ ๋ชจ๋ธ์ด ๋ฐœ์Œ ์ ์ˆ˜ ๋ฐ ์Œ์†Œ ์˜ค๋ฅ˜ ์‚ฌ์ด์˜ ์–ธ์–ดํ•™์  ์ƒ๊ด€์„ฑ์„ ํ™œ์šฉํ•˜์—ฌ ์ž๋™๋ฐœ์Œํ‰๊ฐ€ ๋ฐ ๋ฐœ์Œ์˜ค๋ฅ˜๊ฒ€์ถœ ๊ณผ์ œ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œ์ผฐ์œผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ ํ†ตํ•ฉ ๋ชจ๋ธ์ด ์ „๋ฌธ ํ‰๊ฐ€์ž๋“ค์˜ ์‹ค์ œ ๋น„์›์–ด๋ฏผ ํ‰๊ฐ€์™€ ๋น„์Šทํ•œ ์–‘์ƒ์„ ๋ค๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.Empirical studies report a strong correlation between pronunciation scores and mispronunciations in non-native speech assessments of human evaluators. However, the existing system of computer-assisted pronunciation training (CAPT) regards automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as independent and focuses on individual performance improvement. Motivated by the correlation between two tasks, this study proposes a novel architecture that jointly tackles APA and MDD with a multi-task learning scheme to benefit both tasks. Specifically, APA loss is examined between cross-entropy and root mean square error (RMSE) criteria, and MDD loss is fixed to Connectionist Temporal Classification (CTC) criteria. For the backbone acoustic model, self-supervised model is used with an auxiliary fine-tuning on phone recognition before multi-task learning to leverage extra knowledge transfer. Goodness-of-Pronunciation (GOP) measure is given as an additional input along with the acoustic model. 
The joint model significantly outperformed single-task learning counterparts, with a mean of 0.041 PCC increase for APA task on four multi-aspect scores and 0.003 F1 increase for MDD task on Speechocean762 dataset. For the joint model architecture, multi-task learning with RMSE and CTC criteria with raw Robust Wav2vec2.0 and GOP measure achieved the best performance. Analysis indicates that the joint model learned to distinguish scores with low distribution, and to better recognize mispronunciations as mispronunciations compared to single-task learning models. Interestingly, the degree of the performance increase in each subtask for the joint model was proportional to the strength of the correlation between respective pronunciation score and mispronunciation labels, and the strength of the correlation between the model predictions also increased as the joint model achieved higher performances. The findings reveal that the joint model leveraged the linguistic correlation between pronunciation scores and mispronunciations to improve performances for APA and MDD tasks, and to show behaviors that follow the assessments of human experts.Chapter 1, Introduction 1 Chapter 2. Related work 5 Chapter 3. Methodology 17 Chapter 4. Results 28 Chapter 5. Discussion 47 Chapter 6. Conclusion 52 References 53 Appendix 60 ๊ตญ๋ฌธ ์ดˆ๋ก 65์„
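The multi-task objective described in this abstract can be sketched as a weighted sum of the two task losses. This is a minimal sketch: the weight `alpha` and the helper names are assumptions (not from the thesis), and the MDD/CTC term is passed in as a precomputed value rather than implemented in full.

```python
import numpy as np

def rmse_loss(pred_scores, gold_scores):
    # APA branch: regression loss over multi-aspect utterance scores
    pred = np.asarray(pred_scores, dtype=float)
    gold = np.asarray(gold_scores, dtype=float)
    return float(np.sqrt(np.mean((pred - gold) ** 2)))

def joint_loss(apa_loss, mdd_ctc_loss, alpha=0.5):
    # multi-task objective: weighted sum of the APA (RMSE) and MDD (CTC) losses;
    # alpha is a hypothetical task weight for illustration only
    return alpha * apa_loss + (1.0 - alpha) * mdd_ctc_loss
```

In a real training loop both terms would be differentiable tensors; the weighted-sum structure is the standard way to combine losses in multi-task learning.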

    Self-supervised end-to-end ASR for low resource L2 Swedish

    Get PDF
    Publisher Copyright: Copyright © 2021 ISCA. Unlike traditional (hybrid) Automatic Speech Recognition (ASR), end-to-end ASR systems simplify the training procedure by directly mapping acoustic features to sequences of graphemes or characters, thereby eliminating the need for specialized acoustic, language, or pronunciation models. However, one drawback of end-to-end ASR systems is that they require more training data than conventional ASR systems to achieve a similar word error rate (WER). This makes it difficult to develop ASR systems for tasks where transcribed target data is limited, such as developing ASR for second language (L2) speakers of Swedish. Nonetheless, recent advancements in self-supervised acoustic learning, manifested in wav2vec models [1, 2, 3], leverage the available untranscribed speech data to provide compact acoustic representations that can achieve low WER when incorporated in end-to-end systems. To this end, we experiment with several monolingual and cross-lingual self-supervised acoustic models to develop an end-to-end ASR system for L2 Swedish. Even though our test set is very small, it indicates that these systems are competitive in performance with a traditional ASR pipeline. Our best model seems to reduce the WER by 7% relative to our traditional ASR baseline trained on the same target data. Peer reviewed
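End-to-end CTC systems like those described above emit frame-level character distributions that are collapsed into an output string by removing repeats and blanks. A minimal greedy CTC decoding step can be sketched as follows; the blank index and the tiny vocabulary are assumptions for illustration.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank symbol

def ctc_greedy_decode(logits, id2char):
    # logits: (time, vocab) frame-level scores from an end-to-end model
    ids = np.argmax(logits, axis=1)
    out, prev = [], BLANK
    for i in ids:
        # emit a character only when it differs from the previous frame
        # and is not the blank symbol (standard CTC collapse rule)
        if i != prev and i != BLANK:
            out.append(id2char[i])
        prev = i
    return "".join(out)
```

Beam search with a language model would normally replace this greedy step, but the collapse rule is the same.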

    Automatic Pronunciation Assessment -- A Review

    Full text link
    Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work. Comment: 9 pages, accepted to EMNLP Findings

    A Hierarchical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment

    Full text link
    Automatic Pronunciation Assessment (APA) plays a vital role in Computer-assisted Pronunciation Training (CAPT) when evaluating a second language (L2) learner's speaking proficiency. However, an apparent downside of most de facto methods is that they parallelize the modeling process throughout different speech granularities without accounting for the hierarchical and local contextual relationships among them. In light of this, a novel hierarchical approach is proposed in this paper for multi-aspect and multi-granular APA. Specifically, we first introduce the notion of sup-phonemes to explore more subtle semantic traits of L2 speakers. Second, a depth-wise separable convolution layer is exploited to better encapsulate the local context cues at the sub-word level. Finally, we use a score-restraint attention pooling mechanism to predict the sentence-level scores and optimize the component models with a multitask learning (MTL) framework. Extensive experiments carried out on a publicly-available benchmark dataset, viz. speechocean762, demonstrate the efficacy of our approach in relation to some cutting-edge baselines.Comment: Accepted to Interspeech 202
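The attention pooling step that maps frame-level representations to a sentence-level score can be sketched, in heavily simplified form, as a learned weighted average over time. The weight vectors here are hypothetical stand-ins for trained parameters, and the score-restraint mechanism from the paper is omitted.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool_score(frame_feats, w_att, w_score):
    # frame_feats: (T, D) encoder outputs; w_att, w_score: (D,) learned vectors
    weights = softmax(frame_feats @ w_att)   # one attention weight per frame
    pooled = weights @ frame_feats           # weighted sum over time -> (D,)
    return float(pooled @ w_score)           # scalar sentence-level score
```

The same pooled vector can feed several regression heads, which is how multi-aspect scores are typically predicted in a multitask setup.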

    Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech

    Get PDF
    Detecting individual pronunciation errors and diagnosing pronunciation error tendencies in a language learner based on their speech are important components of computer-aided language learning (CALL). The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous, especially given the inconsistency of human annotation of pronunciation errors. This paper presents an approach to these tasks that distinguishes between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.

    Cross-Lingual Transfer Learning Approach to Pronunciation Error Detection via Latent Phonetic Representation

    Get PDF
    Extensive research has been conducted on CALL systems for pronunciation error detection to automate language improvement through self-evaluation. However, many previous approaches have relied on HMM or hybrid neural network models which, although proven effective, often require phonetically labelled L2 speech data that is expensive and often scarce. This paper discusses a "zero-shot" transfer learning approach to detect phonetic errors in L2 English speech by Japanese native speakers using solely unaligned, phonetically labelled native-language speech. The proposed method introduces a simple base architecture which utilizes the XLSR-Wav2Vec2.0 model pre-trained on unlabelled multilingual speech. A phoneme mapping for each language is determined based on differences in the articulation of similar phonemes. This method achieved a Phonetic Error Rate of 0.214 on erroneous L2 speech after fine-tuning on 70 hours of speech with low-resource automated phonetic labelling, and proved to additionally model phonemes of the L2 speaker's native language effectively without the need for L2 speech fine-tuning.
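The articulation-based phoneme mapping idea can be sketched as a lookup that substitutes L1 phones with their closest target-language counterparts. The table below is purely illustrative; it is not the mapping derived in the paper.

```python
# hypothetical Japanese-to-English phone substitutions chosen by
# articulatory closeness (illustrative values, not from the paper)
PHONEME_MAP = {"ɾ": "r", "ɸ": "f", "ɯ": "u"}

def map_phonemes(seq, table):
    # substitute mapped phones; pass unmapped phones through unchanged
    return [table.get(p, p) for p in seq]
```

Applying such a table to the output of a multilingual phone recognizer lets native-language labels stand in for L2 labels, which is the essence of the zero-shot setup.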

    Multi-View Multi-Task Representation Learning for Mispronunciation Detection

    Full text link
    The disparity in phonology between a learner's native (L1) and target (L2) language poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by the lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple `views' of the same input data, assisted by auxiliary tasks, to learn more distinctive phonetic representations in a low-resource setting. Using mono- and multilingual encoders, the model learns multiple views of the input and captures the sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our reported results using the L2-ARCTIC data outperformed the SOTA models, with phoneme error rate reductions of 11.13% and 8.60% and absolute F1 score increases of 5.89% and 2.49% compared to the single-view mono- and multilingual systems, with a limited L2 dataset. Comment: 5 pages

    Mispronunciation Detection and Diagnosis in Mandarin-Accented English Speech

    Get PDF
    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed by using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features in revealing the significant contributors to mispronunciation as well as improving the performance of MDD systems.
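Detection accuracy and false rejection rate are standard MDD figures computed from per-phone confusion counts. A sketch of how such metrics fall out of reference and hypothesis error flags follows; the exact definitions used in the thesis above may differ slightly.

```python
def mdd_metrics(ref_err, hyp_err):
    # ref_err / hyp_err: per-phone flags, True = mispronounced
    pairs = list(zip(ref_err, hyp_err))
    tr = sum(r and h for r, h in pairs)                  # error correctly flagged
    ta = sum((not r) and (not h) for r, h in pairs)      # correct phone accepted
    fr = sum((not r) and h for r, h in pairs)            # correct phone wrongly flagged
    fa = sum(r and (not h) for r, h in pairs)            # error missed
    n = len(pairs)
    detection_accuracy = (tr + ta) / n
    false_rejection_rate = fr / max(ta + fr, 1)          # share of correct phones rejected
    return detection_accuracy, false_rejection_rate
```

Diagnostic accuracy would additionally check, among the true rejections, whether the predicted phone matches the annotated error.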

    Rapid Generation of Pronunciation Dictionaries for new Domains and Languages

    Get PDF
    This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Solutions are proposed and developed for various conditions, ranging from the straightforward scenario, in which the target language is present in written form on the Internet and the mapping between speech and written language is close, to the difficult scenario, in which no written form of the target language exists.
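For the close speech-to-writing scenario, dictionary entries can be bootstrapped with longest-match grapheme-to-phoneme rules. The rule table below is a toy example for illustration, not the dissertation's actual rule set.

```python
# toy grapheme-to-phoneme rules (illustrative, German-like spellings)
RULES = {"sch": "ʃ", "ch": "x", "a": "a"}

def g2p(word, rules):
    # longest-match-first rule application over the spelling
    out, i = [], 0
    while i < len(word):
        for n in (3, 2, 1):
            chunk = word[i:i + n]
            if len(chunk) == n and chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            # no rule matched: pass the grapheme through unchanged
            out.append(word[i])
            i += 1
    return out
```

Entries generated this way can then seed a dictionary that is refined with data-driven methods for the harder scenarios.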

    Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors

    Full text link
    Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is an increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability. However, the scarcity and unreliable annotation of disordered child speech corpora, along with the high acoustic variation in child speech data, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. Therefore, this thesis investigates two types of detection systems that can be achieved with minimal dependency on annotated mispronounced speech data. First, a novel approach that adopts paralinguistic features which represent the prosodic, spectral, and voice quality characteristics of the speech was proposed to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal. Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods where detection and diagnosis are performed at the phoneme level, the proposed method achieved MDD at the speech attribute level, namely the manners and places of articulation. The speech attribute features describe the involved articulators and their interactions when making a speech sound, allowing a low-level description of the pronunciation error to be provided.
Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method leveraging the Multi-Task Learning (MTL) criterion and training a separate model for each attribute, and an alignment-free jointly-learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion. The proposed techniques have been evaluated using standard and publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora.
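Modeling at the speech-attribute level means mapping each phone onto parallel manner and place label sequences, one per task head. The small inventory below is illustrative; a real system would cover the full phone set.

```python
# hypothetical phone-to-attribute lookups (illustrative subset)
MANNER = {"p": "stop", "b": "stop", "s": "fricative", "m": "nasal"}
PLACE = {"p": "bilabial", "b": "bilabial", "s": "alveolar", "m": "bilabial"}

def to_attributes(phones):
    # derive two parallel attribute sequences from one phone sequence,
    # e.g. as targets for the per-attribute MTL heads or the CTC model
    return [MANNER[p] for p in phones], [PLACE[p] for p in phones]
```

Comparing predicted and expected attribute sequences then localizes a pronunciation error to a specific articulator rather than a whole phoneme.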