Search CORE

37 research outputs found

MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH

Author: Khanal Subash
Publication venue: UKnowledge
Publication date: 01/01/2020

This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features both in revealing the significant contributors to mispronunciation and in improving the performance of MDD systems.

University of Kentucky
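The three figures quoted in the abstract above (detection accuracy, diagnostic accuracy, false rejection rate) are usually derived from a handful of phone-level outcome counts. The sketch below is only an illustration under assumed definitions that are common in the MDD literature; the abstract does not state the exact formulas the thesis used, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MddCounts:
    """Phone-level outcome counts for a detector (hypothetical container)."""
    true_accept: int    # correctly pronounced phones accepted as correct
    false_reject: int   # correctly pronounced phones flagged as errors
    false_accept: int   # mispronounced phones accepted as correct
    correct_diag: int   # mispronunciations flagged and diagnosed correctly
    diag_error: int     # mispronunciations flagged but diagnosed wrongly

    @property
    def true_reject(self) -> int:
        # Every flagged mispronunciation is either diagnosed correctly or not.
        return self.correct_diag + self.diag_error


def detection_accuracy(c: MddCounts) -> float:
    """Share of all phones whose correct/incorrect status was judged correctly."""
    total = c.true_accept + c.false_reject + c.false_accept + c.true_reject
    return (c.true_accept + c.true_reject) / total


def diagnostic_accuracy(c: MddCounts) -> float:
    """Share of detected mispronunciations that were also diagnosed correctly."""
    return c.correct_diag / c.true_reject


def false_rejection_rate(c: MddCounts) -> float:
    """Share of correctly pronounced phones that were wrongly flagged."""
    return c.false_reject / (c.true_accept + c.false_reject)
```

Called on an `MddCounts` built from an evaluation run, the three functions return fractions in the range 0 to 1 that can be reported as percentages.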

Multi-View Multi-Task Representation Learning for Mispronunciation Detection

Authors: Ali Ahmed; Chowdhury Shammur Absar; Kheir Yassine El
Publication date: 02/06/2023

The disparity in phonology between a learner's native (L1) and target (L2) languages poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by the lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple "views" of the same input data, assisted by auxiliary tasks, to learn a more distinctive phonetic representation in a low-resource setting. Using mono- and multilingual encoders, the model learns multiple views of the input and captures sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our reported results on the L2-ARCTIC data outperformed the SOTA models, with phoneme error rate reductions of 11.13% and 8.60% and absolute F1 score increases of 5.89% and 2.49% compared to the single-view mono- and multilingual systems, respectively, with a limited L2 dataset.
Comment: 5 pages

arXiv.org e-Print Archive

Transparent pronunciation scoring using articulatorily weighted phoneme edit distance

Authors: Karhila Reima; Kurimo Mikko; Smolander Anna-Riikka; Ylinen Sari
Publication date: 01/01/2019

Peer reviewed

arXiv.org e-Print Archive

Crossref

Aaltodoc Publication Archive

Helsingin yliopiston digitaalinen arkisto

Joint Model for Automatic Pronunciation Assessment and Mispronunciation Detection

Author: 류형신
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/08/2023

Thesis (Master's) -- Seoul National University Graduate School: Department of Linguistics, College of Humanities, August 2023. 정민화.

Empirical studies report a strong correlation between pronunciation scores and mispronunciations in human evaluators' assessments of non-native speech. However, existing computer-assisted pronunciation training (CAPT) systems treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as independent tasks and focus on improving the performance of each separately. Motivated by the correlation between the two tasks, this study proposes a novel architecture that jointly tackles APA and MDD with a multi-task learning scheme to benefit both tasks. Specifically, cross-entropy and root mean square error (RMSE) criteria are compared for the APA loss, while the MDD loss is fixed to the Connectionist Temporal Classification (CTC) criterion. The backbone acoustic model is a pre-trained self-supervised model, optionally given an auxiliary fine-tuning on phone recognition before multi-task learning to leverage extra knowledge transfer. The Goodness-of-Pronunciation (GOP) measure is given as an additional input alongside the acoustic model. On the Speechocean762 dataset, the joint model significantly outperformed its single-task counterparts, with a mean increase of 0.041 in Pearson correlation coefficient (PCC) across the four multi-aspect scores for the APA task and an F1 increase of 0.003 for the MDD task. Among the joint architectures tried, multi-task learning with the RMSE and CTC criteria, raw Robust Wav2vec2.0, and the GOP measure achieved the best performance. Analysis indicates that, compared with the single-task models, the joint model learned to distinguish scores with low distribution and to better recognize mispronunciations as mispronunciations. Interestingly, the performance gain in each subtask of the joint model was proportional to the strength of the correlation between the respective pronunciation score and the mispronunciation labels, and the correlation between the model's predicted scores and predicted mispronunciations also grew stronger as the joint model's performance improved. The findings reveal that the joint model leveraged the linguistic correlation between pronunciation scores and mispronunciations to improve performance on both the APA and MDD tasks, and that its behavior follows the assessments of human experts.

Contents: Chapter 1. Introduction; Chapter 2. Related work; Chapter 3. Methodology; Chapter 4. Results; Chapter 5. Discussion; Chapter 6. Conclusion; References; Appendix; Korean abstract.

SNU Open Repository and Archive
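The thesis abstract above names RMSE as one of the APA training criteria and the Pearson correlation coefficient (PCC) as the APA evaluation measure. As a rough sketch of what those two quantities compute, and of one way two task losses could be combined, consider the following. The abstract does not say how the APA and MDD losses are weighted, so the weighted sum and its `apa_weight` parameter are assumptions, and all function names are hypothetical.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def joint_loss(apa_loss, mdd_loss, apa_weight=0.5):
    """Assumed combination: a convex weighted sum of the two task losses."""
    return apa_weight * apa_loss + (1.0 - apa_weight) * mdd_loss
```

In an actual training loop the MDD term would come from a CTC loss computed by a deep learning framework; here it is just a number passed in, which keeps the sketch free of external dependencies.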

Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

Authors: Ahmed Beena; Epps Julien; Shahin Mostafa
Publication date: 12/11/2023

The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods that rely on analysing phonemes can only detect categorical errors of phonemes that have an adequate amount of training data to be modelled. With the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is infeasible to model all types of mispronunciations. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback to the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as a core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages. Compared to the traditional phoneme-level MDD, the proposed speech attribute MDD method achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes.

arXiv.org e-Print Archive
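The abstract above argues that describing a phoneme by its speech attributes gives more formative feedback than a bare phoneme label. The sketch below illustrates only that idea, using a small hand-written table of standard articulatory descriptions for a few English consonants (ARPAbet symbols). The table, the attribute names, and the `diagnose` function are assumptions made for illustration; the paper's own attribute inventory and its wav2vec2-based multi-label CTC detector are not reproduced here.

```python
# Illustrative attribute table (assumed, not taken from the paper).
ATTRIBUTES = {
    "P":  {"place": "bilabial", "manner": "stop",      "voicing": "voiceless"},
    "B":  {"place": "bilabial", "manner": "stop",      "voicing": "voiced"},
    "T":  {"place": "alveolar", "manner": "stop",      "voicing": "voiceless"},
    "D":  {"place": "alveolar", "manner": "stop",      "voicing": "voiced"},
    "S":  {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
    "Z":  {"place": "alveolar", "manner": "fricative", "voicing": "voiced"},
    "TH": {"place": "dental",   "manner": "fricative", "voicing": "voiceless"},
}


def diagnose(canonical, produced):
    """Return the attributes on which the produced phone differs from the target.

    The result maps each differing attribute name to an
    (expected, actual) pair; an empty dict means no attribute-level error.
    """
    expected = ATTRIBUTES[canonical]
    actual = ATTRIBUTES[produced]
    return {
        name: (expected[name], actual[name])
        for name in expected
        if expected[name] != actual[name]
    }
```

For a learner who says "S" where "TH" is expected, `diagnose("TH", "S")` reports a difference in place of articulation only, which is more actionable than the phoneme-level statement that the wrong phone was produced.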

Automatic Pronunciation Assessment -- A Review

Authors: Ali Ahmed; Chowdhury Shammur Absar; Kheir Yassine El
Publication date: 21/10/2023

Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends, and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.
Comment: 9 pages, accepted to EMNLP Findings

arXiv.org e-Print Archive

Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors

Author: Shahin Mostafa
Publication venue: UNSW, Sydney
Publication date: 01/01/2023

Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is an increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability. However, the scarcity and unreliable annotation of disordered child speech corpora, along with the high acoustic variation in child speech data, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. Therefore, this thesis investigates two types of detection systems that can be achieved with minimum dependency on annotated mispronounced speech data. First, a novel approach was proposed that adopts paralinguistic features, which represent the prosodic, spectral, and voice quality characteristics of the speech, to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal. Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods, where detection and diagnosis are performed at the phoneme level, the proposed method achieved MDD at the speech attribute level, namely the manner and place of articulation. The speech attribute features describe the involved articulators and their interactions when making a speech sound, allowing a low-level description of the pronunciation error to be provided.
Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method that leverages the Multi-Task Learning (MTL) criterion and trains a separate model for each attribute, and an alignment-free, jointly learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion. The proposed techniques have been evaluated using standard and publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora.

UNSWorks

Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech

Authors: Gales MJF; Knill KM; Kyriakopoulos K
Publication venue: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication date: 01/01/2020

Detecting individual pronunciation errors and diagnosing pronunciation error tendencies in a language learner based on their speech are important components of computer-aided language learning (CALL). Both tasks become particularly challenging when the speech in question is spontaneous, and are further complicated by the inconsistency of human annotation of pronunciation errors. This paper presents an approach to these tasks that distinguishes between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.

Crossref

Apollo (Cambridge)
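The abstract above characterises accent errors as consistent patterns of phone substitution, deletion and insertion. One generic way to surface such patterns, sketched here under the assumption that both a canonical and a realized phone sequence are available for each word, is to align the two sequences by edit distance and tally the resulting operations per speaker. This is not the method of the paper; the function names and the ARPAbet-style phone symbols are illustrative.

```python
from collections import Counter


def align_ops(canonical, realized):
    """List the edit operations that turn `canonical` into `realized`.

    Each operation is ("sub", expected, actual), ("del", expected, None)
    or ("ins", None, actual). Matching phones produce no operation.
    """
    n, m = len(canonical), len(realized)
    # cost[i][j] = edit distance between canonical[:i] and realized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = canonical[i - 1] != realized[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and canonical[i - 1] != realized[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            if mismatch:
                ops.append(("sub", canonical[i - 1], realized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, realized[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def error_tendencies(pairs):
    """Count edit operations over a speaker's (canonical, realized) pairs."""
    counts = Counter()
    for canonical, realized in pairs:
        counts.update(align_ops(canonical, realized))
    return counts
```

Operations that recur across many words for the same speaker would point toward an accent tendency, whereas an isolated mismatch confined to one word is more suggestive of a lexical error; any threshold separating the two would have to be chosen empirically.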