
    L2-ARCTIC: A Non-Native English Speech Corpus

    In this paper, we introduce L2-ARCTIC, a speech corpus of non-native English that is intended for research in voice conversion, accent conversion, and mispronunciation detection. This initial release includes recordings from ten non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, each L1 containing recordings from one male and one female speaker. Each speaker recorded approximately one hour of read speech from the Carnegie Mellon University ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation errors: substitutions, deletions, and additions, making it a valuable resource not only for research in voice conversion and accent conversion but also in computer-assisted pronunciation training. The corpus is publicly accessible at https://psi.engr.tamu.edu/l2-arctic-corpus/
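The three annotated error types map directly onto edit operations between a canonical phone sequence and the learner's realized sequence. As a rough illustration (the phone sequences below are made up, not taken from the corpus), a standard edit-distance alignment with a backtrace can tally substitutions, deletions, and additions:

```python
# Align a canonical phone sequence against a learner's realized sequence
# and tally the three annotated error types: substitutions, deletions,
# and additions. Phone sequences here are illustrative, not from L2-ARCTIC.

def align_errors(canonical, realized):
    m, n = len(canonical), len(realized)
    # dp[i][j] = minimum edit cost between canonical[:i] and realized[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (canonical[i - 1] != realized[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace to count error types.
    counts = {"substitutions": 0, "deletions": 0, "additions": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (canonical[i - 1] != realized[j - 1])):
            if canonical[i - 1] != realized[j - 1]:
                counts["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletions"] += 1   # canonical phone missing from realization
            i -= 1
        else:
            counts["additions"] += 1   # extra phone produced by the speaker
            j -= 1
    return counts

# "th" realized as "d": one substitution, no deletions or additions
print(align_errors(["DH", "IH", "S"], ["D", "IH", "S"]))
```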

    Supervised Detection and Unsupervised Discovery of Pronunciation Error Patterns for Computer-Assisted Language Learning

    Pronunciation error patterns (EPs) are patterns of mispronunciation frequently produced by language learners, and they usually differ across pairs of target and native languages. Accurate information about EPs can offer helpful feedback to learners to improve their language skills. However, the major difficulty of EP detection comes from the fact that EPs are intrinsically similar to their corresponding canonical pronunciation, and EPs corresponding to the same canonical pronunciation are also intrinsically similar to each other. As a result, distinguishing EPs from their corresponding canonical pronunciation, and between different EPs of the same phoneme, is a difficult task – perhaps even more difficult than distinguishing between different phonemes in one language. On the other hand, the cost of deriving all EPs for each pair of target and native languages is high, usually requiring extensive expert knowledge or high-quality annotated data. Unsupervised EP discovery from a corpus of learner recordings would thus be an attractive addition to the field. In this dissertation, we propose new frameworks for both supervised EP detection and unsupervised EP discovery. For supervised EP detection, we use hierarchical MLPs as EP classifiers, integrated with an HMM/GMM baseline in a two-pass Viterbi decoding architecture. Experimental results show that the new framework enhances the power of EP diagnosis. For unsupervised EP discovery, we propose the first known framework, using the hierarchical agglomerative clustering (HAC) algorithm to explore sub-segmental variation within phoneme segments and produce fixed-length segment-level feature vectors that distinguish different EPs. We tested K-means (assuming a known number of EPs) and the Gaussian mixture model with the minimum description length principle (estimating an unknown number of EPs) for EP discovery.
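The fixed-length segment representation described above can be sketched roughly as follows. This is a simplified stand-in for the dissertation's HAC step: it greedily merges the most similar adjacent pair of sub-segments (preserving time order) until a fixed number remain, then concatenates the sub-segment means. The frame features and the choice of three sub-segments are assumptions for illustration only:

```python
import numpy as np

# Simplified sketch of segment-level feature extraction: split a phoneme
# segment's frames into a fixed number of contiguous sub-segments by
# greedily merging the closest adjacent pair (an agglomerative pass),
# then concatenate the sub-segment means into one fixed-length vector.
# Random frame features and n_sub=3 are illustrative assumptions.

def segment_vector(frames, n_sub=3):
    # Start with one sub-segment per frame.
    subs = [frames[i:i + 1] for i in range(len(frames))]
    while len(subs) > n_sub:
        # Distance between means of each adjacent sub-segment pair.
        dists = [np.linalg.norm(a.mean(0) - b.mean(0))
                 for a, b in zip(subs, subs[1:])]
        k = int(np.argmin(dists))
        subs[k:k + 2] = [np.vstack(subs[k:k + 2])]   # merge the closest pair
    return np.concatenate([s.mean(0) for s in subs])

rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 4))     # 20 frames of 4-dim features
vec = segment_vector(frames, n_sub=3)
print(vec.shape)                      # fixed length: 3 sub-segments * 4 dims
```

Because every segment maps to the same-length vector regardless of its duration, the outputs can be fed directly to K-means or a GMM for EP clustering.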
Preliminary experiments offered very encouraging results, although there is still a long way to go to approach the performance of human experts. We also propose to use the universal phoneme posteriorgram (UPP), derived from an MLP trained on corpora of mixed languages, as frame-level features in both supervised detection and unsupervised discovery of EPs. Experimental results show that using UPP not only achieves the best performance, but is also useful in analyzing the mispronunciations produced by language learners.
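The UPP idea can be illustrated with a minimal sketch: each frame's acoustic features pass through an MLP whose softmax output spans a merged, language-universal phoneme set, and stacking the per-frame posteriors gives the posteriorgram used as frame-level features. The random weights below stand in for a network actually trained on mixed-language corpora, and the layer sizes are assumptions:

```python
import numpy as np

# Sketch of a universal phoneme posteriorgram (UPP): map MFCC-like frame
# features through a one-hidden-layer MLP and take the softmax over a
# universal phoneme set. Weights are random stand-ins for a trained net;
# layer sizes (13 -> 32 -> 40) are illustrative assumptions.

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_frames, n_mfcc, n_hidden, n_phones = 50, 13, 32, 40

W1 = rng.normal(size=(n_mfcc, n_hidden))
W2 = rng.normal(size=(n_hidden, n_phones))

frames = rng.normal(size=(n_frames, n_mfcc))   # MFCC-like features
hidden = np.tanh(frames @ W1)                  # hidden layer
upp = softmax(hidden @ W2)                     # (frames, phones) posteriorgram

print(upp.shape)   # one posterior distribution per frame
```

Each row of `upp` is a probability distribution over the universal phoneme set, so the matrix can serve directly as frame-level input to either the supervised EP classifiers or the unsupervised clustering stage.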

    Apraxia World: Deploying a Mobile Game and Automatic Speech Recognition for Independent Child Speech Therapy

    Children with speech sound disorders typically improve pronunciation quality by undergoing speech therapy, which must be delivered frequently and with high intensity to be effective. As such, clinic sessions are supplemented with home practice, often under caregiver supervision. However, traditional home practice can grow boring for children due to monotony. Furthermore, practice frequency is limited by caregiver availability, making it difficult for some children to reach therapy dosage. To address these issues, this dissertation presents a novel speech therapy game to increase engagement, and explores automatic pronunciation evaluation techniques to afford children independent practice. The therapy game, called Apraxia World, delivers customizable, repetition-based speech therapy while children play through platformer-style levels using typical on-screen tablet controls; children complete in-game speech exercises to collect assets required to progress through the levels. Additionally, Apraxia World provides pronunciation feedback according to an automated pronunciation evaluation system running locally on the tablet.
Apraxia World offers two advantages over current commercial and research speech therapy games: first, the game provides extended gameplay to support long therapy treatments; second, it affords some practice independence via automatic pronunciation evaluation, allowing caregivers to lightly supervise instead of directly administer the practice. Pilot testing indicated that children enjoyed the game-based therapy much more than traditional practice and that the exercises did not interfere with gameplay. During a longitudinal study, children made clinically significant pronunciation improvements while playing Apraxia World at home. Furthermore, children remained engaged in the game-based therapy over the two-month testing period, and some even wanted to continue playing post-study. The second part of the dissertation explores word- and phoneme-level pronunciation verification for child speech therapy applications. Word-level pronunciation verification is accomplished using a child-specific template-matching framework, where an utterance is compared against correctly and incorrectly pronounced examples of the word. This framework identified mispronounced words better than both a standard automated baseline and co-located caregivers. Phoneme-level mispronunciation detection is investigated using a technique from the second-language learning literature: training phoneme-specific classifiers with phonetic posterior features. This method also outperformed the standard baseline and, more significantly, identified mispronunciations better than student clinicians.
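The template-matching idea above can be sketched with dynamic time warping (DTW), a standard way to score variable-length feature sequences against stored examples; the closer template (correct or incorrect) decides the verdict. The synthetic feature sequences below are stand-ins: a real system would compare MFCC or posterior features from the child's own recordings.

```python
import numpy as np

# Sketch of word-level template matching: score an utterance against a
# correct and an incorrect pronunciation template with DTW and label it
# by the closer one. Feature sequences are synthetic stand-ins.

def dtw(a, b):
    # Classic dynamic-programming alignment cost between two sequences.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)   # length-normalized alignment cost

rng = np.random.default_rng(2)
correct_tpl = rng.normal(0, 1, size=(30, 12))   # features of a good example
wrong_tpl = rng.normal(3, 1, size=(28, 12))     # features of a bad example
# Test utterance: a time-compressed, lightly perturbed copy of the good example.
utterance = correct_tpl[::2] + rng.normal(0, 0.1, size=(15, 12))

verdict = ("correct" if dtw(utterance, correct_tpl) < dtw(utterance, wrong_tpl)
           else "mispronounced")
print(verdict)
```

Because templates come from the same child, this comparison sidesteps much of the speaker variability that makes general child-speech models hard to train.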