118 research outputs found

    MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH

    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features in revealing the significant contributors to mispronunciation as well as in improving the performance of MDD systems.
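    The abstract reports a detection accuracy, a diagnostic accuracy, and a false rejection rate but does not say how they were computed. The sketch below assumes the hierarchical evaluation scheme commonly used in MDD work, where phones are counted as true acceptances, false rejections, false acceptances, and true rejections, and true rejections are split into correct and incorrect diagnoses. The class name, field names, and the counts in the usage line are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class MDDCounts:
    """Phone-level outcome counts (hypothetical names, assumed scheme)."""
    true_accept: int    # correct pronunciation, accepted by the system
    false_reject: int   # correct pronunciation, flagged as an error
    false_accept: int   # mispronunciation, missed by the system
    correct_diag: int   # mispronunciation flagged, error type identified correctly
    diag_error: int     # mispronunciation flagged, error type identified wrongly

    @property
    def true_reject(self) -> int:
        # Every flagged mispronunciation gets either a right or a wrong diagnosis.
        return self.correct_diag + self.diag_error


def mdd_metrics(c: MDDCounts) -> dict:
    """Compute the usual MDD summary metrics from outcome counts."""
    total = c.true_accept + c.false_reject + c.false_accept + c.true_reject
    return {
        # Share of all phones whose accept/reject decision was right.
        "detection_accuracy": (c.true_accept + c.true_reject) / total,
        # Share of correctly pronounced phones that were wrongly flagged.
        "false_rejection_rate": c.false_reject / (c.true_accept + c.false_reject),
        # Share of mispronounced phones that slipped through.
        "false_acceptance_rate": c.false_accept / (c.false_accept + c.true_reject),
        # Share of detected mispronunciations with the right error label.
        "diagnostic_accuracy": c.correct_diag / c.true_reject,
    }


# Illustrative counts only; not data from the EMA-MAE experiments.
example = mdd_metrics(MDDCounts(true_accept=80, false_reject=20,
                                false_accept=10, correct_diag=30, diag_error=10))
```

    Under this scheme the false rejection rate is normalized by correctly pronounced phones only, which is why it need not equal one minus the detection accuracy.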

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature. Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures.

    Accented Speech Recognition With Accent-specific Codebooks

    Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems. Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR. In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. These learnable codebooks capture accent-specific information and are integrated within the ASR encoder layers. The model is trained on accented English speech, while the test data also contains accents that were not seen during training. On the Mozilla Common Voice multi-accented dataset, we show that our proposed approach yields significant performance gains not only on the seen English accents (up to 37% relative improvement in word error rate) but also on the unseen accents (up to 5% relative improvement in WER). Further, we illustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We also compare the performance with other approaches based on accent adversarial training. Comment: Accepted to EMNLP 2023 Main Conference (Long Paper).
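    The gains above are quoted as relative improvements in word error rate (WER). As a minimal sketch of what that figure means, the code below computes WER as word-level edit distance divided by reference length, then the relative reduction between a baseline and an adapted system. The function names and the numbers in the usage lines are illustrative assumptions, not values or code from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent; positive means the new system is better."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative values only: a drop from 20.0% to 12.6% WER
# is a 7.4-point absolute gain, i.e. a 37% relative improvement.
gain = relative_wer_improvement(20.0, 12.6)
sample = wer("the cat sat on the mat", "the cat sit on mat")
```

    A relative figure depends on the baseline: the same absolute gain looks larger when the baseline WER is already low, which matters when comparing seen and unseen accents.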

    A computational model for studying L1’s effect on L2 speech learning

    Much evidence has shown that the first language (L1) plays an important role in the formation of the L2 phonological system during the second language (L2) learning process. Combined with the fact that different L1s have distinct phonological patterns, this points to diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are also correlated with perceived accentedness, and that the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between L1s and the L2 will manifest themselves in the accented speech produced by speakers from these L1s. To test the hypotheses, this study develops a computational model to analyze accented speech properties in both segmental (short-term speech measurements at the short-segment or phoneme level) and suprasegmental (long-term speech measurements at the word, long-segment, or sentence level) feature spaces. The benefit of using a computational model is that it enables quantitative analysis of the L1's effect on accent in terms of different phonological properties. The core parts of this computational model are feature extraction schemes that extract pronunciation and prosody representations of accented speech based on existing techniques in the speech processing field. Correlation analysis on both segmental and suprasegmental feature spaces is conducted to examine the relationship between acoustic measurements related to L1s and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly in segmental and suprasegmental feature spaces.
    Results reveal the potential of the methodology in this study to provide quantitative analysis of accented speech and to extend current studies in L2 speech learning theory to a large scale. Practically, this study further shows that the proposed computational model can benefit automatic accentedness evaluation systems by adding features related to speakers' L1s. Dissertation/Thesis. Doctoral Dissertation, Speech and Hearing Science 201

    Improving multilingual speech recognition systems

    End-to-end trainable deep neural networks have become the state-of-the-art architecture for automatic speech recognition (ASR), provided that the network is trained with a sufficiently large dataset. However, many existing languages are too sparsely resourced for deep learning networks to achieve accuracy as high as that of their resource-abundant counterparts. Multilingual recognition systems mitigate data sparsity issues by training models on data from multiple language resources to learn a speech-to-text or speech-to-phone model universal to all languages. The resulting multilingual ASR models usually have better recognition accuracy than models trained on the individual datasets. In this work, we argue that two limitations exist for multilingual systems, and that resolving them could result in improved recognition accuracy. (1) Existing corpora vary considerably in form (spontaneous or read speech), corpus size, noise level, and phoneme distribution, and ASR models trained on the joint multilingual dataset show large performance disparities across languages. We present an optimizable loss function, equal accuracy ratio (EAR), that measures the sequence-level performance disparity between different user groups, and we show that explicitly optimizing this objective reduces the performance gap and improves the multilingual recognition accuracy. (2) While having good accuracy on the seen training languages, multilingual systems do not generalize well to unseen testing languages; we refer to accuracy on such languages as cross-lingual recognition accuracy. We introduce language embeddings using external linguistic typologies and show that such embeddings can significantly increase both multilingual and cross-lingual accuracy. We illustrate the effectiveness of the proposed methods with experiments on multilingual, multi-user, and multi-dialect corpora.