
    PLASER: Pronunciation Learning via Automatic Speech Recognition

    PLASER is a multimedia tool with instant feedback designed to teach English pronunciation to high-school students in Hong Kong whose mother tongue is Cantonese Chinese. The objective is to teach correct pronunciation, not to assess a student's overall pronunciation quality. Major challenges related to speech recognition technology include allowance for non-native accents, reliable and corrective feedback, and visualization of errors.
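
    The abstract doesn't spell out how feedback is produced, but tools of this kind typically map per-phoneme confidence scores from the recognizer to corrective hints. Below is a minimal, purely illustrative sketch: the `give_feedback` helper, the confidence threshold, and the hint table are assumptions, not PLASER's actual design.

```python
# Illustrative sketch of confidence-driven pronunciation feedback,
# in the spirit of (but not identical to) PLASER.

# Hypothetical hint table for confusions common among Cantonese-speaking
# learners of English (assumed for illustration, not from the paper).
CORRECTIVE_HINTS = {
    ("th", "f"): "Place the tongue tip between the teeth for 'th'.",
    ("r", "w"): "Curl the tongue tip back without rounding the lips.",
}

def give_feedback(scored_phonemes, threshold=0.5):
    """scored_phonemes: (target, recognized, confidence) tuples from some
    recognizer; any phoneme scoring below `threshold` gets a hint."""
    feedback = []
    for target, recognized, conf in scored_phonemes:
        if conf < threshold:
            hint = CORRECTIVE_HINTS.get(
                (target, recognized),
                f"'{target}' sounded like '{recognized}'; listen and retry.")
            feedback.append((target, conf, hint))
    return feedback

# Example: recognizer output for an attempt at the word "three".
print(give_feedback([("th", "f", 0.31), ("r", "r", 0.92), ("iy", "iy", 0.88)]))
```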

    Integrating A Context-Dependent Phrase Grammar In The Variable N-Gram Framework

    This paper focuses on the learning of multi-word lexical units, or phrases, and how to model them within the variable n-gram framework. We introduce the notion of context-dependent phrases and suggest an algorithm for unsupervised learning of phrases. We also propose an approach to integrate a phrase grammar and a variable n-gram without the need to explicitly handle multi-word lexical items. The combined variable n-gram phrase grammar improves recognition accuracy on the Switchboard corpus over both the baseline trigram and a variable n-gram alone.
    1. INTRODUCTION
    Although words in English are reasonable lexical units for language modeling, there are many cases in which longer lexical units may be more appropriate. Frequently used word sequences, such as "I mean" or "you know", are so common in conversational speech that they may be effectively used by the speaker as a single lexical item. We call these multi-word units "phrases". There are several ways of treating a multi-word sequ…
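
    The excerpt names an unsupervised phrase-learning algorithm without reproducing it. A common stand-in is to greedily merge frequent, strongly associated adjacent word pairs into single tokens; the sketch below uses a simplified pointwise-mutual-information criterion, with `min_count` and `pmi_threshold` as assumed tuning knobs rather than the authors' actual procedure.

```python
import math
from collections import Counter

def learn_phrases(sentences, rounds=2, min_count=3, pmi_threshold=3.0):
    """Greedily merge high-association adjacent word pairs into phrase
    tokens joined with '_' (e.g. "you_know"); a simplified stand-in for
    the paper's unsupervised phrase learning, not the actual algorithm."""
    for _ in range(rounds):
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            unigrams.update(s)
            bigrams.update(zip(s, s[1:]))
        total = sum(unigrams.values())
        # Approximate PMI: how much more often the pair occurs than chance.
        merges = {
            (a, b) for (a, b), c in bigrams.items()
            if c >= min_count and
            math.log(c * total / (unigrams[a] * unigrams[b])) > pmi_threshold
        }
        merged = []
        for s in sentences:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) in merges:
                    out.append(s[i] + "_" + s[i + 1])
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            merged.append(out)
        sentences = merged
    return sentences

# Tiny demo (low threshold because the corpus is tiny):
corpus = [["you", "know", "i", "mean", "it"]] * 5 + [["i", "know"]] * 3
print(learn_phrases(corpus, rounds=1, pmi_threshold=1.0)[0])
```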

    N-best tokenization in a GMM-SVM language identification system

    N-best or lattice-based tokenization has been widely used in speech-related classification tasks. In this paper, we extended the n-best tokenization approach to GMM-based language identification systems with either maximum-likelihood (ML) trained or SVM-based language models. We explored the effect of n-best tokenization in training and testing, and its interaction with n-gram order and system fusion. We showed that n-best tokenization gives a good performance improvement for both systems; however, the SVM-based system benefited from both n-best training and testing, while the ML-trained system benefited only from n-best training. Results show that n-best tokenization can reduce the relative EER of our best GMM-SVM system by about 5% for the 30s and 10s tests.
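
    Concretely, n-best tokenization replaces the single best phone string with statistics accumulated over the top-N hypotheses, each weighted by its posterior. Below is a minimal sketch of the feature side only, assuming a tokenizer that already returns (phone sequence, posterior) pairs; the downstream SVM or ML n-gram language model is omitted.

```python
from collections import Counter

def expected_ngram_features(nbest, n=2):
    """nbest: list of (phone_sequence, posterior) pairs from a tokenizer.
    Returns posterior-weighted n-gram relative frequencies, which could be
    fed to an SVM (or used as ML n-gram statistics) for language ID."""
    counts = Counter()
    for phones, post in nbest:
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += post
    total = sum(counts.values()) or 1.0
    return {ng: c / total for ng, c in counts.items()}

# 1-best vs. 3-best: weaker hypotheses still contribute probability mass.
nbest = [(["ah", "b", "ah"], 0.6),
         (["ah", "p", "ah"], 0.3),
         (["aa", "b", "ah"], 0.1)]
print(expected_ngram_features(nbest, n=2))
```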

    Evaluation of the robustness of the polynomial segment models to noisy environments with unsupervised adaptation

    Recently, polynomial segment models (PSMs) have been shown to be a competitive alternative to HMMs in large-vocabulary continuous speech recognition [Li, C., Siu, M., Au-yeung, S., 2006. Recursive likelihood evaluation and fast search algorithm for polynomial segment model with application to speech recognition. IEEE Trans. on Audio, Speech and Language Processing 14, 1704-1708]. Their more constrained nature raises the issue of robustness under environmental mismatches. In this paper, we examine the robustness properties of PSMs on the Aurora 4 corpus under both clean and multi-condition training. In addition, we generalize two unsupervised model adaptation schemes, maximum likelihood linear regression (MLLR) and reference speaker weighting (RSW), to PSMs and explore their effectiveness for PSM environmental adaptation. Our experiments showed that although the word error rate differences between PSMs and HMMs became smaller under noisy test environments than under clean test environments, PSMs were still competitive under mismatched conditions. After model adaptation, especially with RSW adaptation, the word error rates were reduced for both HMMs and PSMs. The best word error rate was obtained with RSW-adapted PSMs by rescoring lattices generated with the adapted HMMs. Overall, with model adaptation, the recognition word error rate can be reduced by more than 20%.
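
    Both adaptation schemes act on the models' Gaussian mean parameters: MLLR applies a shared affine transform of the means (A·mu + b) estimated from adaptation data, while RSW expresses the adapted means as a weighted combination of reference-speaker models. The numpy sketch below illustrates the two updates only; the least-squares RSW weight estimate stands in for the maximum-likelihood estimation used in the paper, and all shapes are assumptions.

```python
import numpy as np

def mllr_adapt(means, A, b):
    """Shared affine MLLR update: each adapted mean is A @ mu + b.
    (Estimating A and b from adaptation data is omitted here.)"""
    return np.asarray(means) @ np.asarray(A).T + np.asarray(b)

def rsw_adapt(reference_means, observed_mean):
    """reference_means: (R, D) matrix, one mean supervector per reference
    speaker; observed_mean: (D,) statistic from the adaptation data.
    Solves for combination weights w minimizing ||refs.T @ w - observed||,
    a least-squares stand-in for the paper's ML weight estimation."""
    refs = np.asarray(reference_means, dtype=float)
    target = np.asarray(observed_mean, dtype=float)
    w, *_ = np.linalg.lstsq(refs.T, target, rcond=None)
    return w, w @ refs  # weights and the adapted mean supervector

# Toy example: 3 reference speakers in a 4-dimensional mean space.
refs = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 1.0]])
w, adapted = rsw_adapt(refs, observed_mean=[0.5, 0.3, 0.2, 0.2])
print(w.round(3), adapted.round(3))
```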

    Articulatory-feature-based confidence measures

    Confidence measures are computed to estimate the certainty that target acoustic units are spoken in specific speech segments. They are applied in tasks such as keyword verification and utterance verification. Because many confidence measures use the same set of models and features as recognition, the resulting scores may not provide an independent measure of reliability. In this paper, we propose two articulatory-feature (AF) based phoneme confidence measures that estimate acoustic reliability based on the match in AF properties. While acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), are widely used in speech processing, some recent work has focused on linguistically motivated features, such as articulatory features, which relate directly to the human articulatory process and may better capture speech characteristics. Articulatory features can either replace or complement acoustic features in speech processing. The proposed AF-based measures were evaluated, in comparison and in combination, with HMM-based scores on phoneme and keyword verification tasks using children's speech collected for a computer-based English pronunciation learning project. To fully evaluate their usefulness, the proposed measures and combinations were evaluated on both native and non-native data, and under field-test conditions that mismatch the training conditions. The experimental results show that under the different environments, combinations of the AF scores with the HMM-base…
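
    The combination scheme isn't specified in this excerpt; one simple way to fuse an AF-based confidence with an HMM-based confidence for keyword verification is a weighted linear combination against a decision threshold. A minimal sketch, with `alpha` and `threshold` as assumed tunable parameters rather than the paper's method:

```python
def verify_keyword(hmm_score, af_score, alpha=0.6, threshold=0.0):
    """Accept a keyword hypothesis if a linear fusion of the HMM-based
    confidence and the articulatory-feature (AF) based confidence passes
    a threshold; alpha and threshold would be tuned on held-out data."""
    fused = alpha * hmm_score + (1.0 - alpha) * af_score
    return fused >= threshold, fused

# Example: a marginal HMM score contradicted by the AF score is rejected.
accepted, score = verify_keyword(hmm_score=0.2, af_score=-1.1)
print(accepted, round(score, 3))
```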