52 research outputs found

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications that operate in real-world environments, such as mobile communication services and smart homes.
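    The search chapters concern finding the most likely state or word sequence given the acoustic evidence. As a minimal illustration of that idea (a sketch, not code from the book), the following implements Viterbi decoding over a single HMM in log space; in a real system the parameters would come from the trained acoustic and language models:

        import numpy as np

        def viterbi(log_init, log_trans, log_emit):
            """Most likely HMM state sequence for an observation sequence.

            log_init:  (S,) log initial-state probabilities
            log_trans: (S, S) log transition probabilities
            log_emit:  (T, S) log emission likelihood per frame and state
            """
            T, S = log_emit.shape
            delta = log_init + log_emit[0]           # best log score ending in each state
            backptr = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_trans  # scores[prev, next]
                backptr[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_emit[t]
            path = [int(delta.argmax())]             # trace back the best path
            for t in range(T - 1, 0, -1):
                path.append(int(backptr[t, path[-1]]))
            return path[::-1], float(delta.max())

    Real LVCSR decoders search a composed network of language model, lexicon, and HMMs with beam pruning, but the dynamic-programming core is the same.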

    A Study of the Automatic Speech Recognition Process and Speaker Adaptation

    This thesis considers the entire automatic speech recognition process and presents a standardised approach to LVCSR experimentation with HMMs. It also discusses approaches to speaker adaptation, such as MLLR and multiscale adaptation, and presents experimental results for cross-task speaker adaptation. An analysis of training parameters and data sufficiency for reasonable system performance estimates is also included. It is found that supervised Maximum Likelihood Linear Regression (MLLR) adaptation can yield a 6% absolute reduction in word error rate given only one minute of adaptation data, compared with an unadapted model set trained on a different task: the unadapted system performed at 24% WER and the adapted system at 18% WER. This is achieved with only 4 to 7 adaptation classes per speaker, as generated from a regression tree.
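    MLLR adapts a speaker-independent model by estimating, per regression class, a single affine transform W = [A | b] that is applied to every Gaussian mean in the class (mu' = A mu + b), which is why so little adaptation data suffices. A minimal sketch of the transform's shape follows; the least-squares fit is a toy stand-in for the actual maximum-likelihood estimation, which weights adaptation statistics by state occupancy:

        import numpy as np

        def apply_mllr(means, W):
            """Adapt Gaussian means with one regression-class transform.

            means: (N, D) stacked mean vectors of the class
            W:     (D, D+1) affine transform [A | b]
            Returns mu' = A @ mu + b for each mean.
            """
            ext = np.hstack([means, np.ones((len(means), 1))])  # extended means [mu; 1]
            return ext @ W.T

        def fit_w_toy(means, targets):
            """Toy least-squares stand-in for the ML estimate of W."""
            ext = np.hstack([means, np.ones((len(means), 1))])
            W_T, *_ = np.linalg.lstsq(ext, targets, rcond=None)
            return W_T.T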

    The effects of child language development on the performance of automatic speech recognition

    Compared to adults' speech, children's speech appears to be more challenging for ASR and yields inferior results. It has been suggested that addressing this issue requires linguistic understanding of children's speech development, which could provide either a solution or an explanation. The present work explores the influence of phonological effects associated with language acquisition (PEALA) on children's ASR and investigates whether these effects can be detected in systematic patterns of ASR phone confusion errors or evidenced in systematic patterns of acoustic feature structure. Findings from speech development research are used as the framework within which a set of predictable error patterns is defined; this set guides the analysis of the experimental results reported. Several ASR experiments are conducted involving both children's and adults' speech. ASR phone confusion matrices are extracted and analysed with a statistical significance test proposed for the purposes of this work, and a mathematical model is introduced to interpret the emerging results. Additionally, bottleneck features and i-vectors representing the acoustic features in one of the systems developed are extracted and visualised using linear discriminant analysis (LDA). A qualitative analysis is conducted with reference to patterns predictable through PEALA.
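    One way to make "systematic patterns of phone confusion errors" concrete: count aligned reference/hypothesis phone pairs and ask whether a PEALA-predicted confusion occurs more often than a chance baseline. The abstract does not specify the thesis's own significance test, so the binomial test below is an illustrative stand-in:

        from collections import Counter
        from scipy.stats import binomtest

        def confusion_is_systematic(pairs, ref, hyp, n_phones, alpha=0.05):
            """Test whether ref->hyp confusions exceed uniform chance.

            pairs: aligned (reference, hypothesis) phone pairs.
            Null hypothesis: a misrecognised `ref` token is equally likely
            to surface as any of the other n_phones - 1 phones.
            """
            counts = Counter(pairs)
            errors = sum(c for (r, h), c in counts.items() if r == ref and h != ref)
            if errors == 0:
                return False
            k = counts.get((ref, hyp), 0)
            test = binomtest(k, errors, 1.0 / (n_phones - 1), alternative="greater")
            return test.pvalue < alpha

        # Hypothetical alignment data: /s/ -> /th/ looks like a candidate pattern.
        pairs = [("s", "s"), ("s", "th"), ("s", "th"), ("s", "f"), ("r", "w")]
        print(confusion_is_systematic(pairs, "s", "th", n_phones=40))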

    Word-Final /s/ in English

    Synopsis: The complexities of speech production, perception, and comprehension are enormous. Theoretical approaches to these complexities have most recently faced the challenge of accounting for findings on subphonemic differences. The aim of the present dissertation is to establish a robust foundation of findings on such subphonemic differences. One rather popular case of differences in subphonemic detail is word-final /s/ and /z/ in English (henceforth S), as it serves a number of morphological functions. Using word-final S, three general issues are investigated. First, are there subphonemic durational differences between different types of word-final S, and if so, how can they be accounted for? Second, can such subphonemic durational differences be perceived? Third, do such subphonemic durational differences influence the comprehension of S? These questions are investigated in five highly controlled studies: a production task, an implementation of Linear Discriminative Learning, a same-different task, and two number-decision tasks. Using not only real words but also pseudowords as target items, potentially confounding effects of lexical storage are controlled for. Concerning the first issue, the results show that there are indeed durational differences between the types of word-final S: non-morphemic S is longest in duration, clitic S is shortest, and plural S falls in between. The durational differences appear to be connected to a word's semantic activation diversity and its phonological certainty. Regarding the second issue, subphonemic durational differences in word-final S can be perceived, with differences of 35 ms and above being reliably detected. In regard to the third issue, subphonemic durational differences are found not to influence the speed of comprehension, but they show a significant effect on the process of comprehension. The overall results give rise to a revision of various extant models of speech production, perception, and comprehension.
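    As a purely illustrative rendering of the first question, one could compare S durations across the three morphological types with a simple group test. The numbers below are invented; only their ordering (non-morphemic > plural > clitic) mirrors the reported finding, and the dissertation itself uses far more controlled designs than this:

        import numpy as np
        from scipy.stats import kruskal

        rng = np.random.default_rng(0)
        # Hypothetical /s/ durations in milliseconds per S type
        non_morphemic = rng.normal(120, 15, 50)
        plural        = rng.normal(105, 15, 50)
        clitic        = rng.normal(90, 15, 50)

        stat, p = kruskal(non_morphemic, plural, clitic)
        print(f"Kruskal-Wallis H = {stat:.1f}, p = {p:.2g}")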

    Semi-continuous hidden Markov models for speech recognition


    Production, perception, and comprehension of subphonemic detail


    Multi-level acoustic modeling for automatic speech recognition

    Thesis (Ph.D.), Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2012, by Hung-An Chang. Includes bibliographical references (p. 183-192).

    Context-dependent acoustic modeling is commonly used in large-vocabulary Automatic Speech Recognition (ASR) systems as a way to model the coarticulatory variations that occur during speech production. Typically, the local phoneme context is used to define context-dependent units. Because the number of possible context-dependent units can grow exponentially with the length of the contexts, many units will not have enough training examples to train a robust model, resulting in a data sparsity problem. For nearly two decades, this data sparsity problem has been dealt with by a clustering-based framework which systematically groups different context-dependent units into clusters such that each cluster has enough data. Although it deals with the data sparsity issue, the clustering-based approach also forces all context-dependent units within a cluster to share the same acoustic score, resulting in a quantization effect that can limit the performance of the context-dependent model. In this work, a multi-level acoustic modeling framework is proposed to address both the data sparsity problem and the quantization effect. Under the multi-level framework, each context-dependent unit is associated with classifiers that target multiple levels of contextual resolution, and the outputs of the classifiers are linearly combined for scoring during recognition. By choosing the classifiers judiciously, both the data sparsity problem and the quantization effect can be dealt with. The proposed multi-level framework can be integrated into existing large-vocabulary ASR systems, such as FST-based systems, and is compatible with state-of-the-art error reduction techniques such as discriminative training. Multiple sets of experiments compare the clustering-based acoustic model and the proposed multi-level model. In a phonetic recognition experiment on TIMIT, the multi-level model yields about 8% relative improvement in phone error rate, showing that the framework helps improve phonetic prediction accuracy. In a large-vocabulary transcription task, combining the multi-level framework with discriminative training provides more than 20% relative improvement in Word Error Rate (WER) over a clustering baseline, showing that the framework integrates into existing large-vocabulary decoding frameworks and combines well with discriminative training. In a speaker-adaptive transcription task, the multi-level model yields about 14% relative WER improvement, showing that the framework can adapt better to new speakers, and potentially to new environments, than the conventional clustering-based approach.
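    The core scoring idea can be sketched compactly: each context-dependent unit is scored by a weighted sum of classifier outputs at several contextual resolutions, so coarse levels (which always have training data) cover sparse contexts, while fine levels keep distinct contexts apart. The diagonal-Gaussian scorers, contexts, and weights below are invented for illustration; the thesis's actual classifiers and combination weights are trained:

        import numpy as np

        def gaussian_logpdf(x, mean, var):
            """Diagonal-Gaussian log-likelihood of a feature frame."""
            return float(-0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var)))

        def multilevel_score(frame, unit, models, weights):
            """Linearly combine log scores from classifiers at two levels of
            contextual resolution: the full triphone and the centre phone."""
            contexts = [unit, unit[1]]  # (left, centre, right) triphone, then centre phone
            return sum(w * gaussian_logpdf(frame, *models[lvl][ctx])
                       for lvl, (w, ctx) in enumerate(zip(weights, contexts)))

        # Toy parameters: {level: {context: (mean, variance)}}
        models = {
            0: {("k", "ae", "t"): (np.array([1.0, 0.5]), np.array([0.2, 0.2]))},
            1: {"ae": (np.array([0.8, 0.4]), np.array([0.5, 0.5]))},
        }
        frame = np.array([0.9, 0.45])
        print(multilevel_score(frame, ("k", "ae", "t"), models, weights=[0.7, 0.3]))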

    Connectionist Speech Recognition: A Hybrid Approach


    Feature extraction and event detection for automatic speech recognition
