
    A continuous speech recognition system integrating additional acoustic knowledge sources in a data-driven beam search algorithm

    The paper presents a continuous speech recognition system that integrates an additional acoustic knowledge source into the data-driven beam search algorithm. Details of the object-oriented implementation of the beam search algorithm are given. Integration of additional knowledge sources is treated within the flexible framework of Dempster-Shafer theory. As a first example, a rule-based plosive detector is added to the baseline system.
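
    The abstract does not spell out how the plosive detector's evidence is fused with the baseline acoustic scores, so the following is only a minimal, hypothetical sketch of the core operation such a Dempster-Shafer integration relies on: Dempster's rule of combination for two mass functions over a small frame of discernment (all names and mass values below are illustrative assumptions, not taken from the paper).

        from itertools import product

        def combine_dempster(m1, m2):
            """Dempster's rule of combination for two basic probability assignments.

            m1, m2: dicts mapping frozenset (a set of hypotheses) -> mass, each summing to 1.
            Returns the normalized combined mass function.
            """
            combined = {}
            conflict = 0.0
            for (b, mb), (c, mc) in product(m1.items(), m2.items()):
                inter = b & c
                if inter:                      # agreeing evidence goes to the intersection
                    combined[inter] = combined.get(inter, 0.0) + mb * mc
                else:                          # disjoint focal elements contribute to conflict
                    conflict += mb * mc
            if conflict >= 1.0:
                raise ValueError("totally conflicting evidence cannot be combined")
            return {a: m / (1.0 - conflict) for a, m in combined.items()}

        # Hypothetical fusion of softened baseline scores with a rule-based plosive detector
        theta = frozenset({"plosive", "non_plosive"})
        m_baseline = {frozenset({"plosive"}): 0.6, theta: 0.4}
        m_detector = {frozenset({"plosive"}): 0.7, frozenset({"non_plosive"}): 0.1, theta: 0.2}
        print(combine_dempster(m_baseline, m_detector))

    In a beam search, the fused belief for a hypothesis could then be folded into that hypothesis's path score before pruning, which is one plausible reading of integrating an additional knowledge source into the data-driven search.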

    Large vocabulary continuous speech recognition using linguistic features and constraints

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 111-123).

    Automatic speech recognition (ASR) is a process of applying constraints, as encoded in the computer system (the recognizer), to the speech signal until ambiguity is satisfactorily resolved, to the extent that only one sequence of words is hypothesized. Such constraints fall naturally into two categories. One deals with the ordering of words (syntax) and the organization of their meanings (semantics, pragmatics, etc.). The other governs how speech signals are related to words, a process often termed "lexical access". This thesis studies the Huttenlocher-Zue lexical access model, its implementation in a modern probabilistic speech recognition framework, and its application to continuous speech from an open vocabulary.

    The Huttenlocher-Zue model advocates a two-pass lexical access paradigm. In the first pass, the lexicon is effectively pruned using broad linguistic constraints; in the original model, the authors proposed six linguistic features motivated by the manner of pronunciation. The first pass classifies speech signals into a sequence of linguistic features, and only words that match this sequence - the cohort - are activated. The second pass performs a detailed acoustic-phonetic analysis within the cohort to decide the identity of the word. This model differs from the lexical access model commonly employed in today's speech recognizers, where detailed acoustic-phonetic analysis is performed directly and lexical items are retrieved in one pass.

    The thesis first studies the implementation issues of the Huttenlocher-Zue model. A number of extensions to the original proposal are made to take advantage of the existing facilities of a probabilistic, graph-based recognition framework and, more importantly, to model the broad linguistic features in a data-driven approach. First, we analyze speech signals along the two diagonal dimensions of manner and place of articulation, rather than the manner dimension alone. Secondly, we adopt a set of feature-based landmarks optimized for data-driven modeling as the basic recognition units, and Gaussian mixture models are trained for these units. We explore information fusion techniques to integrate constraints from both the manner and place dimensions, and examine how to integrate constraints from the feature-based first pass with the second pass of detailed acoustic-phonetic analysis. Our experiments on a large-vocabulary isolated word recognition task show that, while constraints from each individual feature dimension provide only limited help in this lexical access model, the use of both dimensions combined with information fusion techniques leads to a significant performance gain over a one-pass phonetic system.

    The thesis then proposes to generalize the original Huttenlocher-Zue model, which limits itself to isolated word tasks, to handle continuous speech. With continuous speech, the search space for both stages is infinite if all possible word sequences are allowed. We generalize the original cohort idea from the Huttenlocher-Zue proposal and use the bag of words of the first pass's N-best list as the cohort for continuous speech. This approach transfers the constraints of broad linguistic features into a much reduced search space for the second stage. The thesis also studies how to recover from errors made by the first pass, an issue not discussed in the original Huttenlocher-Zue proposal. In continuous speech recognition, a way of recovering from first-pass errors is vital to the performance of the overall system. We find empirical evidence that such errors tend to occur around function words, possibly due to the lack of prominence, in meaning and hence in linguistic features, of such words. This thesis proposes an error-recovery mechanism for the two-pass lexical access model based on empirical analysis on a development set. Our experiments on a medium-sized, telephone-quality continuous speech recognition task achieve higher accuracy than a state-of-the-art one-pass baseline system.

    The thesis applies the generalized two-pass lexical access model to the challenge of recognizing continuous speech from an open vocabulary. Telephony information query systems often need to deal with a large list of words that are not observed in the training data, for example the city names in a weather information query system. The large portion of the vocabulary unseen in the training data - the open vocabulary - poses a serious data-sparseness problem for both acoustic and language modeling. A two-pass lexical access model provides a solution by activating a small cohort within the open vocabulary in the first pass, thus significantly reducing the data-sparseness problem. Also, the broad linguistic constraints in the first pass generalize better to unseen data than finer, context-dependent acoustic-phonetic models. The thesis also studies a data-driven analysis of acoustic similarities among open-vocabulary items; the results are used for recovering possible errors in the first pass. This approach demonstrates an advantage over a two-pass approach based on specific semantic constraints.

    In summary, this thesis implements the original Huttenlocher-Zue two-pass lexical access model in a modern probabilistic speech recognition framework and extends it to recognize continuous speech from an open vocabulary, with the two-stage model achieving better performance than the baseline system. In the future, sub-lexical linguistic hierarchy constraints, such as syllables, can be introduced into this two-pass model to further improve lexical access performance.

    by Min Tang. Ph.D.
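
    As a rough illustration of the cohort idea generalized to continuous speech described above: the bag of words appearing in the first pass's N-best list becomes the only lexicon the detailed second pass may use. The sketch below assumes a simple (word_sequence, score) N-best representation and a caller-supplied rescoring function; neither is taken from the thesis, whose actual system is a probabilistic, graph-based recognizer.

        def cohort_from_nbest(nbest, n=10):
            """Collect the bag of words of the top-n first-pass hypotheses.

            nbest: list of (word_sequence, score) pairs, best hypothesis first.
            Returns the set of words the second pass is allowed to hypothesize.
            """
            cohort = set()
            for words, _score in nbest[:n]:
                cohort.update(words)
            return cohort

        def second_pass(lexicon, cohort, rescore):
            """Prune the lexicon to the cohort, then run detailed acoustic-phonetic rescoring."""
            pruned = {word: pron for word, pron in lexicon.items() if word in cohort}
            return rescore(pruned)

    The point of the design is that the broad-class first pass shrinks an effectively infinite continuous-speech search space to a small, utterance-specific vocabulary before any fine-grained acoustic-phonetic modeling is attempted.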

    Integrated speech and morphological processing in a connectionist continuous speech understanding for Korean

    A new, tightly coupled speech and natural language integration model is presented for a TDNN-based continuous, possibly large-vocabulary, speech recognition system for Korean. Unlike popular N-best techniques, developed mainly for integrating HMM-based speech recognition and natural language processing at the word level and therefore inadequate for morphologically complex agglutinative languages, our model constructs a spoken language system based on morpheme-level speech and language integration. With this integration scheme, the spoken Korean processing engine (SKOPE) is designed and implemented using a TDNN-based diphone recognition module integrated with Viterbi-based lexical decoding and symbolic phonological/morphological co-analysis. Our experimental results show that speaker-dependent continuous eojeol (Korean word) recognition and integrated morphological analysis can be achieved with over an 80.6% success rate directly from speech input for middle-level vocabularies.
    Comment: LaTeX source with a4 style, 15 pages, to be published in the Computer Processing of Oriental Languages journal.
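
    The abstract gives no implementation detail for the Viterbi-based lexical decoding step, so the following is only a simplified, hypothetical sketch of how per-frame TDNN unit posteriors might be aligned against one lexical entry with a left-to-right Viterbi pass (the data layout, the absence of transition penalties, and the function name are all assumptions, not SKOPE's actual decoder).

        def viterbi_word_score(frame_logprobs, pronunciation):
            """Best log-probability of aligning a pronunciation to a sequence of frames.

            frame_logprobs: one dict per frame, mapping unit label (e.g. a diphone) -> log-prob.
            pronunciation: list of unit labels for the lexical entry, in order.
            """
            NEG = float("-inf")
            T, S = len(frame_logprobs), len(pronunciation)
            dp = [NEG] * S                      # dp[s]: best score ending in unit s at current frame
            dp[0] = frame_logprobs[0].get(pronunciation[0], NEG)
            for t in range(1, T):
                new = [NEG] * S
                for s in range(S):
                    stay = dp[s]                               # remain in the same unit
                    advance = dp[s - 1] if s > 0 else NEG      # move on to the next unit
                    new[s] = max(stay, advance) + frame_logprobs[t].get(pronunciation[s], NEG)
                dp = new
            return dp[-1]                       # -inf if the pronunciation cannot fit the frames

    Scoring every candidate entry this way and keeping the best-scoring sequence is one simple reading of coupling a diphone recognition module with Viterbi-based lexical decoding; the actual system additionally performs phonological/morphological co-analysis, which is not modeled here.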

    SKOPE: A connectionist/symbolic architecture of spoken Korean processing

    Spoken language processing requires speech and natural language integration. Moreover, spoken Korean calls for a unique processing methodology due to its linguistic characteristics. This paper presents SKOPE, a connectionist/symbolic spoken Korean processing engine, which emphasizes that: 1) connectionist and symbolic techniques must be selectively applied according to their relative strengths and weaknesses, and 2) the linguistic characteristics of Korean must be fully considered for phoneme recognition, speech and language integration, and morphological/syntactic processing. The design and implementation of SKOPE demonstrate how connectionist/symbolic hybrid architectures can be constructed for spoken agglutinative language processing. SKOPE also presents many novel ideas for speech and language processing. The phoneme recognition, morphological analysis, and syntactic analysis experiments show that SKOPE is a viable approach to spoken Korean processing.
    Comment: 8 pages, LaTeX, uses aaai.sty & aaai.bst, bibfile: nlpsp.bib; to be presented at the IJCAI95 workshops on new approaches to learning for natural language processing.

    Speech Synthesis Based on Hidden Markov Models
