
    Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework

    Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect of the problem that we address is how to prune entries from such a lexicon, since, empirically, lexicons with too many entries tend not to be good for ASR performance. Experiments on various ASR tasks show that, with the proposed framework, starting with an initial lexicon of several thousand words, we are able to learn a lexicon which performs close to a full expert lexicon in terms of WER on test data, and is better than lexicons built using G2P alone or with a pruning criterion based on pronunciation probability.
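The greedy selection idea in this abstract can be illustrated with a minimal sketch (my own notation, not the paper's implementation): candidate `(word, pronunciation)` entries are added one at a time, always taking the entry with the largest gain, and selection stops once no candidate's gain exceeds a threshold. Here `score_gain` is a stand-in for the acoustic-likelihood gain the paper computes from transcribed data.

```python
# Hypothetical sketch of greedy pronunciation selection with pruning.
# score_gain(selected, cand) stands in for the acoustic-evidence gain of
# adding cand given the entries already selected; real systems would
# compute this from forced alignments of transcribed speech.

def greedy_select(candidates, score_gain, threshold=0.0):
    """candidates: list of (word, pronunciation) pairs.
    Returns the greedily selected sublist, pruned by the gain threshold."""
    selected = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: score_gain(selected, c))
        gain = score_gain(selected, best)
        if gain <= threshold:
            break  # remaining entries would hurt (or not help) the lexicon
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a toy gain function that penalizes extra pronunciations for an already-covered word, low-value variants are pruned automatically, mirroring the paper's observation that over-full lexicons hurt ASR.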

    Pronunciation modeling for Cantonese speech recognition.

    Kam Patgi. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaf 103). Abstracts in English and Chinese.
    Contents:
    Chapter 1. Introduction
        1.1 Automatic Speech Recognition
        1.2 Pronunciation Modeling in ASR
        1.3 Objectives of the Thesis
        1.4 Thesis Outline
    Chapter 2. The Cantonese Dialect
        2.1 Cantonese - A Typical Chinese Dialect
            2.1.1 Cantonese Phonology
            2.1.2 Cantonese Phonetics
        2.2 Pronunciation Variation in Cantonese
            2.2.1 Phone Change and Sound Change
            2.2.2 Notation for Different Sound Units
        2.3 Summary
    Chapter 3. Large-Vocabulary Continuous Speech Recognition for Cantonese
        3.1 Feature Representation of the Speech Signal
        3.2 Probabilistic Framework of ASR
        3.3 Hidden Markov Model for Acoustic Modeling
        3.4 Pronunciation Lexicon
        3.5 Statistical Language Model
        3.6 Decoding
        3.7 The Baseline Cantonese LVCSR System
            3.7.1 System Architecture
            3.7.2 Speech Databases
        3.8 Summary
    Chapter 4. Pronunciation Model
        4.1 Pronunciation Modeling at Different Levels
        4.2 Phone-Level Pronunciation Model and its Application
            4.2.1 IF Confusion Matrix (CM)
            4.2.2 Decision Tree Pronunciation Model (DTPM)
            4.2.3 Refinement of Confusion Matrix
        4.3 Summary
    Chapter 5. Pronunciation Modeling at Lexical Level
        5.1 Construction of PVD
        5.2 PVD Pruning by Word Unigram
        5.3 Recognition Experiments
            5.3.1 Experiment 1 - Pronunciation Modeling in LVCSR
            5.3.2 Experiment 2 - Pronunciation Modeling in a Domain-Specific Task
            5.3.3 Experiment 3 - PVD Pruning by Word Unigram
        5.4 Summary
    Chapter 6. Pronunciation Modeling at Acoustic Model Level
        6.1 Hierarchy of HMM
        6.2 Sharing of Mixture Components
        6.3 Adaptation of Mixture Components
        6.4 Combination of Mixture Component Sharing and Adaptation
        6.5 Recognition Experiments
        6.6 Result Analysis
            6.6.1 Performance of Sharing Mixture Components
            6.6.2 Performance of Mixture Component Adaptation
        6.7 Summary
    Chapter 7. Pronunciation Modeling at Decoding Level
        7.1 Search Process in Cantonese LVCSR
        7.2 Model-Level Search Space Expansion
        7.3 State-Level Output Probability Modification
        7.4 Recognition Experiments
            7.4.1 Experiment 1 - Model-Level Search Space Expansion
            7.4.2 Experiment 2 - State-Level Output Probability Modification
        7.5 Summary
    Chapter 8. Conclusions and Suggestions for Future Work
        8.1 Conclusions
        8.2 Suggestions for Future Work
    Appendix I: Base Syllable Table
    Appendix II: Cantonese Initials and Finals
    Appendix III: IF Confusion Matrix
    Appendix IV: Phonetic Question Set
    Appendix V: CDDT and PCDT

    Learning Lexicons From Speech Using a Pronunciation Mixture Model


    Feature-based pronunciation modeling for automatic speech recognition

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 131-140).
    Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech. One approach to handling this variation is to expand the dictionary with phonetic substitution, insertion, and deletion rules. Common rule sets, however, typically leave many pronunciation variants unaccounted for and increase word confusability because of the coarse granularity of phone units. We present an alternative approach, in which many types of variation are explained by representing a pronunciation as multiple streams of linguistic features rather than a single stream of phones. Features may correspond to the positions of the speech articulators, such as the lips and tongue, or to acoustic or perceptual categories. By allowing for asynchrony between features and per-feature substitutions, many pronunciation changes that are difficult to account for with phone-based models become quite natural. Although it is well known that many phenomena can be attributed to this "semi-independent evolution" of features, previous models of pronunciation variation have typically not taken advantage of it. In particular, we propose a class of feature-based pronunciation models represented as dynamic Bayesian networks (DBNs). The DBN framework allows us to naturally represent the factorization of the state space of feature combinations into feature-specific factors, and provides standard algorithms for inference and parameter learning. We investigate the behavior of such a model in isolation using manually transcribed words. Compared to a phone-based baseline, the feature-based model has both higher coverage of observed pronunciations and a higher recognition rate for isolated words. We also discuss ways in which such a model can be incorporated into various types of end-to-end speech recognizers, and present several examples of implemented systems, for both acoustic speech recognition and lipreading tasks.
    by Karen Livescu. Ph.D.

    Towards an automatic speech recognition system for use by deaf students in lectures

    According to the Royal National Institute for Deaf People, there are nearly 7.5 million hearing-impaired people in Great Britain. Human-operated machine transcription systems, such as Palantype, achieve low word error rates in real time. The disadvantage is that they are very expensive to use because of the difficulty of training operators, making them impractical for everyday use in higher education. Existing automatic speech recognition systems also achieve low word error rates; the disadvantage is that they work only for read speech in a restricted domain, and moving a system to a new domain requires a large amount of relevant data for training acoustic and language models. The adopted solution uses an existing continuous-speech phoneme recognition system as a front-end to a word recognition subsystem. The subsystem generates a lattice of word hypotheses using dynamic programming, with robust parameter estimation obtained using evolutionary programming. Sentence hypotheses are obtained by parsing the word lattice using a beam search and contributing knowledge consisting of anti-grammar rules, which check the syntactic incorrectness of word sequences, and word frequency information. On an unseen spontaneous lecture taken from the Lund Corpus, and using a dictionary containing 2637 words, the system achieved 81.5% words correct with 15% simulated phoneme error, and 73.1% words correct with 25% simulated phoneme error. The system was also evaluated on 113 Wall Street Journal sentences.
    The achievements of the work are: a domain-independent method, using the anti-grammar, to reduce the word lattice search space whilst allowing normal spontaneous English to be spoken; a system designed to allow integration with new sources of knowledge, such as semantics or prosody, providing a test-bench for determining the impact of different knowledge sources upon word lattice parsing without the need for the underlying speech recognition hardware; and the robustness of the word lattice generation using parameters that withstand changes in vocabulary and domain.
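The combination of beam search and anti-grammar pruning over a word lattice might be sketched roughly as follows. All names and the lattice encoding here are assumptions for illustration (the thesis's system is far richer): the lattice maps a position to outgoing `(word, score, next_position)` arcs, anti-grammar rules are modelled as forbidden bigrams, and a beam keeps only the best partial hypotheses at each expansion step.

```python
# Hedged sketch of word-lattice parsing with anti-grammar pruning.
# lattice: dict mapping position -> list of (word, score, next_position).
# forbidden_bigrams: set of (prev_word, word) pairs the anti-grammar rejects.

def lattice_beam_search(lattice, start, end, forbidden_bigrams, beam=3):
    hyps = [([], 0.0, start)]          # (words so far, score, position)
    complete = []
    while hyps:
        nxt = []
        for words, score, pos in hyps:
            if pos == end:
                complete.append((words, score))
                continue
            for word, s, npos in lattice.get(pos, []):
                if words and (words[-1], word) in forbidden_bigrams:
                    continue           # anti-grammar rule prunes this path
                nxt.append((words + [word], score + s, npos))
        nxt.sort(key=lambda h: h[1], reverse=True)
        hyps = nxt[:beam]              # beam: keep only the best partials
    return max(complete, key=lambda h: h[1]) if complete else None
```

The pruning acts negatively, as in the abstract: instead of licensing valid sentences with a grammar, it only discards word sequences the rules mark as syntactically incorrect.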

    Unsupervised pattern discovery in speech : applications to word acquisition and speaker segmentation

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2007. Includes bibliographical references (p. 167-176).
    We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a pre-specified inventory of lexical units (i.e. phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multi-word phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream. We demonstrate two applications of our pattern discovery procedure. First, we propose and evaluate two methods for automatically identifying sound clusters generated through pattern discovery. Our results show that high identification accuracy can be achieved for single-word clusters using a constrained isolated word recognizer. Second, we apply acoustic pattern matching to the problem of speaker segmentation by attempting to find word-level speech patterns that are repeated by the same speaker. When used to segment a ten-hour corpus of multi-speaker lectures, we found that our approach generates segmentations that correlate well with independently generated human segmentations.
    by Alex Seungryong Park. Ph.D.
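The "widely used dynamic programming technique" underlying this matching is dynamic time warping. A minimal, non-segmental DTW sketch (assuming scalar features and a length-normalized cost; the thesis's segmental variant differs) shows the core recurrence: low cost between two utterance fragments suggests a repeated pattern such as a shared word.

```python
# Minimal DTW sketch: normalized alignment cost between two sequences.
# dist is the local frame distance; real systems use vector features
# (e.g. MFCCs) with a Euclidean or cosine distance.

def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of: delete from a, delete from b, match.
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)  # normalize so sequence length doesn't dominate
```

Because the warp absorbs local timing differences, two renditions of the same word spoken at different rates still align at near-zero cost, which is what makes cross-utterance pattern clustering possible.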

    Automatic orthographic alignment of speech

    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Includes bibliographical references (leaves 86-87).
    by Jerome S. Khohayting. M.Eng.

    Single Speaker Segmentation and Inventory Selection Using Dynamic Time Warping Self Organization and Joint Multigram Mapping

    In speech synthesis, the inventory of units is normally decided by inspection, on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self-organisation techniques can be applied to build an inventory from collected acoustic data together with the constraints of a synthesis lexicon. In this paper we describe a prototype inventory creation method that uses dynamic time warping (DTW) for acoustic clustering and a joint multigram approach for relating a series of symbols that represent the speech to these emergent units. We initially examined two symbol sets: 1) a baseline of standard phones, and 2) orthographic symbols. The success of the approach is evaluated by comparing word boundaries generated by the emergent phones against those created using state-of-the-art HMM segmentation. Initial results suggest the DTW segmentation can match word boundaries with a root mean square error (RMSE) of 35 ms. Mapping units onto phones resulted in a higher RMSE of 103 ms; this error increased when multiple multigram types were added and when the default unit clustering was altered from 40 (our baseline) to 10. Orthographic matching had a higher RMSE of 125 ms. To conclude, we discuss future work that we believe can reduce this error rate to a level sufficient for the techniques to be applied to a unit selection synthesis system.
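The boundary evaluation against HMM segmentation can be sketched as an RMSE over boundary-time errors. The pairing rule below, scoring each hypothesized boundary against its nearest reference boundary, is an assumption for illustration; the paper does not specify how boundaries are matched.

```python
# Sketch of word-boundary RMSE evaluation (nearest-reference pairing is an
# assumed convention, not necessarily the paper's).

import math

def boundary_rmse(hyp, ref):
    """hyp, ref: lists of boundary times in seconds.
    Returns the RMSE, in milliseconds, of each hypothesized boundary
    against its nearest reference boundary."""
    errs = [min(abs(h - r) for r in ref) for h in hyp]
    return 1000.0 * math.sqrt(sum(e * e for e in errs) / len(errs))
```

Under this metric, the paper's reported figures (35 ms for DTW segmentation, 103 ms for phone mapping, 125 ms for orthographic matching) would correspond to typical per-boundary displacements of a few tens of milliseconds.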