528 research outputs found

    Incorporating pitch features for tone modeling in automatic recognition of Mandarin Chinese

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. By Karen Lingyun Chu. Cataloged from PDF version of thesis. Includes bibliographical references (p. 53-56). Tone plays a fundamental role in Mandarin Chinese because it is lexical: it determines the meanings of spoken words. For example, the two sentences ... (I like horses) and ... (I like to scold) differ only in the tone carried by the last syllable. Including tone-related information through analysis of pitch data should therefore improve the performance of automatic speech recognition (ASR) systems on Mandarin Chinese. The focus of this thesis is to improve the performance of a non-tonal ASR system on a Mandarin Chinese corpus by modifying the system code to incorporate pitch features. We compile and format a Mandarin Chinese broadcast news corpus for use with the ASR system, and implement a pitch feature extraction algorithm. Additionally, we investigate two algorithms for incorporating pitch features in Mandarin Chinese speech recognition. First, we build and test a baseline tonal ASR system with embedded tone modeling by concatenating the cepstral and pitch feature vectors for use as the input to our phonetic model (a Hidden Markov Model, or HMM). We find that our embedded tone modeling algorithm does improve performance on Mandarin Chinese, showing that tonal information contributes to Mandarin Chinese speech recognition. Second, we implement and test the effectiveness of HMM-based multistream models.
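
The embedded tone modeling step described in this abstract, concatenating cepstral and pitch feature vectors frame by frame, can be sketched roughly as follows. This is a minimal illustration and not the thesis code: the function name `concat_features`, the feature dimensions, and the toy values are all assumptions made for the example.

```python
def concat_features(cepstral, pitch):
    """Append each frame's pitch features to its cepstral features."""
    if len(cepstral) != len(pitch):
        raise ValueError("both streams must have the same number of frames")
    return [list(c) + list(p) for c, p in zip(cepstral, pitch)]


# Hypothetical toy data: 3 frames, 2 cepstral coefficients and 1 pitch value per frame.
cepstral = [[1.0, 0.5], [0.9, 0.4], [1.1, 0.6]]
pitch = [[5.0], [5.1], [5.2]]

# Each combined frame would be the observation vector passed to the HMM.
combined = concat_features(cepstral, pitch)
```

A multistream model, by contrast, would keep the two streams separate and combine their likelihoods inside the HMM rather than merging the vectors up front.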

    PHONOTACTIC AND ACOUSTIC LANGUAGE RECOGNITION

    This thesis deals with phonotactic and acoustic techniques for automatic language recognition (LRE). The first part covers phonotactic language recognition based on co-occurrences of phone sequences in speech. It presents a thorough study of phone recognition as a tokenization technique for LRE, focusing on the amount of training data for the phone recognizer and on combining phone recognizers trained on several languages (Parallel Phone Recognition followed by Language Model, PPRLM). The thesis also introduces the novel technique of anti-models in PPRLM and investigates using phone lattices instead of strings. The work on the phonotactic approach concludes with a comparison of classical n-gram modeling techniques and binary decision trees. The acoustic approach is addressed as well, with the main focus on discriminative techniques for training target-language acoustic models and on initial (but successful) experiments with removing channel dependencies. We also investigate the fusion of the phonotactic and acoustic approaches. All experiments were performed on standard data from the NIST 2003, 2005 and 2007 evaluations, so the results are directly comparable to those of other laboratories in the LRE community. With the techniques above, the fused systems defined the state of the art in the LRE field and reached excellent results in two NIST evaluations.
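
The phonotactic idea in this abstract, scoring a phone sequence against one n-gram model per language and picking the best-scoring language, can be illustrated with a small sketch. It is a simplification under stated assumptions: it uses a single tokenizer rather than parallel phone recognizers, bigrams with add-one smoothing, and invented toy phone strings; the names `train_bigram`, `log_likelihood`, and `recognize` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count phone bigrams and their left contexts across training sequences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, vocab


def log_likelihood(seq, model, vocab_size):
    """Log-probability of a phone sequence under an add-one smoothed bigram model."""
    bigrams, contexts, _ = model
    padded = ["<s>"] + list(seq)
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total


def recognize(seq, models):
    """Return the language whose phonotactic model scores the sequence highest."""
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda lang: log_likelihood(seq, models[lang], vocab_size))


# Invented toy phone strings for two hypothetical languages.
training = {
    "lang_a": [["b", "a", "b", "a"], ["a", "b", "a"]],
    "lang_b": [["k", "o", "k", "o"], ["o", "k", "o"]],
}
models = {lang: train_bigram(seqs) for lang, seqs in training.items()}
guess = recognize(["b", "a", "b"], models)
```

In a full PPRLM setup, each of several phone recognizers would feed its own bank of language models, and the per-recognizer scores would then be fused.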

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, the thesis also includes research on non-native and code-switching speech.

    Low Resource Efficient Speech Retrieval

    Speech retrieval refers to the task of retrieving information that is useful or relevant to a user query from a speech collection. This thesis examines ways to improve speech retrieval in two respects: requiring low resources, that is, working without the extensively annotated corpora on which automated processing systems are typically built, and achieving high computational efficiency. The work focuses on two speech retrieval technologies, spoken keyword retrieval and spoken document classification. First, keyword retrieval, also referred to as keyword search (KWS) or spoken term detection, is defined as the task of retrieving the occurrences of a keyword, specified by the user in text form, from speech collections. We make advances in an open-vocabulary KWS platform using a context-dependent Point Process Model (PPM). We further develop a PPM-based lattice generation framework, which improves KWS performance and enables automatic speech recognition (ASR) decoding. Second, the massive volumes of speech data motivate the effort to organize and search speech collections through spoken document classification. When classifying real-world unstructured speech into predefined classes, recordings collected in the wild can be extremely long, of varying length, and contain multiple class label shifts at variable locations in the audio. For this reason each spoken document is often first split into sequential segments, and then each segment is independently classified. We present a general-purpose method for classifying spoken segments, using a cascade of language-independent acoustic modeling, foreign-language to English translation lexicons, and English-language classification. Next, instead of classifying each segment independently, we demonstrate that exploiting the contextual dependencies across sequential segments can provide large improvements in classification performance. Lastly, we remove the need for any orthographic lexicon and instead exploit alternative unsupervised approaches to decoding speech in terms of automatically discovered word-like or phoneme-like units. We show that spoken segment representations based on such lexical or phonetic discovery can achieve classification performance competitive with representations based on a domain-mismatched ASR or a universal phone set ASR.

    Speech Recognition

    Chapters in the first part of the book cover the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and applications that operate in real-world environments, such as mobile communication services and smart homes.

    Language modeling for speech recognition of spoken Cantonese.

    Yeung, Yu Ting. Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. Includes bibliographical references (leaves 84-93). Abstracts in English and Chinese.
    Contents:
    Acknowledgement
    Abstract
    Chapter 1. Introduction: 1.1 Cantonese Speech Recognition; 1.2 Objectives; 1.3 Thesis Outline
    Chapter 2. Fundamentals of Large Vocabulary Continuous Speech Recognition: 2.1 Problem Formulation; 2.2 Feature Extraction; 2.3 Acoustic Models; 2.4 Decoding; 2.5 Statistical Language Modeling (2.5.1 N-gram Language Models; 2.5.2 N-gram Smoothing; 2.5.3 Complexity of Language Model; 2.5.4 Class-based Language Model; 2.5.5 Language Model Pruning); 2.6 Performance Evaluation
    Chapter 3. The Cantonese Dialect: 3.1 Phonology of Cantonese; 3.2 Orthographic Representation of Cantonese; 3.3 Classification of Cantonese speech; 3.4 Cantonese-English Code-mixing
    Chapter 4. Rule-based Translation Method: 4.1 Motivations; 4.2 Transformation-based Learning (4.2.1 Algorithm Overview; 4.2.2 Learning of Translation Rules); 4.3 Performance Evaluation (4.3.1 The Learnt Translation Rules; 4.3.2 Evaluation of the Rules; 4.3.3 Analysis of the Rules); 4.4 Preparation of Training Data for Language Modeling; 4.5 Discussion
    Chapter 5. Language Modeling for Cantonese: 5.1 Training Data (5.1.1 Text Corpora; 5.1.2 Preparation of Formal Cantonese Text Data); 5.2 Training of Language Models (5.2.1 Language Models for Standard Chinese; 5.2.2 Language Models for Formal Cantonese; 5.2.3 Language Models for Colloquial Cantonese); 5.3 Evaluation of Language Models (5.3.1 Speech Corpora for Evaluation; 5.3.2 Perplexities of Formal Cantonese Language Models; 5.3.3 Perplexities of Colloquial Cantonese Language Models); 5.4 Speech Recognition Experiments (5.4.1 Speech Corpora; 5.4.2 Experimental Setup; 5.4.3 Results on Formal Cantonese Models; 5.4.4 Results on Colloquial Cantonese Models); 5.5 Analysis of Results; 5.6 Discussion (5.6.1 Cantonese Language Modeling; 5.6.2 Interpolated Language Models; 5.6.3 Class-based Language Models)
    Chapter 6. Towards Language Modeling of Code-mixing Speech: 6.1 Data Collection (6.1.1 Data Collection; 6.1.2 Filtering of Collected Data; 6.1.3 Processing of Collected Data); 6.2 Clustering of Chinese and English Words; 6.3 Language Modeling for Code-mixing Speech (6.3.1 Language Models from Collected Data; 6.3.2 Class-based Language Models; 6.3.3 Performance Evaluation of Code-mixing Language Models); 6.4 Speech Recognition Experiments with Code-mixing Language Models (6.4.1 Experimental Setup; 6.4.2 Monolingual Cantonese Recognition; 6.4.3 Code-mixing Speech Recognition); 6.5 Discussion (6.5.1 Data Collection from the Internet; 6.5.2 Speech Recognition of Code-mixing Speech)
    Chapter 7. Conclusions and Future Work: 7.1 Conclusions (7.1.1 Rule-based Translation Method; 7.1.2 Cantonese Language Modeling; 7.1.3 Code-mixing Language Modeling); 7.2 Future Works (7.2.1 Rule-based Translation; 7.2.2 Training data; 7.2.3 Code-mixing speech)
    Appendix A. Equation Derivation: A.1 Relationship between Average Mutual Information and Perplexity
    Bibliography

    Recent development of the HMM-based speech synthesis system (HTS)

    A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, the spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named “HMM-based speech synthesis system (HTS)” to provide a research and development toolkit for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans.

    Modularity and Neural Integration in Large-Vocabulary Continuous Speech Recognition

    This thesis tackles the problems of modularity in Large-Vocabulary Continuous Speech Recognition with the use of Neural Network