
    Language modeling for speech recognition of spoken Cantonese.

Yeung, Yu Ting. Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. Includes bibliographical references (leaves 84-93). Abstracts in English and Chinese.

Table of contents:
Acknowledgement --- p.iii
Abstract --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Cantonese Speech Recognition --- p.3
Chapter 1.2 --- Objectives --- p.4
Chapter 1.3 --- Thesis Outline --- p.5
Chapter 2 --- Fundamentals of Large Vocabulary Continuous Speech Recognition --- p.7
Chapter 2.1 --- Problem Formulation --- p.7
Chapter 2.2 --- Feature Extraction --- p.8
Chapter 2.3 --- Acoustic Models --- p.9
Chapter 2.4 --- Decoding --- p.10
Chapter 2.5 --- Statistical Language Modeling --- p.12
Chapter 2.5.1 --- N-gram Language Models --- p.12
Chapter 2.5.2 --- N-gram Smoothing --- p.13
Chapter 2.5.3 --- Complexity of Language Model --- p.15
Chapter 2.5.4 --- Class-based Language Model --- p.16
Chapter 2.5.5 --- Language Model Pruning --- p.17
Chapter 2.6 --- Performance Evaluation --- p.18
Chapter 3 --- The Cantonese Dialect --- p.19
Chapter 3.1 --- Phonology of Cantonese --- p.19
Chapter 3.2 --- Orthographic Representation of Cantonese --- p.22
Chapter 3.3 --- Classification of Cantonese Speech --- p.25
Chapter 3.4 --- Cantonese-English Code-mixing --- p.27
Chapter 4 --- Rule-based Translation Method --- p.29
Chapter 4.1 --- Motivations --- p.29
Chapter 4.2 --- Transformation-based Learning --- p.30
Chapter 4.2.1 --- Algorithm Overview --- p.30
Chapter 4.2.2 --- Learning of Translation Rules --- p.32
Chapter 4.3 --- Performance Evaluation --- p.35
Chapter 4.3.1 --- The Learnt Translation Rules --- p.35
Chapter 4.3.2 --- Evaluation of the Rules --- p.37
Chapter 4.3.3 --- Analysis of the Rules --- p.37
Chapter 4.4 --- Preparation of Training Data for Language Modeling --- p.41
Chapter 4.5 --- Discussion --- p.43
Chapter 5 --- Language Modeling for Cantonese --- p.44
Chapter 5.1 --- Training Data --- p.44
Chapter 5.1.1 --- Text Corpora --- p.44
Chapter 5.1.2 --- Preparation of Formal Cantonese Text Data --- p.45
Chapter 5.2 --- Training of Language Models --- p.46
Chapter 5.2.1 --- Language Models for Standard Chinese --- p.46
Chapter 5.2.2 --- Language Models for Formal Cantonese --- p.46
Chapter 5.2.3 --- Language Models for Colloquial Cantonese --- p.47
Chapter 5.3 --- Evaluation of Language Models --- p.48
Chapter 5.3.1 --- Speech Corpora for Evaluation --- p.48
Chapter 5.3.2 --- Perplexities of Formal Cantonese Language Models --- p.49
Chapter 5.3.3 --- Perplexities of Colloquial Cantonese Language Models --- p.51
Chapter 5.4 --- Speech Recognition Experiments --- p.53
Chapter 5.4.1 --- Speech Corpora --- p.53
Chapter 5.4.2 --- Experimental Setup --- p.54
Chapter 5.4.3 --- Results on Formal Cantonese Models --- p.55
Chapter 5.4.4 --- Results on Colloquial Cantonese Models --- p.56
Chapter 5.5 --- Analysis of Results --- p.58
Chapter 5.6 --- Discussion --- p.59
Chapter 5.6.1 --- Cantonese Language Modeling --- p.59
Chapter 5.6.2 --- Interpolated Language Models --- p.59
Chapter 5.6.3 --- Class-based Language Models --- p.60
Chapter 6 --- Towards Language Modeling of Code-mixing Speech --- p.61
Chapter 6.1 --- Data Collection --- p.61
Chapter 6.1.1 --- Data Collection --- p.62
Chapter 6.1.2 --- Filtering of Collected Data --- p.63
Chapter 6.1.3 --- Processing of Collected Data --- p.63
Chapter 6.2 --- Clustering of Chinese and English Words --- p.64
Chapter 6.3 --- Language Modeling for Code-mixing Speech --- p.64
Chapter 6.3.1 --- Language Models from Collected Data --- p.64
Chapter 6.3.2 --- Class-based Language Models --- p.66
Chapter 6.3.3 --- Performance Evaluation of Code-mixing Language Models --- p.67
Chapter 6.4 --- Speech Recognition Experiments with Code-mixing Language Models --- p.69
Chapter 6.4.1 --- Experimental Setup --- p.69
Chapter 6.4.2 --- Monolingual Cantonese Recognition --- p.70
Chapter 6.4.3 --- Code-mixing Speech Recognition --- p.72
Chapter 6.5 --- Discussion --- p.74
Chapter 6.5.1 --- Data Collection from the Internet --- p.74
Chapter 6.5.2 --- Speech Recognition of Code-mixing Speech --- p.75
Chapter 7 --- Conclusions and Future Work --- p.77
Chapter 7.1 --- Conclusions --- p.77
Chapter 7.1.1 --- Rule-based Translation Method --- p.77
Chapter 7.1.2 --- Cantonese Language Modeling --- p.78
Chapter 7.1.3 --- Code-mixing Language Modeling --- p.78
Chapter 7.2 --- Future Work --- p.79
Chapter 7.2.1 --- Rule-based Translation --- p.79
Chapter 7.2.2 --- Training Data --- p.80
Chapter 7.2.3 --- Code-mixing Speech --- p.80
Appendix A --- Equation Derivation --- p.82
Appendix A.1 --- Relationship between Average Mutual Information and Perplexity --- p.82
Bibliography --- p.84
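The thesis text is not reproduced here, but the machinery its outline names (n-gram language models, smoothing, and perplexity-based evaluation; Chapters 2.5 and 5.3) is standard. As a minimal sketch of the idea, the following Python trains a bigram model with add-one smoothing and scores a held-out sentence by perplexity. The toy romanized corpus and the choice of add-one smoothing are illustrative assumptions, not the thesis's actual data or method.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over sentences padded with <s>/</s>."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, vocab_size, w_prev, w):
    # Add-one (Laplace) smoothing: one of the simplest of the
    # smoothing schemes a thesis like this would survey (Ch. 2.5.2).
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(unigrams, bigrams, vocab_size, sentences):
    """Perplexity = 2 ** (minus the average log2 probability per token)."""
    log_prob, n_tokens = 0.0, 0
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for w_prev, w in zip(padded, padded[1:]):
            log_prob += math.log2(
                bigram_prob(unigrams, bigrams, vocab_size, w_prev, w))
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)

# Toy romanized training and test data (hypothetical, for illustration only):
train = [["ngo", "heui", "hok", "haau"], ["keoi", "heui", "gung", "si"]]
test = [["ngo", "heui", "gung", "si"]]
uni, bi = train_bigram_lm(train)
print(perplexity(uni, bi, len(set(uni)), test))
```

A lower perplexity on test text indicates a better-matched language model, which is how the thesis compares formal and colloquial Cantonese models.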

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature.
Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures.
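The meta-analysis aggregates results as relative error rate reductions. As a hedged sketch of that bookkeeping (the function name and the example numbers below are ours, not the paper's):

```python
def relative_error_rate_reduction(wer_baseline: float, wer_adapted: float) -> float:
    """Fraction of baseline word errors removed by adaptation."""
    return (wer_baseline - wer_adapted) / wer_baseline

# E.g. adaptation taking WER from 12.0% to 10.2% is a 15% relative
# reduction (illustrative numbers):
print(relative_error_rate_reduction(0.120, 0.102))  # 0.15
```

Reporting reductions relative to each system's own baseline is what lets the overview compare adaptation methods across papers with very different absolute error rates.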

    Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Peer reviewed.

    Robust text independent closed set speaker identification systems and their evaluation

PhD Thesis.

This thesis focuses upon text independent closed set speaker identification. The contributions relate to evaluation studies in the presence of various types of noise and handset effects. Extensive evaluations are performed on four databases.

The first contribution is in the context of the use of the Gaussian Mixture Model-Universal Background Model (GMM-UBM) with original speech recordings from only the TIMIT database. Four main simulations for Speaker Identification Accuracy (SIA) are presented, including different fusion strategies: late fusion (score based), early fusion (feature based), early-late fusion (a combination of feature and score based), late fusion using concatenated static and dynamic features (features with temporal derivatives such as first order derivative delta and second order derivative delta-delta features, namely acceleration features), and finally fusion of statistically independent normalized scores.

The second contribution is again based on the GMM-UBM approach. Comprehensive evaluations of the effect of Additive White Gaussian Noise (AWGN) and Non-Stationary Noise (NSN), with and without a G.712 type handset, upon identification performance are undertaken. In particular, three NSN types with varying Signal to Noise Ratios (SNRs) were tested, corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; 120 speakers were selected from each database to yield 3,600 speech utterances.

The third contribution is based on the use of the I-vector. Four combinations of I-vectors with 100 and 200 dimensions were employed, and various fusion techniques using maximum, mean, weighted sum and cumulative fusion with the same I-vector dimension were used to improve the SIA. Similarly, both interleaved and concatenated I-vector fusion were exploited to produce 200 and 400 I-vector dimensions. The system was evaluated with four different databases using 120 speakers from each. The TIMIT, SITW and NIST 2008 databases were evaluated for various types of NSN, namely street-traffic NSN, bus-interior NSN and crowd talking NSN; the G.712 type handset at 16 kHz was also applied.

As recommendations from the study, in the GMM-UBM approach mean fusion is found to yield the overall best SIA with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings. In the I-vector approach, however, the best SIA was obtained from weighted sum and concatenated fusion.

The author thanks the Ministry of Higher Education and Scientific Research (MoHESR), the Iraqi Cultural Attaché, and Al-Mustansiriya University College of Engineering, Iraq, for supporting the PhD scholarship.
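The late fusion strategies evaluated above (mean, maximum, and linear weighted sum over per-subsystem scores) amount to simple arithmetic on score matrices. A minimal sketch follows; the array shapes, example weights, and random scores are our own assumptions, not the thesis's configuration.

```python
import numpy as np

def fuse_scores(score_matrices, method="mean", weights=None):
    """Late (score-level) fusion over K subsystems.

    score_matrices: list of K arrays, each (n_trials, n_speakers),
    e.g. GMM-UBM log-likelihoods from different feature streams.
    """
    scores = np.stack(score_matrices)  # shape (K, n_trials, n_speakers)
    if method == "mean":
        return scores.mean(axis=0)
    if method == "max":
        return scores.max(axis=0)
    if method == "weighted_sum":
        w = np.asarray(weights).reshape(-1, 1, 1)
        return (w * scores).sum(axis=0)
    raise ValueError(f"unknown fusion method: {method}")

# Closed-set identification: pick the speaker with the highest fused score.
static = np.random.randn(10, 120)  # scores from static features (illustrative)
delta = np.random.randn(10, 120)   # scores from delta features (illustrative)
fused = fuse_scores([static, delta], "weighted_sum", weights=[0.6, 0.4])
predicted_speaker = fused.argmax(axis=1)
```

Early fusion, by contrast, concatenates the feature vectors before modeling; the thesis compares both, plus their combination.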

    Speech Recognition

Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and applications able to operate in real-world environments such as mobile communication services and smart homes.
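As a concrete pointer to the feature-extraction step covered in the book's first part, here is a minimal MFCC front-end using the librosa library; the library choice, file path, and parameter values are our own assumptions, not the book's prescription.

```python
import librosa

# Load an utterance and extract 13 MFCCs over 25 ms windows with a
# 10 ms hop, a common front-end configuration for HMM-based recognizers.
# "utterance.wav" is a placeholder path.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (13, n_frames)
```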

    Brain responses to contrastive and noncontrastive morphosyntactic structures in African American English and Mainstream American English: ERP evidence for the neural indices of dialect

Recent research has shown that distinct event-related potential (ERP) signatures are associated with switching between languages compared to switching between dialects or registers (e.g., Khamis-Dakwar & Froud, 2007; Moreno, Federmeier & Kutas, 2002). The current investigation builds on these findings to examine whether contrastive and non-contrastive morphosyntactic features in English elicit differing neural responses in bidialectal speakers of African American English (AAE) and Mainstream American English (MAE), compared to monodialectal speakers of MAE. Event-related potentials (ERPs) and behavioral responses (response types and reaction times) to grammaticality judgments targeting a morphosyntactic feature that contrasts between MAE and AAE are presented as evidence of dual-language representation in bidialectal speakers. Results from 30 participants (15 monodialectal; 15 bidialectal) support the notion that bidialectal populations demonstrate neurophysiological profiles distinct from monodialectal groups, as indicated by a significantly greater P600 amplitude in the 500-800 ms time window in the monodialectal group when listening to sentences containing contrasting features. Such evidence can support the development of linguistically informed educational curricula and clinical approaches for speech-language pathologists by elucidating the differing underlying language processes of monodialectal and bidialectal speakers of American English.
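The P600 measure reported above reduces to averaging ERP amplitude within a post-stimulus time window. A minimal numpy sketch of that computation follows; the sampling rate, epoch length, and array layout are our own assumptions, not the study's recording setup.

```python
import numpy as np

def mean_window_amplitude(erp, sfreq, tmin, tmax):
    """Mean ERP amplitude (e.g. in microvolts) over [tmin, tmax] seconds.

    erp: 1-D averaged waveform for one electrode, with the stimulus
    onset at sample 0 (an assumed layout).
    """
    start, stop = int(tmin * sfreq), int(tmax * sfreq)
    return erp[start:stop].mean()

# P600 window used in the study: 500-800 ms post-stimulus.
sfreq = 500.0                              # Hz, illustrative
erp = np.random.randn(int(1.2 * sfreq))    # 1.2 s epoch, illustrative data
p600 = mean_window_amplitude(erp, sfreq, tmin=0.500, tmax=0.800)
```

The group comparison in the study then tests whether this windowed amplitude differs between monodialectal and bidialectal listeners.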