Bayesian Speaker Adaptation Based on a New Hierarchical Probabilistic Model
In this paper, a new hierarchical Bayesian speaker adaptation method called HMAP is proposed that combines the advantages of three conventional algorithms, maximum a posteriori (MAP), maximum-likelihood linear regression (MLLR), and eigenvoice, resulting in excellent performance across a wide range of adaptation conditions. The new method efficiently utilizes intra-speaker and inter-speaker correlation information by modeling phone and speaker subspaces in a consistent hierarchical Bayesian way. The phone variations for a specific speaker are assumed to lie in a low-dimensional subspace. The phone coordinates, which are shared among different speakers, implicitly contain the intra-speaker correlation information. For a specific speaker, the phone variations, represented by speaker-dependent eigenphones, are concatenated into a supervector. The eigenphone supervector space is also a low-dimensional speaker subspace, which contains the inter-speaker correlation information. Using principal component analysis (PCA), a new hierarchical probabilistic model for the generation of the speech observations is obtained. Speaker adaptation based on the new hierarchical model is derived using the maximum a posteriori criterion in a top-down manner. Both batch and online adaptation schemes are proposed. With tuned parameters, the new method handles varying amounts of adaptation data automatically and efficiently. Experimental results on a Mandarin Chinese continuous speech recognition task show good performance under all testing conditions.
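The eigenvoice-style subspace estimation the abstract describes can be sketched with PCA on mean supervectors. This is a hedged illustration only: the dimensions, variable names, and random data below are assumptions for demonstration, not the paper's actual model or notation.

```python
import numpy as np

# Illustrative sketch: estimate a low-dimensional speaker subspace from
# mean supervectors via PCA, in the spirit of eigenvoice adaptation.
# All shapes and data here are invented stand-ins.

rng = np.random.default_rng(0)
n_speakers, dim = 50, 200            # 50 training speakers, supervector dim 200
supervectors = rng.normal(size=(n_speakers, dim))

# Center the supervectors and take the top-k principal directions.
mean_sv = supervectors.mean(axis=0)
centered = supervectors - mean_sv
_, _, vt = np.linalg.svd(centered, full_matrices=False)
k = 10
eigenvoices = vt[:k]                 # (k, dim) basis of the speaker subspace

# A new speaker's model is constrained to the subspace:
# adapted supervector = mean + coordinates @ eigenvoices
coords = centered[0] @ eigenvoices.T     # project one speaker's supervector
adapted = mean_sv + coords @ eigenvoices
print(adapted.shape)                 # (200,)
```

Constraining the adapted model to a low-dimensional subspace is what lets such methods work from very small amounts of adaptation data: only the k coordinates must be estimated, not the full supervector.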
Advancing Electromyographic Continuous Speech Recognition: Signal Preprocessing and Modeling
Speech is the natural medium of human communication, but audible speech can be overheard by bystanders and excludes speech-disabled people. This work presents a speech recognizer based on surface electromyography, where electric potentials of the facial muscles are captured by surface electrodes, allowing speech to be processed nonacoustically. A system which was state-of-the-art at the beginning of this book is substantially improved in terms of accuracy, flexibility, and robustness
Unsupervised model adaptation for continuous speech recognition using model-level confidence measures.
Kwan Ka Yan. Thesis (M.Phil.), Chinese University of Hong Kong, 2002. Includes bibliographical references. Abstracts in English and Chinese.
Chapter 1. Introduction
  1.1. Automatic Speech Recognition
  1.2. Robustness of ASR Systems
  1.3. Model Adaptation for Robust ASR
  1.4. Thesis Outline
  References
Chapter 2. Fundamentals of Continuous Speech Recognition
  2.1. Acoustic Front-End
  2.2. Recognition Module
    2.2.1. Acoustic Modeling with HMM
    2.2.2. Basic Phonology of Cantonese
    2.2.3. Acoustic Modeling for Cantonese
    2.2.4. Language Modeling
  References
Chapter 3. Unsupervised Model Adaptation
  3.1. A General Review of Model Adaptation
    3.1.1. Supervised and Unsupervised Adaptation
    3.1.2. N-Best Adaptation
  3.2. MAP
  3.3. MLLR
    3.3.1. Adaptation Approach
    3.3.2. Estimation of MLLR Regression Matrices
    3.3.3. Least Mean Squares Regression
    3.3.4. Number of Transformations
  3.4. Experiment Results
    3.4.1. Standard MLLR versus LMS MLLR
    3.4.2. Effect of the Number of Transformations
    3.4.3. MAP vs. MLLR
  3.5. Conclusions
  References
Chapter 4. Use of Confidence Measure for MLLR-Based Adaptation
  4.1. Introduction to Confidence Measure
  4.2. Confidence Measure Based on Word Density
  4.3. Model-Level Confidence Measure
  4.4. Integrating Confusion Information into Confidence Measure
  4.5. Adaptation Data Distributions in Different Confidence Measures
  References
Chapter 5. Experimental Results and Analysis
  5.1. Supervised Adaptation
  5.2. Cheated Confidence Measure
  5.3. Confidence Measures of Different Levels
  5.4. Incorporation of Confusion Matrix
  5.5. Conclusions
Chapter 6. Conclusions
  6.1. Future Works
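The MLLR technique central to this thesis adapts Gaussian mean vectors through a linear regression matrix. A minimal sketch of the mean transform, with toy numbers rather than the thesis's estimated matrices:

```python
import numpy as np

# Sketch of an MLLR-style mean transform (illustrative, not the thesis code):
# each Gaussian mean mu is adapted as mu' = A @ xi, where xi = [1, mu] is the
# extended mean vector and A = [b | W] stacks a bias column and a matrix.

dim = 3
mu = np.array([0.5, -1.2, 2.0])      # one Gaussian mean (toy values)

W = np.eye(dim) * 1.1                # assumed scaling part of the transform
b = np.array([0.1, 0.0, -0.2])       # assumed bias part
A = np.hstack([b[:, None], W])       # (dim, dim + 1) regression matrix

xi = np.concatenate([[1.0], mu])     # extended mean vector
mu_adapted = A @ xi                  # equals W @ mu + b
print(mu_adapted)
```

In practice one such matrix is shared by a whole regression class of Gaussians, which is why the number of transformations (Section 3.3.4) trades off adaptation detail against the amount of data needed to estimate each matrix reliably.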
Robust speech recognition under band-limited channels and other channel distortions
Unpublished doctoral thesis. Universidad Autónoma de Madrid, Escuela Politécnica Superior, June 200
Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling
Automatic speech recognition (ASR) systems incorporate expert knowledge of language, or linguistic expertise, through the use of a phone pronunciation lexicon (or dictionary) in which each word is associated with a sequence of phones. The creation of a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise as well as time and money. In this thesis, we focus on effective building of ASR systems in the absence of linguistic expertise for a new domain or language. In particular, we consider graphemes as alternate subword units for speech recognition. In a grapheme lexicon, the pronunciation of a word is derived from its orthography. However, modeling graphemes for speech recognition is a challenging task for two reasons. First, the grapheme-to-phoneme (G2P) relationship can be ambiguous, as languages continue to evolve after their spelling has been standardized. Second, as elucidated in this thesis, ASR systems typically model the relationship between graphemes and acoustic features directly, and the acoustic features depict the envelope of speech, which is related to phones. In this thesis, a grapheme-based ASR approach is proposed in which the modeling of the relationship between graphemes and acoustic features is factored through a latent variable into two models, namely, an acoustic model and a lexical model. The acoustic model captures the relationship between latent variables and acoustic features, while the lexical model captures a probabilistic relationship between latent variables and graphemes. We refer to the proposed approach as probabilistic lexical modeling based ASR. In the thesis we show that the latent variables can be phones, multilingual phones, or clustered context-dependent subword units, and that the acoustic model can be trained on domain-independent or language-independent resources. The lexical model is trained on transcribed speech data from the target domain or language.
In doing so, the parameters of the lexical model capture a probabilistic relationship between graphemes and phones. In the proposed grapheme-based ASR approach, lexicon learning is implicitly integrated as a phase in ASR system training, as opposed to the conventional approach in which a phone pronunciation lexicon is first developed and a phone-based ASR system is then trained. The potential and efficacy of the proposed approach are demonstrated through experiments and comparisons with other standard approaches on ASR for resource-rich languages, non-native and accented speech, under-resourced languages, and minority languages. The studies revealed that the proposed framework is particularly suitable when the task is challenged by the lack of both linguistic expertise and transcribed data. Furthermore, our investigations also showed that standard ASR approaches in which the lexical model is deterministic are more suitable for phones than graphemes, while the probabilistic lexical model based ASR approach is suitable for both. Finally, we show that the captured grapheme-to-phoneme relationship can be exploited to perform acoustic data-driven G2P conversion.
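The factoring of the grapheme-to-acoustics relationship through a latent unit can be illustrated by marginalizing over that unit. All probabilities below are invented for illustration; they are not trained values from the thesis.

```python
import numpy as np

# Hedged sketch of probabilistic lexical modeling: the grapheme-to-acoustic
# relationship is factored through latent phone-like units. The lexical
# model gives P(phone | grapheme); the acoustic model gives p(x | phone).

phones = ["k", "s"]                      # latent units (toy inventory)

# Lexical model: grapheme "c" is ambiguous between /k/ and /s/
lexical = {"c": np.array([0.7, 0.3])}    # P(phone | grapheme="c"), assumed

# Acoustic model: likelihood of one observation x under each latent unit
acoustic_lik = np.array([0.02, 0.01])    # p(x | phone), assumed values

# Marginalize the latent phone:
# p(x | grapheme) = sum_i p(x | phone_i) * P(phone_i | grapheme)
p_x_given_c = float(acoustic_lik @ lexical["c"])
print(round(p_x_given_c, 5))             # 0.017
```

The key design point the abstract makes is visible here: the acoustic model can be trained once on other resources, while only the small probability table of the lexical model must be learned from target-domain transcribed speech.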
Acoustic Approaches to Gender and Accent Identification
There has been considerable research on the problems of speaker and language recognition
from samples of speech. A less researched problem is that of accent recognition. Although this
is a similar problem to language identification, different accents of a language exhibit more
fine-grained differences between classes than languages. This presents a tougher problem
for traditional classification techniques. In this thesis, we propose and evaluate a number of
techniques for gender and accent classification. These techniques are novel modifications and
extensions to state-of-the-art algorithms, and they result in enhanced performance on gender
and accent recognition.
The first part of the thesis focuses on the problem of gender identification, and presents a
technique that gives improved performance in situations where training and test conditions are
mismatched.
The bulk of this thesis is concerned with the application to accent identification of the
i-Vector technique, the most successful approach to acoustic classification to have emerged
in recent years. We show that it is possible to achieve high accuracy accent identification without
reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis
describes various stages in the development of i-Vector based accent classification that improve
the standard approaches usually applied for speaker or language identification, which are
insufficient. We demonstrate that very good accent identification performance is possible with
acoustic methods: by considering different i-Vector projections, frontend parameters, i-Vector
configuration parameters, and an optimised fusion of the i-Vector classifiers that can be
obtained from the same data.
We claim to have achieved the best accent identification performance on the test corpus
for acoustic methods, with up to 90% identification rate. This performance is even better than
previously reported acoustic-phonotactic based systems on the same corpus, and is very close
to performance obtained via transcription based accent identification. Finally, we demonstrate
that the utilization of our techniques for speech recognition purposes leads to considerably
lower word error rates.
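One common way to classify accents from i-vectors, consistent with the acoustic methods described above, is cosine scoring against per-class mean i-vectors. The sketch below uses random stand-in vectors rather than a real i-vector extractor, and the class names and dimensionality are assumptions.

```python
import numpy as np

# Illustrative accent classification by cosine scoring of i-vectors:
# each accent class is represented by the mean of its training i-vectors,
# and a test i-vector is assigned to the highest-scoring class.

rng = np.random.default_rng(1)
dim = 400                                # typical i-vector dimensionality

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed class means for two hypothetical accents
class_means = {
    "accent_A": rng.normal(size=dim),
    "accent_B": rng.normal(size=dim),
}

# A test i-vector close to accent_A (small additive perturbation)
test_ivec = class_means["accent_A"] + 0.1 * rng.normal(size=dim)

scores = {name: cosine(test_ivec, m) for name, m in class_means.items()}
predicted = max(scores, key=scores.get)
print(predicted)                         # accent_A
```

In practice the thesis's fusion of multiple i-Vector classifiers would combine several such scores (e.g. from different projections or frontends) before the final decision; an SVM or linear scoring backend can replace plain cosine similarity.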
Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian
Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British
English, Prosody, Speech Recognition