57 research outputs found
Hidden Markov Models
Hidden Markov Models (HMMs), although known for decades, have made a big career nowadays and are still in state of development. This book presents theoretical issues and a variety of HMMs applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environment protection and engineering. I hope that the reader will find this book useful and helpful for their own research
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
An integrated approach to feature compensation combining particle filters and Hidden Markov Models for robust speech recognition
The performance of automatic speech recognition systems often degrades in adverse conditions where there is a mismatch between training and testing conditions. This is true for most modern systems which employ Hidden Markov Models (HMMs) to decode speech utterances. One strategy is to map the distorted features back to clean speech features that correspond well to the features used for training of HMMs. This can be achieved by treating the noisy speech as the distorted version of the clean speech of interest. Under this framework, we can track and consequently extract the underlying clean speech from the noisy signal and use this derived signal to perform utterance recognition. Particle filter is a versatile tracking technique that can be used where often conventional techniques such as Kalman filter fall short. We propose a particle filters based algorithm to compensate the corrupted features according to an additive noise model incorporating both the statistics from clean speech HMMs and observed background noise to map noisy features back to clean speech features. Instead of using specific knowledge at the model and state levels from HMMs which is hard to estimate, we pool model states into clusters as side information. Since each cluster encompasses more statistics when compared to the original HMM states, there is a higher possibility that the newly formed probability density function at the cluster level can cover the underlying speech variation to generate appropriate particle filter samples for feature compensation. Additionally, a dynamic joint tracking framework to monitor the clean speech signal and noise simultaneously is also introduced to obtain good noise statistics. In this approach, the information available from clean speech tracking can be effectively used for noise estimation. The availability of dynamic noise information can enhance the robustness of the algorithm in case of large fluctuations in noise parameters within an utterance. Testing the proposed PF-based compensation scheme on the Aurora 2 connected digit recognition task, we achieve an error reduction of 12.15% from the best multi-condition trained models using this integrated PF-HMM framework to estimate the cluster-based HMM state sequence information. Finally, we extended the PFC framework and evaluated it on a large-vocabulary recognition task, and showed that PFC works well for large-vocabulary systems also.Ph.D
Recommended from our members
Optimisation Methods For Training Deep Neural Networks in Speech Recognition
Automatic Speech Recognition (ASR) is an example of a sequence to sequence level classification task where, given an acoustic waveform, the goal is to produce the correct word level hypotheses. In machine learning, a classification problem such as ASR is solved in two stages: an inference stage that models the uncertainty associated with the choice of hypothesis given the acoustic waveform using a mathematical model, and a decision stage which employs the inference model in conjunction with decision theory to make optimal class assignments. With the advent of careful network initialisation and GPU computing, hybrid Hidden Markov Models (HMMs) augmented with Deep Neural Networks (DNNs) have shown to outperform traditional HMMs using Gaussian Mixture Models (GMMs) in solving the inference problem for ASR. In comparison to GMMs, DNNs possess a better capability to model the underlying non-linear data manifold due to their deep and complex structure. While the structure of such models gives rich modelling capability, it also creates complex dependencies between the parameters which can make learning difficult via first order stochastic gradient descent (SGD). The task of finding the best procedure to train DNNs continues to be an active area of research and has been made even more challenging by the availability of ever more training data. This thesis focuses on designing better optimisation approaches to train hybrid HMM-DNN models using sequence level discriminative criterion which is a natural loss function that preserves the sequential ordering of frames within a spoken utterance. The thesis presents an implementation of the second order Hessian Free (HF) optimisation method, and shows how the method can made efficient through appropriate modifications to the Conjugate Gradient algorithm. To achieve better convergence than SGD, this work explores the Natural Gradient method to train DNNs with discriminative sequence training. In the DNN literature, the method has been applied to train models for the Maximum Likelihood objective criterion. A novel contribution of this thesis is to extend this approach to the domain of Minimum Bayes Risk objective functions for discriminative sequence training. With sigmoid models trained on a 50hr and 200hr training set from the Multi-Genre Broadcast 1 (MGB1) transcription task, the NG method applied in a HF styled optimisation framework is shown to achieve better Word Error Rate (WER) reductions on the MGB1 development set than SGD from sequence training.
This thesis also addresses the particular issue of overfitting between the training criterion and WER, that primarily arises during sequence training of DNN models that use Rectified Linear Units (ReLUs) as activation functions. It is shown how by scaling with the Gauss Newton matrix, the HF method unlike other approaches can overcome this issue. Seeing that different optimisers work best with different models, it is attractive to have a consistent optimisation framework that is agnostic to the choice of activation function. To address the issue, this thesis develops the geometry of the underlying function space captured by different realisations of DNN model parameters, and presents the design considerations for an optimisation algorithm to be well defined on this space. Building on this analysis, a novel optimisation technique called NGHF is presented that uses both the direction of steepest descent on a probabilistic manifold and local curvature information to effectively probe the error surface. The basis of the method relies on an alternative derivation of Taylor’s theorem using the concepts of manifolds, tangent vectors and directional derivatives from the perspective of Information Geometry. Apart from being well defined on the function space, when framed within a HF style optimisation framework, the method of NGHF is shown to achieve the greatest WER reductions from sequence training on the MGB1 development set with both sigmoid and ReLU based models trained on the 200hr MGB1 training set. The evaluation of the above optimisation methods in training different DNN model architectures is also presented.IDB Cambridge International Scholarshi
Robust gesture recognition
It is a challenging problem to make a general hand gesture recognition system work in a practical operation environment. In this study, it is mainly focused on recognizing English letters and digits performed near the steering wheel of a car and captured by a video camera. Like most human computer interaction (HCI) scenarios, the in-car gesture recognition suffers from various robustness issues, including multiple human factors and highly varying lighting conditions. It therefore brings up quite a few research issues to be addressed. First, multiple gesturing alternatives may share the same meaning, which is not typical in most previous systems. Next, gestures may not be the same as expected because users cannot see what exactly has been written, which increases the gesture diversity significantly.In addition, varying illumination conditions will make hand detection trivial and thus result in noisy hand gestures. And most severely, users will tend to perform letters at a fast pace, which may result in lack of frames for well-describing gestures. Since users are allowed to perform gestures in free-style, multiple alternatives and variations should be considered while modeling gestures. The main contribution of this work is to analyze and address these challenging issues step-by-step such that eventually the robustness of the whole system can be effectively improved. By choosing color-space representation and performing the compensation techniques for varying recording conditions, the hand detection performance for multiple illumination conditions is first enhanced. Furthermore, the issues of low frame rate and different gesturing tempo will be separately resolved via the cubic B-spline interpolation and i-vector method for feature extraction. Finally, remaining issues will be handled by other modeling techniques such as sub-letter stroke modeling. According to experimental results based on the above strategies, the proposed framework clearly improved the system robustness and thus encouraged the future research direction on exploring more discriminative features and modeling techniques.Ph.D
- …