1,308 research outputs found
Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?
Hidden Markov models (HMMs) have been successfully applied to automatic
speech recognition for more than 35 years in spite of the fact that a key HMM
assumption -- the statistical independence of frames -- is obviously violated
by speech data. In fact, this data/model mismatch has inspired many attempts to
modify or replace HMMs with alternative models that are better able to take
into account the statistical dependence of frames. However it is fair to say
that in 2010 the HMM is the consensus model of choice for speech recognition
and that HMMs are at the heart of both commercially available products and
contemporary research systems. In this paper we present a preliminary
exploration aimed at understanding how speech data depart from HMMs and what
effect this departure has on the accuracy of HMM-based speech recognition. Our
analysis uses standard diagnostic tools from the field of statistics --
hypothesis testing, simulation and resampling -- which are rarely used in the
field of speech recognition. Our main result, obtained by novel manipulations
of real and resampled data, demonstrates that real data have statistical
dependency and that this dependency is responsible for significant numbers of
recognition errors. We also demonstrate, using simulation and resampling, that
if we `remove' the statistical dependency from data, then the resulting
recognition error rates become negligible. Taken together, these results
suggest that a better understanding of the structure of the statistical
dependency in speech data is a crucial first step towards improving HMM-based
speech recognition
Hidden Markov models and neural networks for speech recognition
The Hidden Markov Model (HMMs) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first order dependencies in the observed data sequences. This is due to the first order state process and the assumption of state conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and ..
Conditional Random Field Autoencoders for Unsupervised Structured Prediction
We introduce a framework for unsupervised learning of structured predictors
with overlapping, global features. Each input's latent representation is
predicted conditional on the observable data using a feature-rich conditional
random field. Then a reconstruction of the input is (re)generated, conditional
on the latent structure, using models for which maximum likelihood estimation
has a closed-form. Our autoencoder formulation enables efficient learning
without making unrealistic independence assumptions or restricting the kinds of
features that can be used. We illustrate insightful connections to traditional
autoencoders, posterior regularization and multi-view learning. We show
competitive results with instantiations of the model for two canonical NLP
tasks: part-of-speech induction and bitext word alignment, and show that
training our model can be substantially more efficient than comparable
feature-rich baselines
Continuous Action Recognition Based on Sequence Alignment
Continuous action recognition is more challenging than isolated recognition
because classification and segmentation must be simultaneously carried out. We
build on the well known dynamic time warping (DTW) framework and devise a novel
visual alignment technique, namely dynamic frame warping (DFW), which performs
isolated recognition based on per-frame representation of videos, and on
aligning a test sequence with a model sequence. Moreover, we propose two
extensions which enable to perform recognition concomitant with segmentation,
namely one-pass DFW and two-pass DFW. These two methods have their roots in the
domain of continuous recognition of speech and, to the best of our knowledge,
their extension to continuous visual action recognition has been overlooked. We
test and illustrate the proposed techniques with a recently released dataset
(RAVEL) and with two public-domain datasets widely used in action recognition
(Hollywood-1 and Hollywood-2). We also compare the performances of the proposed
isolated and continuous recognition algorithms with several recently published
methods
Automatic speech recognition: from study to practice
Today, automatic speech recognition (ASR) is widely used for different purposes such as robotics, multimedia, medical and industrial application. Although many researches have been performed in this field in the past decades, there is still a lot of room to work. In order to start working in this area, complete knowledge of ASR systems as well as their weak points and problems is inevitable. Besides that, practical experience improves the theoretical knowledge understanding in a reliable way. Regarding to these facts, in this master thesis, we have first reviewed the principal structure of the standard HMM-based ASR systems from technical point of view. This includes, feature extraction, acoustic modeling, language modeling and decoding. Then, the most significant challenging points in ASR systems is discussed. These challenging points address different internal components characteristics or external agents which affect the ASR systems performance. Furthermore, we have implemented a Spanish language recognizer using HTK toolkit. Finally, two open research lines according to the studies of different sources in the field of ASR has been suggested for future work
- …