10 research outputs found
Predictive Interfaces for Long-Distance Tele-Operations
We address the development of predictive tele-operator interfaces for humanoid robots with respect to two basic challenges. First, we address automating the transition from fully tele-operated systems towards degrees of autonomy. Second, we develop compensation for the time delay that arises when sending telemetry data from a remote operation point to robots in low Earth orbit and beyond. Humanoid robots have a great advantage over other robotic platforms for space-based construction and maintenance because they can use the same tools as astronauts. Their major disadvantage is that they are difficult to control due to their large number of degrees of freedom, which makes it hard to synthesize autonomous behaviors by conventional means. We are working with the NASA Johnson Space Center's Robonaut, an anthropomorphic robot with fully articulated hands, arms, and neck. We have trained hidden Markov models that use the command data, sensory streams, and other relevant data sources to predict a tele-operator's intent. This allows us to achieve subgoal-level commanding without predefined command dictionaries, and to create subgoal autonomy via sequence generation from generative models. Our method serves as a means to incrementally transition from manual tele-operation to semi-autonomous, supervised operation. The multi-agent laboratory experiments conducted by Ambrose et al. have shown that it is feasible for humans to directly tele-operate multiple Robonauts to perform complex tasks such as truss assembly. However, once a time delay is introduced into the system, tele-operation slows to a bump-and-wait style of activity. We would like to maintain the same interface to the operator despite time delays.
To this end, we are developing an interface that allows us to predict the intentions of the operator while the operator interacts with a 3D virtual representation of the expected state of the robot. The predictive interface anticipates the intention of the operator and uses this prediction to initiate appropriate subgoal autonomy tasks.
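The intent-prediction scheme described above can be sketched with discrete HMMs: one model is trained per subgoal task, and the operator's intent is taken to be the task whose model best explains the recent command stream. The task names, the two-symbol coding of commands, and all probabilities below are illustrative assumptions, not Robonaut's actual telemetry or models.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | discrete HMM)."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_p = np.log(c)
    alpha /= c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()          # rescale each step to avoid underflow
        log_p += np.log(c)
        alpha /= c
    return log_p

# Two hypothetical task models: (initial probs, transitions, emissions).
# Observations are a coarse 2-symbol coding of the operator's commands.
tasks = {
    "grasp_tool": (np.array([0.6, 0.4]),
                   np.array([[0.7, 0.3], [0.4, 0.6]]),
                   np.array([[0.9, 0.1], [0.8, 0.2]])),
    "turn_bolt":  (np.array([0.5, 0.5]),
                   np.array([[0.6, 0.4], [0.3, 0.7]]),
                   np.array([[0.1, 0.9], [0.2, 0.8]])),
}

def predict_intent(obs):
    """Pick the task model that best explains the observed commands."""
    return max(tasks, key=lambda t: forward_log_likelihood(obs, *tasks[t]))

predicted = predict_intent([0, 0, 0, 1, 0])
```

In a real interface the winning model's generative structure could then drive the sub-goal autonomy sequence, as the abstract describes.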
Bayesian adaptive learning of the parameters of hidden Markov model for speech recognition
A theoretical framework for Bayesian adaptive training of the parameters of a discrete hidden Markov model (DHMM) and of a semi-continuous HMM (SCHMM) with Gaussian mixture state observation densities is presented. In addition to formulating the forward-backward MAP (maximum a posteriori) and the segmental MAP algorithms for estimating these HMM parameters, a computationally efficient segmental quasi-Bayes algorithm for estimating the state-specific mixture coefficients in the SCHMM is developed. For estimating the parameters of the prior densities, a new empirical Bayes method based on moment estimates is also proposed. The MAP algorithms and the prior parameter specification are directly applicable to training speaker-adaptive HMMs. Practical issues related to the use of the proposed techniques for HMM-based speaker adaptation are studied. The proposed MAP algorithms are shown to be effective, especially when the training or adaptation data are limited.
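The MAP principle underlying these algorithms can be illustrated in its simplest case, the mean of a single Gaussian: the estimate interpolates between the prior mean and the occupancy-weighted sample mean, so with little adaptation data it stays near the prior and with much data it approaches the maximum-likelihood estimate. The prior weight `tau` and all numbers below are illustrative; the paper's full formulation also covers mixture coefficients and SCHMM densities.

```python
import numpy as np

def map_adapt_mean(prior_mean, tau, frames, gammas):
    """MAP estimate of a Gaussian mean: a prior-weighted interpolation
    between the prior mean and the sample (ML) mean of the data."""
    occ = gammas.sum()                               # state occupancy count
    ml_mean = (gammas[:, None] * frames).sum(axis=0) / occ
    return (tau * prior_mean + occ * ml_mean) / (tau + occ)

prior = np.zeros(2)
tau = 10.0                                           # prior weight (tunable)
# Adaptation frames all at [1, 1]: few frames vs. many frames.
few  = map_adapt_mean(prior, tau, np.ones((2, 2)),   np.ones(2))
many = map_adapt_mean(prior, tau, np.ones((200, 2)), np.ones(200))
```

With 2 frames the estimate is 2/12 of the way to the data; with 200 frames it is 200/210 of the way, matching the "effective when data are limited" behaviour the abstract reports.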
Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling
Automatic speech recognition (ASR) systems incorporate expert knowledge of language, or linguistic expertise, through a phone pronunciation lexicon (or dictionary) in which each word is associated with a sequence of phones. Creating a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise as well as time and money. In this thesis, we focus on effectively building ASR systems in the absence of linguistic expertise for a new domain or language. In particular, we consider graphemes as alternative subword units for speech recognition. In a grapheme lexicon, the pronunciation of a word is derived from its orthography. However, modeling graphemes for speech recognition is challenging for two reasons. First, the grapheme-to-phoneme (G2P) relationship can be ambiguous, as languages continue to evolve after their spelling has been standardized. Second, as elucidated in this thesis, ASR systems typically model the relationship between graphemes and acoustic features directly, and the acoustic features describe the envelope of speech, which is related to phones. In this thesis, a grapheme-based ASR approach is proposed in which the modeling of the relationship between graphemes and acoustic features is factored through a latent variable into two models: an acoustic model and a lexical model. The acoustic model captures the relationship between latent variables and acoustic features, while the lexical model captures a probabilistic relationship between latent variables and graphemes. We refer to the proposed approach as probabilistic lexical modeling based ASR. In the thesis we show that the latent variables can be phones, multilingual phones, or clustered context-dependent subword units, and that the acoustic model can be trained on domain-independent or language-independent resources. The lexical model is trained on transcribed speech data from the target domain or language.
In doing so, the parameters of the lexical model capture a probabilistic relationship between graphemes and phones. In the proposed grapheme-based ASR approach, lexicon learning is implicitly integrated as a phase of ASR system training, as opposed to the conventional approach in which a phone pronunciation lexicon is first developed and a phone-based ASR system is then trained. The potential and efficacy of the proposed approach are demonstrated through experiments and comparisons with other standard approaches on ASR for resource-rich languages, non-native and accented speech, under-resourced languages, and minority languages. The studies revealed that the proposed framework is particularly suitable when the task is challenged by a lack of both linguistic expertise and transcribed data. Furthermore, our investigations showed that standard ASR approaches, in which the lexical model is deterministic, are more suitable for phones than graphemes, while the probabilistic lexical modeling based ASR approach is suitable for both. Finally, we show that the captured grapheme-to-phoneme relationship can be exploited to perform acoustic data-driven G2P conversion.
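A toy sketch of the factorization just described: the acoustic model emits a posterior distribution over latent phones for each frame, the lexical model stores a probabilistic map from graphemes to latent phones, and marginalising over the latent variable scores each grapheme state. The phone and grapheme inventories and all probabilities below are invented for illustration and are not taken from the thesis.

```python
import numpy as np

# Toy inventories (illustrative assumptions).
phones = ["k", "s", "a"]          # latent units modelled acoustically
graphemes = ["c", "a"]

# Lexical model: P(phone | grapheme), learned from transcribed speech.
# Grapheme "c" is ambiguous between /k/ and /s/; "a" maps mostly to /a/.
lex = np.array([
    [0.60, 0.35, 0.05],           # grapheme "c"
    [0.05, 0.05, 0.90],           # grapheme "a"
])

def grapheme_scores(phone_posteriors):
    """Score each grapheme state for one frame by marginalising the
    latent phone: score(g) = sum_p P(p | x) * P(p | g)."""
    return lex @ phone_posteriors

# One frame where the acoustic model says: probably /k/.
frame = np.array([0.7, 0.2, 0.1])
scores = grapheme_scores(frame)
```

Because the grapheme-to-phone ambiguity lives in the lexical model rather than the lexicon, the same acoustic model can be reused across domains, which is the point of the factorization.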
Contribution to parametric feature extraction for robust speech recognition based on the application of acoustic-phonetic knowledge
This thesis is based on the following hypothesis: introducing direct knowledge from the field of acoustic phonetics into the speech recognition problem, particularly at the feature extraction step, can provide a solid basis for analysing the behaviour and discriminative capabilities of such systems, as well as a means of improving them.
Much of the complexity of this Ph.D. thesis stems from the range of disciplines involved in speech processing. Applying acoustic-phonetic information to speech recognition research requires a deep knowledge of both subjects.
The research carried out in this work is divided into two main parts: an analysis of current feature extraction methods, and a study of several possible ways of incorporating acoustic-phonetic knowledge into those systems.
Extensive recognition results and related quality measures are presented for 50 different parameter extraction models.
Details of the real-time implementation of two of these parameter extraction models on a DSP platform (TMS320C31-60) are also given.
Finally, a set of computer tools for building and testing new speech recognition systems has been produced. Several results from this work can also be extended to other speech processing areas, such as computer-assisted language learning and speech therapy.
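As a concrete example of the kind of parameter extraction this thesis studies, a minimal per-frame cepstral front end might look as follows. This is a generic sketch (pre-emphasis coefficient, frame sizes, and coefficient count are common textbook choices), not one of the thesis's 50 models.

```python
import numpy as np

def cepstral_features(signal, frame_len=256, hop=128, n_ceps=12):
    """Minimal parametric feature extraction: per-frame real cepstrum.

    Steps: pre-emphasis -> Hamming window -> log magnitude spectrum
    -> inverse FFT back to the cepstral domain -> keep n_ceps coeffs.
    """
    # First-order pre-emphasis to flatten the spectral tilt.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # avoid log(0)
        cepstrum = np.fft.irfft(np.log(spectrum))
        feats.append(cepstrum[:n_ceps])
    return np.array(feats)

# 1024 samples of a 440 Hz tone at a nominal 8 kHz sampling rate.
sig = np.sin(2 * np.pi * 440 * np.arange(1024) / 8000.0)
feats = cepstral_features(sig)
```

Acoustic-phonetic knowledge would enter such a pipeline in the choices this sketch leaves generic: the frequency warping, the band weighting, and which coefficients are kept.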
Signal processing and acoustic modelling of speech signals for speech recognition systems
Natural man-machine interaction remains one of the unfulfilled promises of automatic speech recognition (ASR). The purpose of an ASR system is to accurately transcribe or execute what has been said. State-of-the-art speech recognition systems consist of four basic modules: signal processing, acoustic modelling, language modelling, and the search engine. The subject of this thesis is the signal processing and acoustic modelling modules. We pursue modelling of the spoken signal in as close to an optimal way as possible, so that the resulting modules can be used successfully by the subsequent two.
Since the first-order hidden Markov model (HMM) is a mathematically well-established and tremendously successful paradigm, and remains the technique of choice in current speech recognition systems, this dissertation bases all its studies and experiments on HMMs. The HMM is a statistical framework that supports both acoustic and temporal modelling. It is widely used despite making a number of suboptimal modelling assumptions that limit its full potential. We investigate how model design strategies and algorithms can be adapted to HMMs, and present extensive experimental results to show the relative effectiveness of each component within the HMM paradigm.
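The temporal modelling an HMM provides rests on dynamic-programming search over state sequences; a minimal Viterbi decoder for a discrete HMM might look as follows. The two-state model and its transition and emission values are purely illustrative, not taken from the dissertation's systems.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path through a discrete HMM (log domain)."""
    S, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)      # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)          # best predecessor of j
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.9, 0.1])
A = np.array([[0.8, 0.2], [0.1, 0.9]])           # sticky states
B = np.array([[0.9, 0.1], [0.1, 0.9]])           # state i prefers symbol i
path = viterbi([0, 0, 1, 1], pi, A, B)
```

The same recursion, with Gaussian mixture likelihoods replacing the discrete emission table, is the search backbone of the continuous-density systems the dissertation experiments with.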
This dissertation presents several strategies for improving the overall performance of baseline speech recognition systems; the implementation of these strategies was optimised in a series of experiments. We also investigate selecting optimal feature sets for improved recognition. Moreover, since the reliability of human speech recognition is attributed to specific properties of the auditory presentation of speech, we explore perceptually inspired signal processing strategies such as critical-band frequency analysis. The resulting speech representation, Gammatone cepstral coefficients (GTCC), yields a relative improvement over the baseline recogniser. We also investigate combining multiple signal representations within an ASR system to improve the recognition rate.
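The critical-band analysis behind a gammatone-based representation such as GTCC typically places filters at centre frequencies equally spaced on the ERB-rate scale (Glasberg and Moore's formula). A sketch of that spacing is below; the filter count and band edges are illustrative choices, not the dissertation's configuration.

```python
import numpy as np

def erb_center_freqs(f_min, f_max, n_filters):
    """Centre frequencies equally spaced on the ERB-rate scale,
    as commonly used to lay out a gammatone filterbank."""
    def hz_to_erb_rate(f):
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    erb_lo, erb_hi = hz_to_erb_rate(f_min), hz_to_erb_rate(f_max)
    return erb_rate_to_hz(np.linspace(erb_lo, erb_hi, n_filters))

cfs = erb_center_freqs(100.0, 8000.0, 24)
```

The resulting bands are narrow and dense at low frequencies and progressively wider at high frequencies, mimicking the cochlea's frequency resolution, which is what makes such front ends "perceptually inspired".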
Additionally, we developed fast techniques useful for evaluating and comparing different signal processing paradigms. The main contributions of this dissertation are:
• Speech/background discrimination.
• HMM initialisation techniques.
• Multiple signal representation with multi-stream paradigms.
• Gender based modelling.
• Feature vectors dimensionality reduction.
• Perceptually motivated feature sets.
• ASR training and recognition packages for research and development.
Many of these methods can be applied in practical systems; the proposed techniques can be used directly in more complex speech recognition systems by passing their outputs to the language modelling and search-engine modules.
Signal processing and acoustic modelling of speech signals for speech recognition systems
Natural man-machine interaction is currently one of the most unfulfilled pledges of automatic speech recognition (ASR). The purpose of an automatic speech recognition system is to accurately transcribe or execute what has been said. State-of-the-art speech recognition systems consist of four basic modules: the signal processing, the acoustic modelling, the language modelling, and the search engine. The subject of this thesis is the signal processing and acoustic modelling modules. We pursue the modelling of spoken signals in an optimum way. The resultant modules can be used successfully for the subsequent two modules.
Since the first-order hidden Markov model (HMM) is a mathematically well-founded and tremendously successful paradigm, and remains the technique of choice in current speech recognition systems, this dissertation bases all of its studies and experiments on HMMs. The HMM is a statistical framework that supports both acoustic and temporal modelling. It is widely used despite making a number of suboptimal modelling assumptions that limit its full potential. We investigate how the model design strategy and the algorithms can be adapted to HMMs, and present extensive experimental results showing the relative effectiveness of each component within the HMM paradigm.
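The central HMM computation is the likelihood of an observation sequence, obtained by summing over all state paths with the forward algorithm. A minimal sketch for a discrete-observation HMM follows; the two-state, two-symbol parameters are illustrative only.

```python
import numpy as np

def forward_loglik(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm (sums over all state paths)."""
    alpha = log_pi + log_B[:, obs[0]]          # initialise with first symbol
    for o in obs[1:]:
        # log-sum-exp over previous states i for each current state j
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return float(np.logaddexp.reduce(alpha))   # sum over final states

# Toy 2-state model: initial probs, transition matrix, emission matrix.
pi = np.log([0.6, 0.4])
A = np.log([[0.7, 0.3],
            [0.4, 0.6]])
B = np.log([[0.9, 0.1],
            [0.2, 0.8]])
print(forward_loglik([0, 1, 0], pi, A, B))
```

Working in the log domain avoids the numerical underflow that plagues the direct product-of-probabilities formulation on long utterances.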
This dissertation presents several strategies for improving the overall performance of baseline speech recognition systems; the implementation of each strategy was optimised in a series of experiments. We also investigate selecting optimal feature sets for improved recognition. Since the reliability of human speech recognition is attributed to specific properties of the auditory representation of speech, we explore perceptually inspired signal-processing strategies, such as critical-band frequency analysis. The resulting speech representation, Gammatone cepstral coefficients (GTCC), provides a relative improvement over the baseline recogniser. We also investigate combining multiple signal representations within an ASR system to improve the recognition rate.
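The critical-band idea can be sketched as follows. This is a simplification, not the thesis's GTCC front end: a proper gammatone filterbank is more involved, so the sketch approximates critical-band analysis with triangular filters spaced on the ERB-rate scale of Glasberg and Moore, then decorrelates the log band energies with a DCT to obtain cepstral coefficients.

```python
import numpy as np

def erb_space(f_lo, f_hi, n):
    """Centre frequencies equally spaced on the ERB-rate scale
    (Glasberg & Moore): ERBrate(f) = 21.4 * log10(1 + 0.00437 f)."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    erb_inv = lambda r: (10 ** (r / 21.4) - 1) / 0.00437
    return erb_inv(np.linspace(erb(f_lo), erb(f_hi), n))

def cepstra(frame, sr=16000, n_bands=20, n_ceps=8):
    spec = np.abs(np.fft.rfft(frame)) ** 2                 # power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)
    centres = erb_space(100, sr / 2 - 100, n_bands + 2)
    energies = []
    for k in range(1, n_bands + 1):
        lo, c, hi = centres[k - 1], centres[k], centres[k + 1]
        # Triangular weighting around each critical-band centre.
        w = np.clip(np.minimum((freqs - lo) / (c - lo),
                               (hi - freqs) / (hi - c)), 0, None)
        energies.append(np.log(np.sum(w * spec) + 1e-10))
    energies = np.array(energies)
    # DCT-II decorrelates the log band energies into cepstra.
    n = np.arange(n_bands)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_bands)
    return dct @ energies

frame = np.sin(2 * np.pi * 1000 * np.arange(512) / 16000)  # 1 kHz tone
print(cepstra(frame).shape)
```

Because the ERB-rate scale compresses high frequencies, low-frequency detail, where speech carries most of its phonetic information, gets proportionally more filters than a uniform spacing would give it.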
Additionally, we developed fast techniques for evaluating and comparing different signal-processing paradigms. The main contributions of this dissertation are:
• Speech/background discrimination.
• HMM initialisation techniques.
• Multiple signal representations within multi-stream paradigms.
• Gender-based modelling.
• Feature-vector dimensionality reduction.
• Perceptually motivated feature sets.
• ASR training and recognition packages for research and development.
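One of the contributions above, feature-vector dimensionality reduction, can be illustrated with a standard PCA projection. This is a generic sketch under assumed inputs (a matrix of 13-dimensional cepstral vectors), not the specific reduction method developed in the thesis.

```python
import numpy as np

def pca_reduce(X, k):
    """Project feature vectors onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                      # centre the data
    cov = np.cov(Xc, rowvar=False)               # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]]    # k largest-variance directions
    return Xc @ top

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 13))                   # e.g. 13-dim cepstral vectors
Y = pca_reduce(X, 5)
print(Y.shape)
```

Reducing the feature dimension this way shrinks the number of Gaussian parameters each HMM state must estimate, which matters when training data is limited.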
Many of these methods can be applied in practice: the proposed techniques can be used directly within more sophisticated speech recognition systems by passing their outputs to the language modelling and search-engine modules.