10 research outputs found

    Predictive Interfaces for Long-Distance Tele-Operations

    Get PDF
    We address the development of predictive tele-operator interfaces for humanoid robots with respect to two basic challenges. First, we address automating the transition from fully tele-operated systems toward increasing degrees of autonomy. Second, we develop compensation for the time delay incurred when sending telemetry data from a remote operation point to robots in low Earth orbit and beyond. Humanoid robots have a great advantage over other robotic platforms for space-based construction and maintenance because they can use the same tools as astronauts. Their major disadvantage is that they are difficult to control due to their large number of degrees of freedom, which makes it difficult to synthesize autonomous behaviors by conventional means. We are working with NASA Johnson Space Center's Robonaut, an anthropomorphic robot with fully articulated hands, arms, and neck. We have trained hidden Markov models that use command data, sensory streams, and other relevant data sources to predict a tele-operator's intent. This allows us to achieve subgoal-level commanding without predefined command dictionaries, and to create sub-goal autonomy via sequence generation from generative models. Our method works as a means to incrementally transition from manual tele-operation to semi-autonomous, supervised operation. The multi-agent laboratory experiments conducted by Ambrose et al. have shown that it is feasible to directly tele-operate multiple Robonauts with humans to perform complex tasks such as truss assembly. However, once a time delay is introduced into the system, tele-operation slows to a bump-and-wait style of activity. We would like to maintain the same operator interface despite time delays. 
To this end, we are developing an interface that predicts the intentions of the operator while the operator interacts with a 3D virtual representation of the expected state of the robot. The predictive interface anticipates the operator's intention and then uses this prediction to initiate the appropriate sub-goal autonomy tasks.
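The intent-prediction step described above can be sketched with a toy forward-algorithm comparison: a quantized command stream is scored under one HMM per candidate task, and the highest-likelihood model names the predicted sub-goal. The two-state models and the "grasp"/"reach" task names below are illustrative, not the Robonaut system's actual models.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm in log space for stability."""
    alpha = np.log(pi) + np.log(B[:, obs[0]])
    for o in obs[1:]:
        # Sum over previous states, then emit the current symbol.
        alpha = np.logaddexp.reduce(alpha[:, None] + np.log(A), axis=0) + np.log(B[:, o])
    return np.logaddexp.reduce(alpha)

# Two toy 2-state task models; the operator's quantized command stream
# is scored against each, and the better-scoring model wins.
pi = np.array([0.9, 0.1])
A_grasp = np.array([[0.8, 0.2], [0.3, 0.7]])
B_grasp = np.array([[0.9, 0.1], [0.2, 0.8]])
A_reach = np.array([[0.5, 0.5], [0.5, 0.5]])
B_reach = np.array([[0.5, 0.5], [0.5, 0.5]])

obs = [0, 0, 1, 1]  # quantized tele-operation command symbols
ll_grasp = forward_log_likelihood(obs, pi, A_grasp, B_grasp)
ll_reach = forward_log_likelihood(obs, pi, A_reach, B_reach)
intent = "grasp" if ll_grasp > ll_reach else "reach"
```

In a real system the observation symbols would come from quantized command and sensory telemetry, and the winning model's generative sequences could seed the sub-goal autonomy.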

    Bayesian adaptive learning of the parameters of hidden Markov model for speech recognition

    Get PDF
    A theoretical framework for Bayesian adaptive training of the parameters of a discrete hidden Markov model (DHMM) and of a semi-continuous HMM (SCHMM) with Gaussian mixture state observation densities is presented. In addition to formulating the forward-backward MAP (maximum a posteriori) and the segmental MAP algorithms for estimating the above HMM parameters, a computationally efficient segmental quasi-Bayes algorithm for estimating the state-specific mixture coefficients in SCHMM is developed. For estimating the parameters of the prior densities, a new empirical Bayes method based on the moment estimates is also proposed. The MAP algorithms and the prior parameter specification are directly applicable to training speaker-adaptive HMMs. Practical issues related to the use of the proposed techniques for HMM-based speaker adaptation are studied. The proposed MAP algorithms are shown to be effective, especially in cases in which the training or adaptation data are limited.
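The key property of MAP estimation with limited adaptation data can be illustrated in the simplest setting: a Gaussian mean with a conjugate normal prior, where the estimate shrinks toward the prior (speaker-independent) mean when data are scarce and approaches the ML estimate when data are plentiful. The prior weight and the data below are illustrative, not the paper's SCHMM formulation.

```python
import numpy as np

def map_mean(x, mu0, tau):
    """MAP estimate of a Gaussian mean under a conjugate normal prior
    centred at mu0; tau acts as a 'prior sample count'. With little
    data the estimate stays near mu0, with much data it approaches
    the sample mean (the ML estimate)."""
    n = len(x)
    return (tau * mu0 + n * np.mean(x)) / (tau + n)

rng = np.random.default_rng(0)
mu0 = 0.0                                 # speaker-independent prior mean
adapt = rng.normal(2.0, 1.0, size=5)      # 5 adaptation frames from a new speaker
few = map_mean(adapt, mu0, tau=20.0)      # dominated by the prior
many = map_mean(rng.normal(2.0, 1.0, size=5000), mu0, tau=20.0)  # near the true 2.0
```

This shrinkage is exactly why the paper's MAP algorithms are most effective when adaptation data are limited: the prior keeps poorly observed parameters sensible.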

    Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling

    Get PDF
    Automatic speech recognition (ASR) systems incorporate expert linguistic knowledge through a phone pronunciation lexicon (or dictionary), in which each word is associated with a sequence of phones. Creating a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise as well as time and money. In this thesis, we focus on effectively building ASR systems in the absence of linguistic expertise for a new domain or language. In particular, we consider graphemes as alternate subword units for speech recognition. In a grapheme lexicon, the pronunciation of a word is derived from its orthography. However, modeling graphemes for speech recognition is challenging for two reasons. First, the grapheme-to-phoneme (G2P) relationship can be ambiguous, as languages continue to evolve after their spelling has been standardized. Second, as elucidated in this thesis, ASR systems typically model the relationship between graphemes and acoustic features directly, yet the acoustic features depict the envelope of speech, which is related to phones. In this thesis, a grapheme-based ASR approach is proposed in which the modeling of the relationship between graphemes and acoustic features is factored through a latent variable into two models, namely, an acoustic model and a lexical model. In the acoustic model the relationship between latent variables and acoustic features is modeled, while in the lexical model a probabilistic relationship between latent variables and graphemes is modeled. We refer to the proposed approach as probabilistic lexical modeling based ASR. In the thesis we show that the latent variables can be phones, multilingual phones, or clustered context-dependent subword units, and that the acoustic model can be trained on domain-independent or language-independent resources. The lexical model is trained on transcribed speech data from the target domain or language. 
In doing so, the parameters of the lexical model capture a probabilistic relationship between graphemes and phones. In the proposed grapheme-based ASR approach, lexicon learning is implicitly integrated as a phase of ASR system training, as opposed to the conventional approach in which a phone pronunciation lexicon is first developed and then a phone-based ASR system is trained. The potential and efficacy of the proposed approach are demonstrated through experiments and comparisons with other standard approaches on ASR for resource-rich languages, non-native and accented speech, under-resourced languages, and minority languages. The studies revealed that the proposed framework is particularly suitable when the task is challenged by the lack of both linguistic expertise and transcribed data. Furthermore, our investigations also showed that standard ASR approaches, in which the lexical model is deterministic, are more suitable for phones than graphemes, while the probabilistic lexical modeling based ASR approach is suitable for both. Finally, we show that the captured grapheme-to-phoneme relationship can be exploited to perform acoustic data-driven G2P conversion.
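The factorization described above can be sketched numerically: the acoustic model yields posteriors over latent phone-like units, the lexical model gives a grapheme distribution per latent unit, and marginalizing the latent variable couples graphemes to acoustics. The units, graphemes, and probabilities below are invented for illustration (a toy ambiguity of the letter "c").

```python
import numpy as np

latents = ["k", "s"]       # latent phone-like subword units (illustrative)
graphemes = ["c", "k"]

# Lexical model: P(grapheme | latent unit), learned from transcribed speech.
P_g_given_l = np.array([[0.6, 0.4],    # latent "k" -> graphemes c, k
                        [0.9, 0.1]])   # latent "s" -> graphemes c, k

# Acoustic model output: P(latent unit | acoustic frame), one row per frame.
P_l_given_x = np.array([[0.8, 0.2],
                        [0.1, 0.9]])

# Marginalize the latent variable at each frame:
# P(grapheme | x_t) = sum_l P(l | x_t) * P(grapheme | l)
P_g_given_x = P_l_given_x @ P_g_given_l
```

Because the grapheme-acoustic link is a proper probability table rather than a fixed mapping, the same machinery handles both the unambiguous and the ambiguous G2P cases the thesis discusses.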

    Contribution to parametric feature extraction for robust speech recognition based on the application of acoustic-phonetic knowledge

    Full text link
    This thesis is based on the following hypothesis: the introduction of direct knowledge from the acoustic-phonetic field into the speech recognition problem, especially in the feature extraction step, may constitute a solid basis for analyzing the behavior and capabilities of such systems, as well as for improving them. Much of the complexity of this Ph.D. thesis comes from the different subjects related to the speech processing area. Applying acoustic-phonetic information to the speech recognition research area requires a deep knowledge of both fields. The research carried out in this work has been divided into two main parts: an analysis of current feature extraction methods, and a study of several possible procedures for incorporating acoustic-phonetic knowledge into those systems. Abundant recognition results and related quality measures are presented for 50 different parameter extraction models. Details are given of the real-time implementation, on a DSP platform (TMS320C31-60), of two different parameter extraction models. Finally, a set of computer tools for building and testing new speech recognition systems has been produced. In addition, several results from this work can be extended to other speech processing areas, such as computer-assisted language learning, linguistic rehabilitation, etc.
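Most of the parameter extraction models the thesis compares share a common front end before the knowledge-driven stages diverge. A minimal sketch of that shared stage (pre-emphasis, Hamming-windowed framing, per-frame log power spectrum), with illustrative frame sizes, not the thesis's exact configuration:

```python
import numpy as np

def frame_log_spectrum(signal, fs, frame_ms=25, hop_ms=10, preemph=0.97):
    """Common speech front end: pre-emphasis, Hamming-windowed framing,
    and a per-frame log power spectrum. Cepstral or auditory-model
    stages would follow this step."""
    # Pre-emphasis boosts high frequencies flattened by the glottal source.
    x = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = np.hamming(flen)
    n_frames = 1 + (len(x) - flen) // hop
    frames = np.stack([x[i * hop : i * hop + flen] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # floor avoids log(0) in silent frames

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)   # 1 s, 440 Hz test tone
S = frame_log_spectrum(tone, fs)     # (frames, frequency bins)
```

Everything downstream of this step, filterbank shapes, cepstral liftering, dynamic features, is where acoustic-phonetic knowledge can be injected, which is the subject of the thesis's 50-model comparison.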

    Signal processing and acoustic modelling of speech signals for speech recognition systems

    No full text
    Natural man-machine interaction is currently one of the great unfulfilled promises of automatic speech recognition (ASR). The purpose of an automatic speech recognition system is to accurately transcribe or execute what has been said. State-of-the-art speech recognition systems consist of four basic modules: signal processing, acoustic modelling, language modelling, and the search engine. The subject of this thesis is the signal processing and acoustic modelling modules; we pursue the modelling of spoken signals in a way that best serves the two subsequent modules. Since the first-order hidden Markov model (HMM) is a tremendously successful, mathematically well-established paradigm, and remains the dominant technique in current speech recognition systems, this dissertation bases all its studies and experiments on HMMs. The HMM is a statistical framework that supports both acoustic and temporal modelling. It is widely used despite making a number of suboptimal modelling assumptions that limit its full potential. We investigate how model design strategies and algorithms can be adapted to HMMs, and large suites of experimental results are presented to demonstrate the relative effectiveness of each component within the HMM paradigm. This dissertation presents several strategies for improving the overall performance of baseline speech recognition systems; the implementation of these strategies was optimised in a series of experiments. We also investigate selecting optimal feature sets for improved speech recognition. Moreover, the reliability of human speech recognition is attributed to specific properties of the auditory presentation of speech; thus, in this dissertation, we explore perceptually inspired signal processing strategies, such as critical-band frequency analysis. 
    The resulting speech representation, called gammatone cepstral coefficients (GTCC), provides a relative improvement over the baseline recogniser. We also investigate multiple signal representations for recognition within an ASR system to improve the recognition rate. Additionally, we developed fast techniques that are useful for evaluating and comparing different signal processing paradigms. The main contributions of this dissertation are:
    • Speech/background discrimination.
    • HMM initialisation techniques.
    • Multiple signal representations with multi-stream paradigms.
    • Gender-based modelling.
    • Feature vector dimensionality reduction.
    • Perceptually motivated feature sets.
    • ASR training and recognition packages for research and development.
    Many of these methods can be applied in practical applications. The proposed techniques can be used directly in more complex speech recognition systems by feeding their outputs to the language modelling and search engine modules.
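The critical-band front end behind GTCC can be sketched from the standard fourth-order gammatone impulse response, with bandwidths set by the widely used Glasberg-Moore ERB scale; the duration and centre frequencies below are illustrative, not the dissertation's exact filterbank.

```python
import numpy as np

def gammatone_ir(fc, fs, dur=0.05, order=4):
    """Impulse response of a gammatone auditory filter centred at fc (Hz).
    The bandwidth follows the Glasberg-Moore equivalent rectangular
    bandwidth (ERB) formula, so filters widen with centre frequency,
    mimicking cochlear critical bands."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000 + 1)      # ERB in Hz
    b = 1.019 * erb                           # gammatone bandwidth parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))              # peak-normalise

fs = 16000
bank = [gammatone_ir(fc, fs) for fc in (200, 1000, 4000)]  # toy 3-filter bank
```

Filtering speech through such a bank, then taking log energies and a cosine transform, yields GTCC-style features in the same way the mel filterbank yields MFCCs.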

    Signal processing and acoustic modelling of speech signals for speech recognition systems

    Natural man-machine interaction is currently one of the most unfulfilled promises of automatic speech recognition (ASR). The purpose of an ASR system is to accurately transcribe or execute what has been said. State-of-the-art speech recognition systems consist of four basic modules: signal processing, acoustic modelling, language modelling, and the search engine. This thesis addresses the signal processing and acoustic modelling modules, pursuing optimal modelling of speech signals; the resulting modules can then feed the subsequent language modelling and search modules. Because the first-order hidden Markov model (HMM) is a mathematically well-established and tremendously successful paradigm, and hence the dominant technique in current speech recognition systems, all studies and experiments in this dissertation are based on HMMs. The HMM is a statistical framework that supports both acoustic and temporal modelling. It is widely used despite making a number of suboptimal modelling assumptions that limit its full potential. We investigate how model design strategies and algorithms can be adapted to HMMs, and present extensive experimental results demonstrating the relative effectiveness of each component within the HMM paradigm. This dissertation presents several strategies for improving the overall performance of baseline speech recognition systems; the implementation of these strategies was optimised in a series of experiments. We also investigate selecting optimal feature sets for improved recognition. Moreover, since the reliability of human speech recognition is attributed to specific properties of the auditory representation of speech, we explore perceptually inspired signal processing strategies such as critical-band frequency analysis.
The resulting speech representation, called Gammatone cepstral coefficients (GTCC), provides a relative improvement over the baseline recogniser. We also investigate multiple signal representations within a single ASR system to improve the recognition rate. Additionally, we developed fast techniques that are useful for evaluating and comparing different signal processing paradigms. The main contributions of this dissertation are:
• Speech/background discrimination.
• HMM initialisation techniques.
• Multiple signal representations with multi-stream paradigms.
• Gender-based modelling.
• Feature vector dimensionality reduction.
• Perceptually motivated feature sets.
• ASR training and recognition packages for research and development.
Many of these methods can be applied in practical applications, and the proposed techniques can be used directly in more complicated speech recognition systems by feeding their outputs to the language modelling and search engine modules.
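The abstract centres on HMM-based acoustic modelling, where the likelihood of an observation sequence is computed by summing over all hidden state paths. As a minimal illustrative sketch (not the thesis's implementation), the forward algorithm below scores a toy discrete-observation HMM; every parameter value here is invented for illustration.

```python
import numpy as np

# Hypothetical 2-state, 3-symbol HMM; all parameters are invented for illustration.
A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # per-state emission probabilities
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

def forward_likelihood(obs):
    """Forward algorithm: P(obs | model), summing over all state paths."""
    alpha = pi * B[:, obs[0]]          # initialise with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate through A, then emit
    return alpha.sum()
```

For this toy model, `forward_likelihood([0, 1, 2])` evaluates to about 0.0363; in practice the recursion is carried out in the log domain to avoid underflow on long utterances.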
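The GTCC contribution mentioned above follows the standard filterbank-cepstrum recipe: pass the signal through an auditory (gammatone) filterbank spaced on the ERB scale, take log energies, and decorrelate with a DCT. The sketch below is a crude whole-utterance approximation under stated assumptions (FIR-truncated 4th-order gammatone impulse responses, one energy per filter instead of framewise analysis, invented filter counts), not the thesis's actual feature extractor.

```python
import numpy as np

def erb_space(f_lo, f_hi, n):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000 + 1)
    pts = np.linspace(erb_rate(f_lo), erb_rate(f_hi), n)
    return (10 ** (pts / 21.4) - 1) * 1000 / 4.37

def gtcc(signal, fs, n_filters=20, n_ceps=13):
    """Crude GTCC sketch: gammatone filterbank -> log energies -> DCT-II."""
    t = np.arange(int(0.025 * fs)) / fs                # 25 ms impulse responses
    cfs = erb_space(100, 0.9 * fs / 2, n_filters)
    energies = []
    for cf in cfs:
        b = 1.019 * 24.7 * (4.37 * cf / 1000 + 1)      # bandwidth from ERB(cf)
        ir = t**3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * cf * t)
        y = np.convolve(signal, ir, mode="same")       # filter the utterance
        energies.append(np.log(np.sum(y**2) + 1e-12))  # floor avoids log(0)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_filters)))
    return dct @ np.array(energies)
```

A real front end would window the signal into short frames and compute one GTCC vector per frame; the utterance-level version here only shows the shape of the computation.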