9 research outputs found

    On segments and syllables in the sound structure of language: Curve-based approaches to phonology and the auditory representation of speech.

    http://msh.revues.org/document7813.html
    Recent approaches to the syllable reintroduce continuous, mathematically describable representations of sound objects conceived as 'curves'. Psycholinguistic research on spoken-language perception usually draws on symbolic and highly hierarchized models of the syllable that strictly differentiate segments (phones) from syllables. Recent work on the auditory bases of speech perception demonstrates listeners' ability to extract phonetic information even when the speech signal has been strongly degraded in the spectro-temporal domain. The implications of these observations for the modelling of syllables in speech perception and phonology are discussed.
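The curve-based view can be made concrete with a toy computation: a short-time energy envelope treats the utterance as one continuous curve on which a syllable appears as a rise and fall rather than as a string of discrete segments. This is only an illustrative sketch, not the model discussed in the paper; the window sizes and the synthetic signal are assumptions.

```python
import numpy as np

def energy_curve(signal, sr, win_ms=20.0, hop_ms=5.0):
    """Short-time energy envelope: one continuous 'curve' per utterance.

    Illustrative only (not the authors' formalism): a syllable shows up
    as a rise-fall hump in energy rather than as discrete segments.
    """
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win, hop)]
    return np.array([float(np.mean(f ** 2)) for f in frames])

# Toy signal: a vowel-like burst centred at 0.5 s between near-silences.
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 200 * t) * np.exp(-((t - 0.5) ** 2) / 0.01)
curve = energy_curve(sig, sr)
print(curve.argmax())  # peak near the temporal centre of the "syllable"
```

The single hump in `curve` is the kind of continuous object the curve-based approaches take as primitive.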

    Perceptual Restoration of Temporally Distorted Speech in L1 vs. L2: Local Time Reversal and Modulation Filtering

    Speech remains intelligible even when its temporal envelope is distorted. The current study investigates how native and non-native speakers perceptually restore temporally distorted speech. Participants were native English speakers (NS) and native Japanese speakers who spoke English as a second language (NNS). In Experiment 1, participants listened to "locally time-reversed speech," in which every x ms of the speech signal was reversed on the temporal axis. Here, local time reversal shifted constituents of the speech signal forward or backward from their original positions, and the amplitude envelope of the speech was altered as a function of reversed-segment length. In Experiment 2, participants listened to "modulation-filtered speech," in which the modulation frequency components of speech were low-pass filtered at a particular cut-off frequency. Here, the temporal envelope of the speech was altered as a function of cut-off frequency. The results suggest that speech becomes gradually unintelligible as the length of reversed segments increases (Experiment 1) and as a lower cut-off frequency is imposed (Experiment 2). Both experiments showed equivalent levels of speech intelligibility across the six levels of degradation for native and non-native speakers respectively, which raises the question of whether the regular occurrence of local time reversal can be analysed in the modulation frequency domain by simply converting the length of reversed segments (ms) into frequency (Hz).
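The local time reversal manipulation is simple to sketch: the signal is cut into fixed-length chunks and each chunk is played backwards. The windowing details of the actual stimuli may differ (e.g., tapered segment edges), and the ms-to-Hz conversion shown is just one plausible mapping (two adjacent reversed segments treated as one modulation period); the correct correspondence is exactly the open question the abstract raises.

```python
import numpy as np

def locally_time_reverse(signal, sr, segment_ms):
    """Reverse every consecutive segment_ms-long chunk of the signal.

    A minimal sketch of the stimulus manipulation described above;
    the study's exact windowing may differ.
    """
    seg = int(sr * segment_ms / 1000)
    out = signal.copy()
    for start in range(0, len(signal), seg):
        out[start:start + seg] = signal[start:start + seg][::-1]
    return out

def reversal_to_hz(segment_ms):
    # One plausible conversion (assumption): a reversed segment of x ms
    # paired with its neighbour forms a 2x-ms modulation period.
    return 1000.0 / (2.0 * segment_ms)

sr = 16000
sig = np.random.randn(sr)
distorted = locally_time_reverse(sig, sr, 50)   # 50-ms local reversal
print(reversal_to_hz(50))
```

Applying the reversal twice restores the original signal whenever the length divides evenly into segments, which is a convenient sanity check on the implementation.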

    Eros, Beauty, and Phon-Aesthetic Judgements of Language Sound. We Like It Flat and Fast, but Not Melodious. Comparing Phonetic and Acoustic Features of 16 European Languages

    This paper concerns aesthetic preferences for the sound of foreign European languages. We investigated the phonetic-acoustic dimension of linguistic aesthetic pleasure in order to describe the "music" found in European languages. The Romance languages French, Italian, and Spanish take the lead when people speak of melodious language – the music-like effects in language (a.k.a. phonetic chill). At the other end of the melodiousness spectrum are German and Arabic, which are often considered harsh- and unattractive-sounding. Despite the public interest, limited research has been conducted on phonaesthetics, i.e., the subfield of phonetics concerned with the aesthetic properties of speech sounds (Crystal, 2008). Our goal is to fill this research gap by identifying the acoustic features that drive the auditory perception of the beauty of language sound. What is so music-like in a language that makes people say "it is music to my ears"? Forty-five central European participants listened to 16 auditorily presented European languages and rated each language on 22 binary characteristics (e.g., beautiful – ugly, funny – boring), and also indicated their familiarity with each language, their L2 background, their liking of the speaker's voice, their demographics, and their level of musicality. Findings revealed that all of these factors, in complex interplay, explain a share of the variance: familiarity and expertise in foreign languages, speaker voice characteristics, phonetic complexity, musical acoustic properties, and the musical expertise of the listener. The most important discovery was a trade-off between speech tempo and so-called linguistic melody (pitch variance): the faster the language, the flatter/more atonal it is in pitch (speech melody), making it highly appealing acoustically (sounding beautiful and sexy) but not so melodious in a "musical" sense.
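The two ends of the reported trade-off can be reduced to two numbers per sample: speech tempo (syllables per second) and pitch variability on a musical scale (standard deviation of f0 in semitones). The sketch below is an illustrative reduction of the paper's feature set, not its pipeline; the f0 tracks and syllable counts are made-up toy data.

```python
import numpy as np

def melody_vs_tempo(f0_hz, n_syllables, duration_s):
    """Two features behind the reported trade-off (illustrative only):
    tempo in syllables/s and 'linguistic melody' as the standard
    deviation of f0 in semitones relative to the median pitch.
    """
    voiced = f0_hz[f0_hz > 0]                       # drop unvoiced frames
    semitones = 12 * np.log2(voiced / np.median(voiced))
    return {
        "tempo_syll_per_s": n_syllables / duration_s,
        "melody_semitone_sd": float(np.std(semitones)),
    }

# A flat, fast sample vs. a slow sample with a strong pitch contour.
flat_fast = melody_vs_tempo(np.full(100, 120.0), n_syllables=14, duration_s=2.0)
melodious = melody_vs_tempo(120.0 * 2 ** (np.sin(np.linspace(0, 6, 100)) / 2),
                            n_syllables=8, duration_s=2.0)
print(flat_fast, melodious)
```

On these toy inputs the first sample scores high tempo and zero melody, the second the reverse, mirroring the flat/fast vs. melodious contrast in the findings.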

    Windows into Sensory Integration and Rates in Language Processing: Insights from Signed and Spoken Languages

    This dissertation explores the hypothesis that language processing proceeds in "windows" that correspond to representational units, where sensory signals are integrated according to time-scales that correspond to the rate of the input. To investigate universal mechanisms, a comparison of signed and spoken languages is necessary. Underlying the seemingly effortless process of language comprehension is the perceiver's knowledge about the rate at which linguistic form and meaning unfold in time and the ability to adapt to variations in the input. The vast body of work in this area has focused on speech perception, where the goal is to determine how linguistic information is recovered from acoustic signals. Testing some of these theories in the visual processing of American Sign Language (ASL) provides a unique opportunity to better understand how sign languages are processed and which aspects of speech perception models are in fact about language perception across modalities. The first part of the dissertation presents three psychophysical experiments investigating temporal integration windows in sign language perception by testing the intelligibility of locally time-reversed sentences. The findings demonstrate the contribution of modality for the time-scales of these windows, where signing is successively integrated over longer durations (~ 250-300 ms) than in speech (~ 50-60 ms), while also pointing to modality-independent mechanisms, where integration occurs in durations that correspond to the size of linguistic units. The second part of the dissertation focuses on production rates in sentences taken from natural conversations of English, Korean, and ASL. Data from word, sign, morpheme, and syllable rates suggest that while the rate of words and signs can vary from language to language, the relationship between the rate of syllables and morphemes is relatively consistent among these typologically diverse languages. 
The results from rates in ASL also complement the findings of the perception experiments by confirming that the time-scales at which phonological units fluctuate in production match the temporal integration windows in perception. These results are consistent with the hypothesis that there are modality-independent time pressures on language processing; the discussion provides a synthesis of converging findings from other domains of research and proposes ideas for future investigations.
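The claimed convergence between perception and production can be checked with simple arithmetic: an integration window of w ms corresponds to a unit rate of 1000 / w units per second. The window values below are the ones reported in the abstract; mapping them to rates this way is an illustrative reading, not the dissertation's analysis.

```python
# Window-to-rate arithmetic: a w-ms integration window corresponds to
# 1000 / w units per second. Windows below are those reported above.
def window_to_rate(window_ms):
    return 1000.0 / window_ms

speech_window_ms = (50, 60)      # segment-scale integration in speech
sign_window_ms = (250, 300)      # sign-scale integration in ASL

print([round(window_to_rate(w), 1) for w in speech_window_ms])  # ~16.7-20 units/s
print([round(window_to_rate(w), 1) for w in sign_window_ms])    # ~3.3-4 units/s
```

The resulting rates fall in the range of segment rates in speech and sign rates in ASL, which is the sense in which window size tracks unit rate across modalities.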

    CORTICAL DYNAMICS OF AUDITORY-VISUAL SPEECH: A FORWARD MODEL OF MULTISENSORY INTEGRATION.

    In noisy settings, seeing the interlocutor's face helps to disambiguate what is being said. For this to happen, the brain must integrate auditory and visual information. Three major problems are (1) bringing together separate sensory streams of information, (2) extracting auditory and visual speech information, and (3) identifying this information as a unified auditory-visual percept. In this dissertation, a new representational framework for auditory-visual (AV) speech integration is offered. The experimental work (psychophysics and electrophysiology (EEG)) suggests specific neural mechanisms for solving problems (1), (2), and (3) that are consistent with a (forward) 'analysis-by-synthesis' view of AV speech integration. In Chapter I, multisensory perception and integration are reviewed, and a unified conceptual framework serves as background for the study of AV speech integration. In Chapter II, psychophysical tests of the perception of desynchronized AV speech inputs show the existence of a ~250 ms temporal window of integration for AV speech. In Chapter III, an EEG study shows that visual speech modulates the neural processing of auditory speech at an early stage. Two functionally independent modulations are (i) a ~250 ms amplitude reduction of auditory evoked potentials (AEPs) and (ii) a systematic temporal facilitation of the same AEPs as a function of the saliency of visual speech. In Chapter IV, an EEG study of desynchronized AV speech inputs shows that (i) fine-grained (gamma, ~25 ms) and (ii) coarse-grained (theta, ~250 ms) neural mechanisms simultaneously mediate the processing of AV speech. In Chapter V, a new illusory effect is proposed, in which non-speech visual signals modify the perceptual quality of auditory objects. EEG results show very different patterns of activation from those observed in AV speech integration, and an MEG experiment is subsequently proposed to test hypotheses on the origins of these differences.
In Chapter VI, the 'analysis-by-synthesis' model of AV speech integration is contrasted with major speech theories. From a cognitive neuroscience perspective, the 'analysis-by-synthesis' model is argued to offer the most sensible representational system for AV speech integration. This thesis shows that AV speech integration results from both the statistical nature of stimulation and the inherent predictive capabilities of the nervous system.
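The ~250 ms temporal window of integration from Chapter II admits a very simple operational reading: audio-visual onset asynchronies that fall inside the window are perceived as a unified AV event. The symmetric window and the sample lags below are illustrative assumptions, not the study's psychometric fit (in practice such windows are typically asymmetric).

```python
# Toy reading of the ~250 ms AV temporal window of integration:
# lags inside the window are treated as one unified percept.
# The symmetric boundary is an illustrative simplification.
def integrated(av_lag_ms, window_ms=250):
    """True if the audio-visual onset lag falls within the window."""
    return abs(av_lag_ms) <= window_ms

lags = [-400, -200, 0, 100, 300]
print([integrated(lag) for lag in lags])  # [False, True, True, True, False]
```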

    Discriminative connectionist approaches for automatic speech recognition in cars

    The first part of this thesis is devoted to the evaluation of approaches that exploit the inherent redundancy of the speech signal to improve noise robustness. On the basis of this evaluation on the AURORA 2000 database, we study two of the evaluated approaches in further detail. The first is the hybrid RBF/HMM approach, an attempt to combine the superior classification performance of radial basis functions (RBFs) with the ability of HMMs to model time variation. The second uses neural networks to non-linearly reduce the dimensionality of large feature vectors that include context frames; we propose different MLP topologies for this purpose. Experiments on the AURORA 2000 database reveal that the performance of the first approach is similar to that of systems based on semi-continuous HMMs (SCHMMs). The second approach cannot outperform linear discriminant analysis (LDA) on a database recorded in real car environments, but it is on average significantly better than LDA on the AURORA 2000 database.
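The "large feature vectors including context frames" that the MLP reduces are built by concatenating each frame with its neighbours. The sketch below shows only this input construction step; an MLP with a narrow hidden layer (or LDA as the linear baseline) would then project the stacked vectors down. Frame counts, feature dimensionality, and context width are illustrative assumptions, not the thesis's settings.

```python
import numpy as np

def stack_context(features, context=4):
    """Concatenate each frame with its +/- `context` neighbours
    (edges padded by repetition) to form the large input vectors
    that an MLP bottleneck or LDA then reduces. Illustrative only.
    """
    padded = np.concatenate([
        np.repeat(features[:1], context, axis=0),
        features,
        np.repeat(features[-1:], context, axis=0),
    ])
    return np.stack([
        padded[i:i + 2 * context + 1].ravel()
        for i in range(len(features))
    ])

frames = np.random.randn(100, 13)   # e.g. 13 cepstral coefficients per frame
big = stack_context(frames, context=4)
print(big.shape)                    # (100, 117): 9 frames x 13 dims
```

The dimensionality jump from 13 to 117 per frame is what motivates a reduction stage before the HMM back-end.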

    Signal processing and acoustic modelling of speech signals for speech recognition systems

    No full text
    Natural man-machine interaction is currently one of the most unfulfilled pledges of automatic speech recognition (ASR). The purpose of an automatic speech recognition system is to accurately transcribe or execute what has been said. State-of-the-art speech recognition systems consist of four basic modules: signal processing, acoustic modelling, language modelling, and the search engine. The subject of this thesis is the signal processing and acoustic modelling modules. We pursue optimal modelling of spoken signals so that the resulting modules can successfully serve the subsequent two modules. Since the first-order hidden Markov model (HMM) is a tremendously successful, mathematically well-established paradigm, which has made it the technique of choice in current speech recognition systems, this dissertation bases all its studies and experiments on HMMs. The HMM is a statistical framework that supports both acoustic and temporal modelling. It is widely used despite making a number of suboptimal modelling assumptions that limit its full potential. We investigate how model design strategies and algorithms can be adapted to HMMs. Extensive experimental results are presented to demonstrate the relative effectiveness of each component within the HMM paradigm. This dissertation presents several strategies for improving the overall performance of baseline speech recognition systems, and the implementation of these strategies was optimised in a series of experiments. We also investigate selecting optimal feature sets for improving speech recognition. Moreover, since the reliability of human speech recognition is attributed to specific properties of the auditory representation of speech, we explore the use of perceptually inspired signal processing strategies, such as critical-band frequency analysis.
    The resulting speech representation, Gammatone cepstral coefficients (GTCC), provides a relative improvement over the baseline recogniser. We also investigate multiple signal representations within an ASR system to improve the recognition rate. Additionally, we developed fast techniques that are useful for evaluating and comparing different signal processing paradigms. The main contributions of this dissertation are:
    • Speech/background discrimination.
    • HMM initialisation techniques.
    • Multiple signal representations with multi-stream paradigms.
    • Gender-based modelling.
    • Feature vector dimensionality reduction.
    • Perceptually motivated feature sets.
    • ASR training and recognition packages for research and development.
    Many of these methods can be applied in practical applications, and the proposed techniques can be used directly in more complicated speech recognition systems by passing their outputs to the language-modelling and search-engine modules.

    Signal processing and acoustic modelling of speech signals for speech recognition systems

    No full text
    Natural man-machine interaction remains one of the unfulfilled promises of automatic speech recognition (ASR). The purpose of an ASR system is to accurately transcribe, or act upon, what has been said. State-of-the-art speech recognition systems consist of four basic modules: signal processing, acoustic modelling, language modelling, and the search engine. This thesis is concerned with the first two, the signal processing and acoustic modelling modules. We pursue optimal modelling of the spoken signal, so that the resulting modules can serve the subsequent two modules effectively. Because the first-order hidden Markov model (HMM) is a mathematically well-founded and highly successful paradigm, and remains the dominant technique in current speech recognition systems, this dissertation bases all of its studies and experiments on HMMs. The HMM is a statistical framework that supports both acoustic and temporal modelling. It is widely used despite making a number of suboptimal modelling assumptions that limit its full potential. We investigate how model design strategies and training algorithms can be adapted within the HMM framework, and present extensive experimental results that expose the relative effectiveness of each component of the HMM paradigm. This dissertation presents several strategies for improving the overall performance of baseline speech recognition systems; the implementation of these strategies was optimised in a series of experiments. We also investigate selecting optimal feature sets for improved recognition. Moreover, since the reliability of human speech recognition is attributed to specific properties of the auditory representation of speech, we explore perceptually inspired signal processing strategies, such as critical-band frequency analysis.
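The HMM's joint acoustic and temporal modelling can be illustrated with the forward algorithm, which scores a sequence of feature vectors against a model. The sketch below is a generic NumPy illustration, not code from the thesis: the left-to-right transition matrix, diagonal-covariance Gaussian state emissions, and all toy numbers are invented for the example.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, means, variances):
    """Log-likelihood of a feature-vector sequence under a diagonal-covariance
    Gaussian HMM, computed with the forward algorithm in the log domain."""
    T, d = obs.shape
    # Per-frame, per-state Gaussian log observation likelihoods, shape (T, N).
    log_b = -0.5 * (
        d * np.log(2.0 * np.pi)
        + np.log(variances).sum(axis=1)
        + (((obs[:, None, :] - means) ** 2) / variances).sum(axis=2)
    )
    log_alpha = np.log(pi) + log_b[0]               # initialisation
    for t in range(1, T):                           # recursion over frames
        m = log_alpha.max()                         # scale to avoid underflow
        log_alpha = m + np.log(np.exp(log_alpha - m) @ A) + log_b[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())  # termination

# Toy numbers (all invented): 3 emitting states in a left-to-right topology,
# 2-dimensional feature vectors, 5 frames.
rng = np.random.default_rng(0)
pi = np.array([0.98, 0.01, 0.01])                   # initial state probabilities
A = np.array([[0.6, 0.4, 0.0],                      # row-stochastic transitions
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]])
variances = np.ones((3, 2))
obs = rng.normal(size=(5, 2))                       # stand-in feature vectors
print(forward_log_likelihood(obs, pi, A, means, variances))
```

The same quantity could be obtained by summing over all state paths explicitly; the forward recursion reduces that exponential sum to O(T·N²) work, which is why it underlies both HMM decoding and Baum-Welch training.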
The resulting speech representation, called Gammatone cepstral coefficients (GTCC), provides a relative improvement over the baseline recogniser. We also investigate combining multiple signal representations in an ASR system to improve the recognition rate. Additionally, we develop fast techniques that are useful for evaluating and comparing different signal processing paradigms. The main contributions of this dissertation are:

• Speech/background discrimination.
• HMM initialisation techniques.
• Multiple signal representations with multi-stream paradigms.
• Gender-based modelling.
• Feature vector dimensionality reduction.
• Perceptually motivated feature sets.
• ASR training and recognition packages for research and development.

Many of these methods can be applied in practical settings: the proposed techniques can be used directly in more complex speech recognition systems by passing their outputs to the language modelling and search engine modules.

Unpublished