2,589 research outputs found

    Porting concepts from DNNs back to GMMs

    Get PDF
    Deep neural networks (DNNs) have been shown to outperform Gaussian Mixture Models (GMM) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from the DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Regardless of their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination

    Croatian Speech Recognition

    Get PDF

    Text-Independent Automatic Speaker Identification Using Partitioned Neural Networks

    Get PDF
    This dissertation introduces a binary partitioned approach to statistical pattern classification which is applied to talker identification using neural networks. In recent years artificial neural networks have been shown to work exceptionally well for small but difficult pattern classification tasks. However, their application to large tasks (i.e., having more than ten to 20 categories) is limited by a dramatic increase in required training time. The time required to train a single network to perform N-way classification is nearly proportional to the exponential of N. In contrast, the binary partitioned approach requires training times on the order of N2. Besides partitioning, other related issues were investigated such as acoustic feature selection for speaker identification and neural network optimization. The binary partitioned approach was used to develop an automatic speaker identification system for 120 male and 130 female speakers of a standard speech data base. The system performs with 100% accuracy in a text-independent mode when trained with about nine to 14 seconds of speech and tested with six to eight seconds of speech

    Articulatory features for conversational speech recognition

    Get PDF

    Irrelevant variability normalization in learning HMM state tying from data based on phonetic decision-tree

    Get PDF
    We propose to apply the concept of irrelevant variability normalization to the general problem of learning structure from data. Because of the problems of a diversified training data set and/or possible acoustic mismatches between training and testing conditions, the structure learned from the training data by using a maximum likelihood training method will not necessarily generalize well on mismatched tasks. We apply the above concept to the structural learning problem of phonetic decision-tree based hidden Markov model (HMM) state tying. We present a new method that integrates a linear-transformation based normalization mechanism into the decision-tree construction process to make the learned structure have a better modeling capability and generalizability. The viability and efficacy of the proposed method are confirmed in a series of experiments for continuous speech recognition of Mandarin Chinese.published_or_final_versio

    CLASS - A Study of methods for coarse phonetic classification

    Get PDF
    The objective of this thesis was to examine computer techniques for classifying speech signals into four coarse phonetic classes: vowel-like, strong fricative, weak fricative and silence. The study compared classification results from the K-means clustering algorithm using Euclidian distance measurements with classification using a multivariate maximum likelihood distance measure. In addition to the comparison of statistical methods, this study compared classification using several tree-structured decision making processes. The system was trained on ten speakers using 98 utterances with both known and unknown speakers. Results showed very little difference between the Euclidian distance and maximum likelihood; however, the introduction of the tree structure on both systems had a positive influence on their performance

    PHONOTACTIC AND ACOUSTIC LANGUAGE RECOGNITION

    Get PDF
    Práce pojednává o fonotaktickém a akustickém přístupu pro automatické rozpoznávání jazyka. První část práce pojednává o fonotaktickém přístupu založeném na výskytu fonémových sekvenci v řeči. Nejdříve je prezentován popis vývoje fonémového rozpoznávače jako techniky pro přepis řeči do sekvence smysluplných symbolů. Hlavní důraz je kladen na dobré natrénování fonémového rozpoznávače a kombinaci výsledků z několika fonémových rozpoznávačů trénovaných na různých jazycích (Paralelní fonémové rozpoznávání následované jazykovými modely (PPRLM)). Práce také pojednává o nové technice anti-modely v PPRLM a studuje použití fonémových grafů místo nejlepšího přepisu. Na závěr práce jsou porovnány dva přístupy modelování výstupu fonémového rozpoznávače -- standardní n-gramové jazykové modely a binární rozhodovací stromy. Hlavní přínos v akustickém přístupu je diskriminativní modelování cílových modelů jazyků a první experimenty s kombinací diskriminativního trénování a na příznacích, kde byl odstraněn vliv kanálu. Práce dále zkoumá různé druhy technik fúzi akustického a fonotaktického přístupu. Všechny experimenty jsou provedeny na standardních datech z NIST evaluaci konané v letech 2003, 2005 a 2007, takže jsou přímo porovnatelné s výsledky ostatních skupin zabývajících se automatickým rozpoznáváním jazyka. S fúzí uvedených technik jsme posunuli state-of-the-art výsledky a dosáhli vynikajících výsledků ve dvou NIST evaluacích.This thesis deals with phonotactic and acoustic techniques for automatic language recognition (LRE). The first part of the thesis deals with the phonotactic language recognition based on co-occurrences of phone sequences in speech. A thorough study of phone recognition as tokenization technique for LRE is done, with focus on the amounts of training data for phone recognizer and on the combination of phone recognizers trained on several language (Parallel Phone Recognition followed by Language Model - PPRLM). The thesis also deals with novel technique of anti-models in PPRLM and investigates into using phone lattices instead of strings. The work on phonotactic approach is concluded by a comparison of classical n-gram modeling techniques and binary decision trees. The acoustic LRE was addressed too, with the main focus on discriminative techniques for training target language acoustic models and on initial (but successful) experiments with removing channel dependencies. We have also investigated into the fusion of phonotactic and acoustic approaches. All experiments were performed on standard data from NIST 2003, 2005 and 2007 evaluations so that the results are directly comparable to other laboratories in the LRE community. With the above mentioned techniques, the fused systems defined the state-of-the-art in the LRE field and reached excellent results in NIST evaluations.
    corecore