1,296 research outputs found

    Context-Dependent Acoustic Modeling without Explicit Phone Clustering

    Full text link
    Phoneme-based acoustic modeling of large vocabulary automatic speech recognition takes advantage of phoneme context. The large number of context-dependent (CD) phonemes and their highly varying statistics require tying or smoothing to enable robust training. Usually, Classification and Regression Trees are used for phonetic clustering, which is standard in Hidden Markov Model (HMM)-based systems. However, this solution introduces a secondary training objective and does not allow for end-to-end training. In this work, we address a direct phonetic context modeling for the hybrid Deep Neural Network (DNN)/HMM, that does not build on any phone clustering algorithm for the determination of the HMM state inventory. By performing different decompositions of the joint probability of the center phoneme state and its left and right contexts, we obtain a factorized network consisting of different components, trained jointly. Moreover, the representation of the phonetic context for the network relies on phoneme embeddings. The recognition accuracy of our proposed models on the Switchboard task is comparable and outperforms slightly the hybrid model using the standard state-tying decision trees.Comment: Submitted to Interspeech 202

    An introduction to statistical parametric speech synthesis

    Get PDF

    Speech Synthesis Based on Hidden Markov Models

    Get PDF

    Capacity and Complexity of HMM Duration Modeling Techniques

    Get PDF
    The ability of a standard hidden Markov model (HMM) or expanded state HMM (ESHMM) to accurately model duration distributions of phonemes is compared with specific duration-focused approaches such as semi-Markov models or variable transition probabilities. It is demonstrated that either a three-state ESHMM or a standard HMM with an increased number of states is capable of closely matching both Gamma distributions and duration distributions of phonemes from the TIMIT corpus, as measured by Bhattacharyya distance to the true distributions. Standard HMMs are easily implemented with off-the-shelf tools, whereas duration models require substantial algorithmic development and have higher computational costs when implemented, suggesting that a simple adjustment to HMM topologies is perhaps a more efficient solution to the problem of duration than more complex approaches

    Improving large vocabulary continuous speech recognition by combining GMM-based and reservoir-based acoustic modeling

    Get PDF
    In earlier work we have shown that good phoneme recognition is possible with a so-called reservoir, a special type of recurrent neural network. In this paper, different architectures based on Reservoir Computing (RC) for large vocabulary continuous speech recognition are investigated. Besides experiments with HMM hybrids, it is shown that a RC-HMM tandem can achieve the same recognition accuracy as a classical HMM, which is a promising result for such a fairly new paradigm. It is also demonstrated that a state-level combination of the scores of the tandem and the baseline HMM leads to a significant improvement over the baseline. A word error rate reduction of the order of 20\% relative is possible

    Automatic Speech Segmentation Based on HMM

    Get PDF
    This contribution deals with the problem of automatic phoneme segmentation using HMMs. Automatization of speech segmentation task is important for applications, where large amount of data is needed to process, so manual segmentation is out of the question. In this paper we focus on automatic segmentation of recordings, which will be used for triphone synthesis unit database creation. For speech synthesis, the speech unit quality is a crucial aspect, so the maximal accuracy in segmentation is needed here. In this work, different kinds of HMMs with various parameters have been trained and their usefulness for automatic segmentation is discussed. At the end of this work, some segmentation accuracy tests of all models are presented

    Croatian Speech Recognition

    Get PDF

    HTK - Tutorial (Part I + II)

    Get PDF
    corecore