
    Articulatory Control of HMM-based Parametric Speech Synthesis using Feature-Space-Switched Multiple Regression


    Vowel Creation by Articulatory Control in HMM-based Parametric Speech Synthesis

    This paper presents a method to produce a new vowel by articulatory control in hidden Markov model (HMM) based parametric speech synthesis. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as external auxiliary variables. The dependency between acoustic and articulatory features is modelled by a group of linear transforms that are either estimated context-dependently or determined by the distribution of articulatory features. Vowel identity is removed from the set of context features in order to ensure compatibility between the context-dependent model parameters and the articulatory features of a new vowel. At synthesis time, acoustic features are predicted according to the input articulatory features as well as context information. With an appropriate articulatory feature sequence, a new vowel can be generated even when it does not exist in the training set. Experimental results show that this method is effective in creating the English vowel /ʌ/ by articulatory control, without using any acoustic samples of that vowel.
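    As a rough illustration of the prediction step described above, the following sketch (in Python, with hypothetical parameter names) shifts a state's acoustic mean by a linear transform of the input articulatory features, which is the mechanism that lets an unseen vowel be synthesised from an articulatory target alone. The dimensions and values are illustrative assumptions, not the paper's actual configuration.

        import numpy as np

        # Minimal sketch of multiple-regression HMM (MRHMM) output prediction,
        # assuming per-state parameters: a bias mean vector `b` and a
        # regression matrix `A` (both hypothetical names). The state's
        # acoustic mean is shifted by a linear transform of the articulatory
        # feature vector supplied at synthesis time.

        def mrhmm_state_mean(b, A, articulatory_feats):
            """Acoustic mean for one state: mu = b + A @ x."""
            return b + A @ articulatory_feats

        # Toy example: 3-dim acoustic feature, 2-dim articulatory feature
        # (e.g. tongue height and frontness). With a new articulatory target,
        # the model yields acoustics for a vowel absent from the training set.
        b = np.array([1.0, 0.5, -0.2])        # context-dependent bias mean
        A = np.array([[0.8, 0.1],
                      [0.0, 0.6],
                      [0.3, -0.4]])           # articulatory-to-acoustic transform
        x_new_vowel = np.array([0.2, -0.7])   # articulatory target for unseen vowel

        print(mrhmm_state_mean(b, A, x_new_vowel))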

    Speaker-Independent Mel-cepstrum Estimation from Articulator Movements Using D-vector Input


    Modelling Speech Dynamics with Trajectory-HMMs

    The conditional independence assumption imposed by hidden Markov models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending the first- and second-order regression coefficients to the observation feature vectors. Although this leads to improved performance in recognition tasks, we argue that a straightforward use of dynamic features in HMMs will result in an inferior model, due to the incorrect handling of dynamic constraints. In this thesis I show that an HMM can be transformed into a Trajectory-HMM capable of generating smoothed output mean trajectories, by performing a per-utterance normalisation. The resulting model can be trained by either maximising model log-likelihood or minimising mean generation errors on the training data. To combat the exponential growth of paths in searching, the idea of delayed path merging is proposed, and a new time-synchronous decoding algorithm built on the concept of token passing is designed for use in the recognition task. The Trajectory-HMM brings a new way of sharing knowledge between speech recognition and synthesis components, by tackling both problems in a coherent statistical framework. I evaluated the Trajectory-HMM on two different speech tasks using the speaker-dependent MOCHA-TIMIT database. First, as a generative model to recover articulatory features from the speech signal, where the Trajectory-HMM was used in a complementary way to conventional HMM modelling techniques, within a joint acoustic-articulatory framework. Experiments indicate that the jointly trained acoustic-articulatory models are more accurate (having a lower root mean square error) than the separately trained ones, and that Trajectory-HMM training results in greater accuracy compared with conventional Baum-Welch parameter updating. In addition, the root mean square (RMS) training objective proves to be consistently better than the maximum likelihood objective. However, experiments on the phone recognition task show that the MLE-trained Trajectory-HMM, while retaining the attractive properties of a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses, and does not perform better than a conventional HMM. We use this to build an argument that models giving a better fit on training data may suffer a reduction in discrimination by being too faithful to the training data. Finally, experiments using triphone models show that increasing modelling detail is an effective way to improve modelling performance with little added complexity in training.
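    The smoothed-trajectory generation at the heart of the Trajectory-HMM can be sketched as follows: given static and delta means and precisions along a state sequence, the maximum-likelihood static trajectory is the solution of a linear system. This minimal Python sketch assumes a first-order delta window and one-dimensional streams; it illustrates the general technique, not the thesis's implementation.

        import numpy as np

        # Minimal sketch of the trajectory solution underlying the
        # Trajectory-HMM: given per-frame static+delta means and precisions,
        # the smoothed static trajectory is c = (W' P W)^{-1} W' P mu,
        # where W appends first-order deltas to the statics.

        def delta_window_matrix(T):
            """Stack identity (statics) and a central-difference delta operator."""
            I = np.eye(T)
            D = np.zeros((T, T))
            for t in range(T):
                lo, hi = max(t - 1, 0), min(t + 1, T - 1)
                D[t, hi] += 0.5
                D[t, lo] -= 0.5
            return np.vstack([I, D])            # shape (2T, T)

        def trajectory(mu, prec):
            """Solve the per-utterance normalised ML static trajectory."""
            T = mu.shape[0] // 2
            W = delta_window_matrix(T)
            WP = W.T * prec                     # W' P, with P = diag(prec)
            return np.linalg.solve(WP @ W, WP @ mu)

        # Toy example: 5 frames; the step-like static means get smoothed
        # because the zero delta means penalise abrupt frame-to-frame changes.
        mu = np.concatenate([[0, 0, 1, 1, 1], np.zeros(5)]).astype(float)
        prec = np.concatenate([np.full(5, 1.0), np.full(5, 4.0)])
        print(trajectory(mu, prec))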

    Articulatory features for conversational speech recognition


    Acoustic Modelling for Under-Resourced Languages

    Automatic speech recognition systems have so far been developed for only very few of the 4,000-7,000 existing languages. In this thesis we examine methods to rapidly create acoustic models for new, possibly under-resourced languages in a time- and cost-effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.
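    One way to make the multilingual idea concrete is bootstrapping by phone mapping: for each phone of the target language, clone the closest model from a multilingual inventory. The Python sketch below is a hypothetical illustration of that idea only; the phone sets, the mapping table, and the model store are invented for the example, and a real system would refine the cloned models with whatever target-language data exists.

        # Hypothetical multilingual model inventory (phone -> trained model).
        multilingual_models = {"a": "model_a", "i": "model_i", "t": "model_t",
                               "s": "model_s", "k": "model_k"}

        # Hand-crafted (e.g. IPA-based) mapping from target-language phones
        # to the closest multilingual phone.
        phone_map = {"aa": "a", "ii": "i", "tt": "t", "sh": "s", "q": "k"}

        def bootstrap_models(target_phones, phone_map, source_models):
            """Clone the nearest multilingual model for each target phone."""
            return {p: source_models[phone_map[p]] for p in target_phones}

        print(bootstrap_models(["aa", "sh", "q"], phone_map, multilingual_models))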

    Modelling and Interpolation of Austrian German and Viennese

    An HMM-based speech synthesis framework is applied to both Standard Austrian German and a Viennese dialectal variety, and several training strategies for multi-dialect modelling, such as dialect clustering and dialect-adaptive training, are investigated. To bridge the gap between processing on the level of HMMs and on the linguistic level, we add phonological transformations to the HMM interpolation and apply them to dialect interpolation. The crucial step is to employ several formalised phonological rules between Austrian German and the Viennese dialect as constraints for the HMM interpolation. We verify the effectiveness of this strategy in a number of perceptual evaluations. Since the HMM space used is acoustic rather than articulatory, there is some variation in the evaluation results between the phonological rules. However, in general we obtained good evaluation results, which show that listeners can perceive both continuous and categorical changes of dialect variety when phonological transformations are employed as switching rules in the HMM interpolation.
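    A minimal sketch of the interpolation scheme described above, assuming Gaussian state means and a single interpolation weight: means are blended linearly, while a formalised phonological rule acts as a categorical switch at an assumed threshold. All names, values, and the 0.5 threshold are illustrative, not the paper's configuration.

        import numpy as np

        def interpolate_mean(mu_standard, mu_viennese, alpha):
            """Linear interpolation of Gaussian state means, alpha in [0, 1]."""
            return (1.0 - alpha) * mu_standard + alpha * mu_viennese

        def apply_switching_rule(phone_standard, phone_viennese, alpha,
                                 threshold=0.5):
            """Categorical phonological rule: switch phone identity at threshold."""
            return phone_viennese if alpha >= threshold else phone_standard

        mu_at = np.array([1.0, 0.2])   # Standard Austrian German state mean
        mu_vd = np.array([0.4, 0.9])   # Viennese dialect state mean
        for alpha in (0.0, 0.25, 0.75, 1.0):
            print(alpha, interpolate_mean(mu_at, mu_vd, alpha),
                  apply_switching_rule("a", "ɔ", alpha))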

    The application of continuous state HMMs to an automatic speech recognition task

    Hidden Markov models (HMMs) have been a popular choice for automatic speech recognition (ASR) for several decades due to their mathematical formulation and computational efficiency, which for much of that period consistently yielded better performance than other methods. However, HMMs are based on an assumption of statistical independence among speech frames, which conflicts with the physiological basis of speech production. Consequently, researchers have produced a substantial body of literature extending the HMM's model assumptions to incorporate the dynamic properties of speech into the underlying model. One such approach involves segmental models, which address the frame-wise independence assumption; however, the computational inefficiencies associated with segmental models have limited their practical application. In recent years, there has been a shift from HMM-based systems to neural networks (NNs) and deep learning approaches, which offer superior performance compared to conventional statistical models. However, as the complexity of neural models increases, so does the number of parameters involved, increasing the dependency on training data for optimising model parameters. The present study extends prior research on segmental HMMs by introducing a Segmental Continuous-State Hidden Markov Model (CSHMM) that addresses the issue of inter-segmental continuity. This is an alternative to contemporary speech modelling methods that rely on data-centric NN techniques, with the goal of establishing a statistical model that more accurately reflects the speech production process. The continuous-state segmental model offers a flexible mathematical framework which can impose a continuity constraint between adjoining segments, addressing a fundamental drawback of conventional HMMs, namely the independence assumption. Additionally, the CSHMM benefits from practical training and decoding algorithms which overcome the computational inefficiency inherent in conventional decoding algorithms for traditional segmental HMMs. This study formulates four trajectory-based segmental models using a CSHMM framework. CSHMMs have not been extensively studied for ASR tasks, owing to the absence of open-source standardised speech toolkits that enable convenient exploration of CSHMMs; as a result, training and decoding software was developed for this study, which can be accessed in (Seivwright, 2015). The experiments in this study report baseline phone recognition results for the four distinct segmental CSHMM systems using the TIMIT database. These baseline results are compared against a simple hidden Markov model-Gaussian mixture model (HMM-GMM) system. In all experiments, a compact acoustic feature representation in the form of bottleneck features (BNFs) is employed, motivated by an investigation into BNFs and their relationship to articulatory properties. Although the proposed CSHMM systems do not surpass discrete-state HMMs in performance, this research demonstrates a strong association between inter-segmental continuity and the corresponding phonetic categories being modelled. Furthermore, this thesis presents a method for achieving finer control over continuity between segments, which can be extended to investigate co-articulation in the context of CSHMMs.
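    The continuity constraint that distinguishes the segmental CSHMM from a conventional HMM can be illustrated with a toy linear-trajectory model, in which each segment is forced to begin where the previous segment ended. This Python sketch is an assumption-laden illustration of the idea; the linear trajectory form and all parameter names are invented for the example, not taken from the study's actual model.

        import numpy as np

        def segment_trajectory(start, slope, duration):
            """Linear mean trajectory over one segment: m(t) = start + slope * t."""
            t = np.arange(duration)
            return start + slope * t

        def continuous_state_sequence(initial, segments):
            """Chain segments so each starts at the previous segment's endpoint,
            removing the inter-segment independence of a conventional HMM."""
            trajectory, state = [], initial
            for slope, duration in segments:
                seg = segment_trajectory(state, slope, duration)
                trajectory.append(seg)
                state = seg[-1] + slope   # hand over the continuous state
            return np.concatenate(trajectory)

        # Toy example: three segments with different slopes join without jumps.
        print(continuous_state_sequence(0.0, [(0.5, 4), (-0.2, 5), (0.1, 3)]))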