
    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as `beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation arises from a combination of factors including speech style and speaking rate, a phenomenon commonly known as `coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal. Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs.
Two natural speech databases, X-ray microbeam and Aurora-2, were annotated, where the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR systems.

    Estimating articulatory parameters from the acoustic speech signal


    Articulatory-WaveNet: Deep Autoregressive Model for Acoustic-to-Articulatory Inversion

    Acoustic-to-Articulatory Inversion, the estimation of articulatory kinematics from speech, is an important problem which has received significant attention in recent years. Estimated articulatory movements from such models can be used for many applications, including speech synthesis, automatic speech recognition, and facial kinematics for talking-head animation devices. Knowledge about the position of the articulators can also be extremely useful in speech therapy systems and in Computer-Aided Language Learning (CALL) and Computer-Aided Pronunciation Training (CAPT) systems for second language learners. Acoustic-to-Articulatory Inversion is a challenging problem due to the complexity of articulation patterns and significant inter-speaker differences. It is even more challenging when applied to non-native speakers without any kinematic training data. This dissertation attempts to address these problems through the development of upgraded architectures for Articulatory Inversion. The proposed Articulatory-WaveNet architecture is based on a dilated causal convolutional layer structure that improves Acoustic-to-Articulatory Inversion estimates in both speaker-dependent and speaker-independent scenarios. The system has been evaluated on the ElectroMagnetic Articulography corpus of Mandarin Accented English (EMA-MAE), consisting of 39 speakers including both native English speakers and Mandarin-accented English speakers. Results show that Articulatory-WaveNet significantly improves the performance of the speaker-dependent and speaker-independent Acoustic-to-Articulatory Inversion systems compared to previously reported results.
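
The abstract above names a dilated causal convolutional layer structure without giving details. As a rough illustration only, the following sketch shows what a single dilated causal 1-D convolution computes and how the receptive field grows when such layers are stacked. The function names, the zero padding before the first sample, and the example dilation schedule are assumptions made for this sketch, not details taken from the dissertation.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero (an assumed padding choice),
    so each output depends only on the current and past inputs.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # skip taps that would reach before the signal start
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of a stack of layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5]
    print(causal_dilated_conv1d(signal, [1, 1], dilation=2))
    # Hypothetical dilation schedule 1, 2, 4, 8 with kernel size 2.
    print(receptive_field(2, [1, 2, 4, 8]))
```

Doubling the dilation at each layer lets the receptive field grow exponentially with depth while the number of weights grows only linearly, which is the usual motivation for this layer type.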

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances and involves both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized using the given acoustic signals with speech recognition, grapheme-to-phoneme (G2P), and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that took acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and the vocal tract length of the synthetic utterances.
Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithms and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural network based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both of these methods not only reproduced the articulatory processes but also reproduced the acoustic signals of reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and made the estimated articulatory trajectories more articulatorily preferred by VTL, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural network based ACS systems trained using German data could be generalized to the utterances of other languages.
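
The abstract above mentions two quantities that are easy to make concrete: the cosine distance minimized in the analysis-by-synthesis loop and a smoothness loss on articulatory trajectories. The sketch below is only illustrative. The cosine distance follows the standard definition, while the smoothness penalty is written as a mean squared first difference, which is an assumption because the abstract does not define the loss. All function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle) between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention when a vector has no direction
    return 1.0 - dot / (norm_a * norm_b)


def smoothness_penalty(trajectory):
    """Mean squared frame-to-frame change of one articulatory parameter.

    Assumed form of a smoothness loss; smaller values mean smoother motion.
    """
    diffs = [(b - a) ** 2 for a, b in zip(trajectory, trajectory[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


if __name__ == "__main__":
    natural = [0.2, 0.5, 0.1]    # made-up acoustic feature vector
    synthetic = [0.1, 0.6, 0.1]  # made-up feature vector of a candidate
    print(cosine_distance(natural, synthetic))
    print(smoothness_penalty([0.0, 0.1, 0.3, 0.2]))
```

In an optimization loop of the kind described, a candidate's fitness could combine such a distance with a weighted regularization term, although the actual weighting used in the thesis is not stated in the abstract.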

    Articulatory representations to address acoustic variability in speech

    The past decade has seen phenomenal improvement in the performance of Automatic Speech Recognition (ASR) systems. In spite of this vast improvement in performance, the state-of-the-art still lags significantly behind human speech recognition. Even though certain systems claim super-human performance, this performance is often sub-par across domains and across datasets. This gap is predominantly due to the lack of robustness against speech variability. Even clean speech is extremely variable due to a large number of factors such as voice characteristics, speaking style, speaking rate, accents, casualness, emotions and more. The goal of this thesis is to investigate the variability of speech from the perspective of speech production, put forth robust articulatory features to address this variability, and incorporate these features in state-of-the-art ASR systems in the best way possible. ASR systems model speech as a sequence of distinctive phone units like beads on a string. Although phonemes are distinctive units in the cognitive domain, their physical realizations are extremely varied due to coarticulation and lenition, which are commonly observed in conversational speech. The traditional approaches deal with this issue by performing di-, tri- or quin-phone based acoustic modeling but are insufficient to model longer contextual dependencies. Articulatory phonology analyzes speech as a constellation of coordinated articulatory gestures performed by the articulators in the vocal tract (lips, tongue tip, tongue body, jaw, glottis and velum). In this framework, acoustic variability is explained by the temporal overlap of gestures and their reduction in space. In order to analyze speech in terms of articulatory gestures, the gestures need to be estimated from the speech signal. The first part of the thesis focuses on a speaker independent acoustic-to-articulatory inversion system that was developed to estimate vocal tract constriction variables (TVs) from speech.
The mapping from acoustics to TVs was learned from the multi-speaker X-ray Microbeam (XRMB) articulatory dataset. Constriction regions from TV trajectories were defined as articulatory gestures using articulatory kinematics. The speech inversion system combined with the TV kinematics based gesture annotation provided a system to estimate articulatory gestures from speech. The second part of this thesis deals with the analysis of the articulatory trajectories under different types of variability such as multiple speakers, speaking rate, and accents. It was observed that speaker variation degraded the performance of the speech inversion system. A Vocal Tract Length Normalization (VTLN) based speaker normalization technique was therefore developed to address the speaker variability in the acoustic and articulatory domains. The performance of speech inversion systems was analyzed on an articulatory dataset containing speaking rate variations to assess whether the model was able to reliably predict the TVs in challenging coarticulatory scenarios. The performance of the speech inversion system was also analyzed in cross-accent and cross-language scenarios through experiments on a Dutch and British English articulatory dataset. These experiments provide a quantitative measure of the robustness of the speech inversion systems to different kinds of speech variability. The final part of this thesis deals with the incorporation of articulatory features in state-of-the-art medium vocabulary ASR systems. A hybrid convolutional neural network (CNN) architecture was developed to fuse the acoustic and articulatory feature streams in an ASR system. ASR experiments were performed on the Wall Street Journal (WSJ) corpus. Several articulatory feature combinations were explored to determine the best feature combination. Cross-corpus evaluations were carried out to evaluate the WSJ-trained ASR system on TIMIT and on another dataset containing speaking rate variability.
Results showed that combining articulatory features with acoustic features through the hybrid CNN improved the performance of the ASR system in matched and mismatched evaluation conditions. The findings of this dissertation indicate that articulatory representations extracted from acoustics can be used to address acoustic variability in speech observed due to speakers, accents, and speaking rates, and can further be used to improve the performance of Automatic Speech Recognition systems.

    Articulatory features for conversational speech recognition


    Recent development of the HMM-based speech synthesis system (HTS)

    A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named “HMM-based speech synthesis system (HTS)” to provide a research and development toolkit for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans.

    Modelling Speech Dynamics with Trajectory-HMMs

    Institute for Communicating and Collaborative Systems. The conditional independence assumption imposed by hidden Markov models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending the first and second-order regression coefficients to the observation feature vectors. Although this leads to improved performance in recognition tasks, we argue that a straightforward use of dynamic features in HMMs will result in an inferior model, due to the incorrect handling of dynamic constraints. In this thesis I will show that an HMM can be transformed into a Trajectory-HMM capable of generating smoothed output mean trajectories, by performing a per-utterance normalisation. The resulting model can be trained by either maximising model log-likelihood or minimising mean generation errors on the training data. To combat the exponential growth of paths in searching, the idea of delayed path merging is proposed and a new time-synchronous decoding algorithm built on the concept of token-passing is designed for use in the recognition task. The Trajectory-HMM brings a new way of sharing knowledge between speech recognition and synthesis components, by tackling both problems in a coherent statistical framework. I evaluated the Trajectory-HMM on two different speech tasks using the speaker-dependent MOCHA-TIMIT database. The first task used it as a generative model to recover articulatory features from the speech signal, where the Trajectory-HMM was used in a complementary way to the conventional HMM modelling techniques, within a joint Acoustic-Articulatory framework. Experiments indicate that the jointly trained acoustic-articulatory models are more accurate (having a lower Root Mean Square error) than the separately trained ones, and that Trajectory-HMM training results in greater accuracy compared with conventional Baum-Welch parameter updating.
In addition, the Root Mean Square (RMS) training objective proves to be consistently better than the Maximum Likelihood objective. However, experiments on the phone recognition task show that the MLE-trained Trajectory-HMM, while retaining the attractive properties of a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses, and does not perform better than a conventional HMM. We use this to build an argument that models giving a better fit on training data may suffer a reduction of discrimination by being too faithful to the training data. Finally, experiments using triphone models show that increasing modelling detail is an effective way to leverage modelling performance with little added complexity in training.
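
The abstract above refers to the common practice of appending first and second-order regression coefficients (often called delta and delta-delta features) to each observation vector. The sketch below shows one widely used regression formula for a single feature dimension. The window size of 2, the edge handling by repeating the first and last frames, and the function names are assumptions chosen for illustration and are not taken from the thesis.

```python
def delta(seq, window=2):
    """First-order regression coefficients of a scalar feature track.

    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2),
    with out-of-range frames replaced by the nearest edge frame.
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(seq) - 1
    out = []
    for t in range(len(seq)):
        num = 0.0
        for n in range(1, window + 1):
            ahead = seq[min(t + n, last)]
            behind = seq[max(t - n, 0)]
            num += n * (ahead - behind)
        out.append(num / denom)
    return out


def append_dynamic_features(seq, window=2):
    """Return (static, delta, delta-delta) triples for every frame."""
    d1 = delta(seq, window)
    d2 = delta(d1, window)  # second order: regression over the deltas
    return list(zip(seq, d1, d2))


if __name__ == "__main__":
    track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up feature values
    for frame in append_dynamic_features(track):
        print(frame)
```

Because each delta is a fixed linear function of neighbouring static values, the static and dynamic parts of the augmented vector are not independent. This is the kind of dynamic constraint the abstract says a plain HMM ignores.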

    Learning How to Speak: Imitation-Based Refinement of Syllable Production in an Articulatory-Acoustic Model

    Philippsen A, Reinhart F, Wrede B. Learning How to Speak: Imitation-Based Refinement of Syllable Production in an Articulatory-Acoustic Model. Presented at the Fourth Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob), Genoa, Italy. This paper proposes an efficient neural network model for learning the articulatory-acoustic forward and inverse mapping of consonant-vowel sequences including coarticulation effects. It is shown that the learned models can generalize vowels as well as consonants to other contexts and that the need for supervised training examples can be reduced by refining initial forward and inverse models using acoustic examples only. The models are initially trained on smaller sets of examples and then improved by presenting auditory goals that are imitated. The acoustic outcomes of the imitations together with the executed actions provide new training pairs. It is shown that this unsupervised and imitation-based refinement significantly decreases the error of the forward as well as the inverse model. Using a state-of-the-art articulatory speech synthesizer, our approach allows us to reproduce the acoustics from learned articulatory trajectories, i.e. we can listen to the results and rate their quality by error measures and perception.

    Speaker-Specific Adaptation of Maeda Synthesis Parameters for Auditory Feedback

    The Real-time Articulatory Speech Synthesizer (RASS) is a research tool in the Marquette Speech and Swallowing lab that simultaneously collects acoustic and articulatory data from human participants. The system is used to study acoustic-to-articulatory inversion, articulatory-to-acoustic synthesis mapping, and the effects of real-time acoustic feedback. Electromagnetic Articulography (EMA) is utilized to collect position data via sensors placed in a subject’s mouth. These kinematic data are then converted into a set of synthesis parameters that controls an articulatory speech synthesizer, which in turn generates an acoustic waveform matching the associated kinematics. Independently from RASS, the synthesized acoustic waveform can be further modified before it is returned to the subject, creating the opportunity for involuntary learning through controlled acoustic feedback. In order to maximize the impact of involuntary learning, the characteristics of the synthetically generated speech need to closely match those of the participant. A number of synthesis parameters cannot be directly controlled by subjects’ articulatory movements, such as fundamental frequency and parameters corresponding to physiological measures such as vocal tract length and overall vocal tract size. The goal of this work is to develop a mechanism for automatically determining RASS internal synthesis parameters that provide the closest synthesis parameter match to a subject’s acoustic characteristics, ultimately increasing the system’s positive effect on involuntary learning. The methods detailed in this thesis examine the effects of altering both time-independent and time-dependent synthesis parameters to increase the acoustic similarity between subjects’ real and synthesized speech. The fundamental frequency and first two formant values are studied in particular across multiple vowels to determine the time-independent parameter settings.
Time-dependent parameter analysis is performed through the use of a real-time parameter-tracking configuration. Results of this work provide a way of adapting the Maeda synthesis parameters in RASS to be speaker-specific and of individualizing the study of auditory feedback. This investigation will allow researchers to better customize the RASS system for individual subjects and alter involuntary learning outcomes.